Monday, January 26, 2009

Is that Service Really a Scalable Cloud or Just Full-Service Web Hosting?

A lot of cloud stacks, or cloud app platforms, promise scalability for your app, "With a little EC2 in every box!" (TM). There is a big catch and a little catch, though, and if your app gets big, then either or both of these may be a deal-breaker.

First, and most important: Running a vanilla RDBMS (e.g. MySQL) in a VM somewhere does not make it magically scalable. Read that sentence one more time.

Some cloud offerings integrate tightly with the traditional sort of DB instance you might attach to your web app on a single server. Examples include Heroku, which applies your Rails migrations to a PostgreSQL instance, and Stax, which offers MySQL.

The great thing about these environments is that they don't require significant changes to your standard app built on their supported platforms (mostly Rails and Java variants). Upload, minimal admin, and IJW (it just works).

That's turn-key, full-service web hosting, right there. It's beautiful -- in fact, in an OO and Rails course I wrote, I chose Heroku for deployment as a way to let students get something up and running on the web without getting into the operations/deployment/tuning aspects of Rails which deserve their own course.

But if your app gets large -- or just uses large datasets -- the database is rapidly going to be a bottleneck. Scaling out an app logic tier to a dozen EC2 instances automatically may sound good, but it won't do a thing for a DB-bound app (it may make it worse). And these databases don't scale out without a little architecture, planning, configuration -- all of the things which these cloud platforms are designed to avoid. And which, on some platforms, you cannot do at all.
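To make the bottleneck argument concrete, here is a deliberately crude sketch. It assumes the stack can serve no more requests than its slowest tier, and every number in it is made up for illustration rather than measured on any real platform. It also ignores the "may make it worse" case, where extra app servers pile more connections onto the same database.

```python
def max_throughput(app_servers, per_server_rps, db_rps):
    """Requests/second the whole stack can sustain: the slower tier wins."""
    return min(app_servers * per_server_rps, db_rps)


# Hypothetical figures: each app server handles 200 req/s,
# and the single database tops out at 500 req/s.
results = {n: max_throughput(n, per_server_rps=200, db_rps=500)
           for n in (1, 4, 12)}

for servers, rps in results.items():
    print(servers, "app servers ->", rps, "req/s")
```

Going from 4 to 12 app servers changes nothing in this model, because the database ceiling was already reached at 4.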

For example, so far as I can tell on Heroku or Stax, there is no way to even configure multiple servers and replication, which is just a minimum starting point for scaling a DB to multiple machines. Stax may allow for a logical sharding setup, but it's not clear how one would control which VMs and disks the databases run on. Rightscale seems like the kind of firm which would specialize in the management scripts / meta-API that one would need to automate sharding, but the sharding option doesn't appear in any of the models on their website. With replication, which Rightscale does offer (though they're not exactly an app platform, more an infrastructure play), you get to this, still limited, picture:

Other cloud platforms offer datastores specifically designed to scale out, including Google App Engine, 10gen, and others. These platforms offer a non-relational or pseudo-relational datastore, with different data access APIs and a variety of restrictions relative to what you may be used to. These datastores are architected to scale easily, but there are real tradeoffs that must be considered. In fact, if you don't know these tradeoffs cold, you are not the right person to be making this platform decision. Get on craigslist and hire (or borrow) someone who knows the stuff.

The other catch is that whichever approach you choose, these vendors are offering you convenience, some outsourced operations management, and (in some tiers) elasticity and scalability ... but they are not offering cheap compute cycles. That is, if you know you'll need a large, predictable amount of raw compute time, then know also that you're paying a premium to do that computation in one of these environments.

A friend who has designed, built and operated feature film renderfarms for a number of studios confirmed that he has, on a semi-regular basis, analyzed the costs of remote VM-based datacenters (e.g. EC2) compared to their physical ones. Because the studios use these machines intensely, and are consistently consuming raw compute power, the local physical servers have always made more sense.
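The shape of that analysis can be sketched as a simple break-even calculation. This is only a toy model under stated assumptions: rented machines cost a flat hourly rate, owned machines cost a fixed monthly amount plus a smaller hourly running cost, and all three prices below are invented for illustration, not anyone's real rates.

```python
def break_even_hours(rent_per_hour, owned_fixed_per_month, owned_per_hour):
    """Monthly usage (hours) above which owning is cheaper than renting."""
    if rent_per_hour <= owned_per_hour:
        raise ValueError("renting is never more expensive in this model")
    return owned_fixed_per_month / (rent_per_hour - owned_per_hour)


# Hypothetical prices per machine, chosen only to show the shape of the curve.
hours = break_even_hours(rent_per_hour=0.5,
                         owned_fixed_per_month=150.0,
                         owned_per_hour=0.125)
print(hours)  # 400.0 hours; a month has roughly 720
```

With these made-up numbers, a machine that is busy more than about 400 hours a month is cheaper to own, which is the renderfarm situation; a machine that idles most of the month is cheaper to rent.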

What does this have to do with your web app and datastore? Well, suppose you have designed your app to leverage a scalable datastore. These may not be tunable, may not perform fast, and may require you to do certain operations in code which traditionally are done in the DB. You may never see these slow queries or operations ... until they show up in your bill. That is, if the system is truly elastic and scalable, it will apply resources as needed to handle your work. If your query or sort or filter takes a lot of CPU cycles, the cycles will be made (almost) instantly available, so the user always sees your app perform well. And then you'll pay for all those cycles or instances at the end of the month.

Either way, there is no free lunch on the data persistence side. Which is not in itself a reason to avoid cloud environments. But it should be a bigger part of the conversation than it is today. And it absolutely must be part of the conversation, if larger businesses are going to move their services into the cloud.

Wednesday, January 21, 2009

Using AppEngine -- Or Similar Datastore -- To Integrate Complex Legacy Data Formats

I gave a lightning talk last night at the SF Bay Area App Engine Developers, showing some work I've been doing to represent gnarly legacy records in AppEngine so as to maintain source fidelity, minimize upfront analysis, and make them easy to integrate with other systems.

I had started with an XML record that I wanted to parse and represent in the datastore -- without knowing which tags and structures would be present, since this format had, ahem, evolved to obscurity over time, as often happens with real-world legacy records.

Before I talk about my approach, here's why I thought this effort might be interesting to the group: a lot of data structures have a tree structure in common with XML. From C structs and file blocks that include a header, telling which types to cast the next n bytes to (and so on inside of those) ... to mainframe "structured data" records I've encountered which consist of nested records, parsed recursively, with their meanings occasionally opaque, lost to history, or belonging to some partner company.

My approach -- which is simply to create a mapping of how to assemble and disassemble the records -- enables a record to be stored in a single App Engine record. But not as a block (or blob) -- rather with fine-grained addressable fields that are easy to talk to using the GAE Datastore API.

In my case, since my original was XML, I created a mechanism similar to a tiny subset of XPath describing the sequence of tags where a data element lived -- but with the characters changed so that it would be Python and GAE-friendly. That is, instead of "/foo/bar[2]/baz" I used _Foo_Bar__2_Baz.

This let me "flatten" the XML into a set of key-value pairs, while allowing that the XML might contain arbitrary structures injected by others ... and that I might want to inject my own extra structures. This arrangement is perfect for the Expando models in App Engine Datastore, or any similar store (e.g. Hypertable, which is modeled after BigTable, or Microsoft SQL Data Services which uses SQL 2008's sparse tables to similar effect).
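Here is a minimal sketch of that flattening step, using only the standard library. It is not the code from the talk: the function name, the sample record, and the rule that only repeated sibling tags get an index are my assumptions, chosen so that "/foo/bar[2]/baz" comes out as _Foo_Bar__2_Baz. Attributes and mixed content are ignored to keep it short.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def flatten(element, prefix="", index=None):
    """Return {path_key: text} for every element that carries text."""
    key = prefix + "_" + element.tag[:1].upper() + element.tag[1:]
    if index is not None:
        key += "__%d" % index          # 1-based, like XPath's bar[2]
    pairs = {}
    text = (element.text or "").strip()
    if text:
        pairs[key] = text
    # Only tags that repeat among their siblings get an index suffix.
    totals = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        seen[child.tag] += 1
        child_index = seen[child.tag] if totals[child.tag] > 1 else None
        pairs.update(flatten(child, key, child_index))
    return pairs


# Hypothetical record, just to exercise the key scheme.
SAMPLE = "<foo><bar><baz>a</baz></bar><bar><baz>b</baz></bar><qux>c</qux></foo>"
record = flatten(ET.fromstring(SAMPLE))
for path_key, value in sorted(record.items()):
    print(path_key, "=", value)
```

Because the walk is generic, tags nobody anticipated simply show up as extra keys instead of breaking the parse.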

So now I can store and retrieve my records. Any fields/subrecords which I understand and care about, I can easily work with from other systems, by mapping to the appropriate "key" in the stored record.

For example, if I'm storing a bunch of catalog data, and another system just cares about enumerating each "Product" with "Name" and "Price," then I can create a facade or wrapper in GAE that maps, say, Price to _Strange_Old_Way_To_Represent_Current_Price, and we're all set.
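A facade like that can be very small. The sketch below wraps a plain dict standing in for the stored record; the class name, the legacy key used for Name, and the sample values are all hypothetical, and a real version would wrap whatever entity object the datastore hands back.

```python
class ProductFacade:
    """Expose friendly attribute names over a flattened legacy record."""

    # Friendly name -> legacy key. The Name mapping is invented for this example.
    FIELD_MAP = {
        "Name": "_Catalog_Item_Desc_Short",
        "Price": "_Strange_Old_Way_To_Represent_Current_Price",
    }

    def __init__(self, record):
        self._record = record

    def __getattr__(self, name):
        # Called only for attributes not found the normal way.
        try:
            return self._record[self.FIELD_MAP[name]]
        except KeyError:
            raise AttributeError(name) from None


# A stand-in for one stored record, including a field nobody understands.
legacy = {
    "_Catalog_Item_Desc_Short": "Widget",
    "_Strange_Old_Way_To_Represent_Current_Price": "9.99",
    "_Unknown_Partner_Block": "opaque",
}
product = ProductFacade(legacy)
print(product.Name, product.Price)
```

The unmapped field stays in the record untouched, so source fidelity is preserved while the consuming system only ever sees Name and Price.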

To be sure, there could be performance issues if you tried to use this to create arbitrary queries and reports against the data. That's not really the purpose and, in my experience, if there are no "shortcuts" to processing these legacy records, then the business folks are not used to being able to make an OLAP cube out of them either. (They probably have a batch or offline extraction process.)

Nonetheless, it's another tool in our chest when we need to work with systems and data that have been out in enough real-world battles to come home scarred with lots of cruft.

Monday, January 19, 2009

Twitter's Underwhelming (Former?) Architecture Problem

I recently came across this post from May 2008 comparing Twitter traffic and the Options Price Reporting Authority data feed. Needless to say, the stock market feed is many orders of magnitude larger, at 700,000+ messages per second(!)

It's also not the fairest comparison in the world on its face, for a variety of reasons: the OPRA data system was planned (Twitter met success more or less by accident), Twitter is minimally funded, etc.

A more relevant comparison, in my opinion, is that provided by newzwag, which presented its performance challenges, triumphs, and secrets at a recent SF Ruby meetup.

newzwag's site and trivia game is built on Rails, started small, and had to grow to meet traffic driven by Yahoo and the Beijing Olympics to 9 million pageviews per hour (using a total of a half-dozen machines or so). And lest you think this is a content site served out of a cache, most of the traffic consists of data writes by game players that then need to be ranked, published, etc.

As far as I can tell, that's somewhat larger than Twitter, even considering that Twitter has grown 3-4x since last May's stats.
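For a rough sense of scale, the two figures quoted above can be put on the same per-second footing. Pageviews and market-data messages are not the same unit of work, so treat this as a back-of-the-envelope comparison and nothing more.

```python
SECONDS_PER_HOUR = 60 * 60

newzwag_pageviews_per_hour = 9_000_000   # figure quoted above
opra_messages_per_second = 700_000       # figure quoted above ("700,000+")

newzwag_per_second = newzwag_pageviews_per_hour / SECONDS_PER_HOUR
ratio = opra_messages_per_second / newzwag_per_second

print(newzwag_per_second)  # 2500.0 pageviews per second
print(ratio)               # 280.0
```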

newzwag's solutions, which they share here, are a study in sanity, reasoned problem solving, and smart efficient architecture.

Without the timelines or resources of a stock-market app, newzwag produced a nice solution that -- at least in hindsight -- appears drama-free.

Interestingly, a newzwag - Twitter comparison can be enlisted to support a variety of different startup social narratives.

One narrative is that an amateur-hour effort yields amateur-hour results, and aspiring startups shouldn't fool themselves into thinking that they won't need old-time Architecture and Sophistication to scale.

A different narrative says it doesn't matter -- if Twitter's success is your worst-case scenario, you still win. That is, build it fast, get it out there where people can try it, and you should be so lucky as to need a real re-arch to fix your scaling problems. In this model, both Twitter and newzwag played it right -- newzwag because they knew the Olympics would provide a narrower time window to showcase their system, so they managed risk against that stricter goal.

And yet another narrative says if you accept these two stories, you still wouldn't want your brokerage transaction flowing through a system built to "see what sticks," and hence Web 2.0 startup methodologies stare at mission-critical business apps from across a huge chasm.

I see this last story as persuasive but also as a big opportunity: there is a chasm, to be sure, but it needn't be quite so big. There are legacy mainframe apps that can speak webservices. Every manager in a big company wants their product to be "100% critical" even if they could create more value by admitting that a lot of nice-to-have two-nines business apps are the real bricks in the wall. If enterprises can get better at separating their Twitters from their OPRAs, they can make more money and have both.

Wednesday, January 14, 2009

New iPhone App Store Rules Take a Step Closer to Scriptable Apps

A lot of folks commented today on newly-approved web browsers appearing in the App Store. Or, more precisely, a handful of apps using the existing web browser widget to offer a slightly tweaked browser experience.

While iPhone apps could include the UIWebView component before -- and indeed this has proven a popular route to getting hybrid native/web apps up and running quickly -- today's change is about allowing apps that "duplicate" a built-in feature of the phone. And one of the fundamental characteristics of any web browser nowadays is that it is thoroughly scriptable.

If you build an app this way, it already includes a scripting environment ... so the question (since scripting and dynamic apps are verboten on un-jailbroken iPhones) is how far one can let the scripting go and still pass muster with the App Store overlords.

Using stringByEvaluatingJavaScriptFromString we can inject script into the browser ... including script that pulls data back out.

And although the JavaScript bridge is "one directional" compared to the OSX desktop API, there are workarounds such as registering a protocol handler to receive scripted "requests" from inside the page ... or by hooking decidePolicyForNavigationAction with a script-initiated navigation request (disclaimer: I haven't checked to see if this is in the phone API, but it seems plausible) to signal the availability of data.

So native code becomes effectively scriptable. Or, for an even less controversial but perhaps equally powerful route: just inject a bunch of JavaScript API libraries into the browser and keep the scripting (and more of your app) in Safari. That's not too different from pointing a browser at a web site (where the page loads various scripting libraries) ... except that underneath it all we are in native-caps mode ...

Unless I'm missing something here, a somewhat ambiguous situation has gotten thornier with the admission of this new class of general purpose browser apps.

Monday, January 12, 2009

Windows 7 Product Name is Missing a Feature

I didn't feel strongly one way or the other about the Windows 7 product name (i.e. "Windows 7") ... until recently when I wanted to troubleshoot the Azure SDK on Windows 7. (Apparently Azure on 7 has worked with the M3 build for at least one intrepid forum poster, but it's not behaving with the beta build for me at the moment.)

I started searching newsgroups, forums, blogs, etc., and realized that "Windows 7" is not a great search term.

On an engine like Google you can put quotes around it, specifying exact phrase, but some other full-text search systems don't seem to want to keep the Windows and the 7 together. Or perhaps they have an index by single words, and they link the results together to match your phrase later, but once you throw in other terms like SDK and Azure, the matching engine becomes a little more promiscuous, offering you a "promising" combo of Azure, SDK, and Windows ... or SDK and 7 ... as a higher-ranked match. Making it, in any case, rather harder to find what you want.
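I don't know how any of those search systems work inside, so the following is only a toy illustration of the difference I'm guessing at: an index keyed by single words can tell you which documents contain both "windows" and "7", but only a check on word positions keeps the two together. The function names and the two sample documents are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def all_words_match(index, query):
    """Docs containing every query word, in any order and at any distance."""
    doc_sets = [set(index.get(word, {})) for word in query.lower().split()]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase_match(index, phrase):
    """Docs where the query words appear adjacent and in order."""
    words = phrase.lower().split()
    hits = set()
    for doc_id in all_words_match(index, phrase):
        starts = index[words[0]][doc_id]
        if any(all(start + i in index[word][doc_id]
                   for i, word in enumerate(words))
               for start in starts):
            hits.add(doc_id)
    return hits


DOCS = {
    "a": "azure sdk fails on windows 7 beta",
    "b": "windows vista sdk has 7 known azure issues",
}
index = build_index(DOCS)
print(sorted(all_words_match(index, "windows 7")))  # ['a', 'b']
print(sorted(phrase_match(index, "windows 7")))     # ['a']
```

Document "b" is exactly the kind of "promising" combo described above: every term is present, but it has nothing to do with Windows 7.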

One-word product names, like "Silverlight," "Vista," and "XP" work a lot better for this kind of search.

Which is perhaps a reason that folks include the release name with the version number on products such as Ubuntu (Hardy Heron, Intrepid Ibex, etc.)

So ... what would be a good nickname to put next to Windows 7?

Ruby and Python as Cloud Lingue Franche; Ruby/Rails on 10gen

Not sure how this one slipped past me, but 10gen announced support for the Ruby language and most of the Rails framework APIs on their open-source cloud service last month.

This addition is great news for 10gen and for cloud computing (the hosted-application-platform flavor, not the hosted-hardware/datacenter flavor).

For 10gen, support for a well-known API and app model is a huge bonus, which makes it easy for people to move an app into the cloud without learning and coding to new APIs, and also lowers the perceived "lock-in" involved, should the move not work out.

Their original JavaScript platform approach, as I've written before, is problematic not only because folks are unlikely to have meaningful (for their business) apps lying around to try mounting in the cloud, but more so because there is no standard server-side JS API set. A half-dozen companies offer a JS app server or cloud and they all have different platform APIs for even the simplest things, such as reading HTTP request variables, or deleting a session.

10gen takes a big step forward, joining Stax, Heroku, and morph labs in supporting Ruby on Rails in the cloud.

This move also reinforces another emerging trend: Ruby and Python serving as lingue franche for cloud app stacks. While many cloud offerings support JavaScript or other languages, Ruby and Python seem to be emerging as the ones with broadest support: 10gen will support both; AppEngine supports Python and a language-to-be-named-later; Stax supports both; Azure will likely support IronRuby and IronPython (some Python apps can already work in Azure).

Of course, the language is only half of the battle -- there are the APIs to deal with as well, and issues will typically arise where the impedance mismatch is highest with cloud-related infrastructure. E.g., cloud databases are mostly non-relational and don't support SQL ... so an ActiveRecord or SQLAlchemy API won't work on 10gen's 'grid database' (a reasonable tradeoff for simpler scalability.)

Even so, it is starting to appear as though one could write a lot of core business logic using, say, Python, and expect it to run unmodified on most vendors' clouds. Not a bad position to be in for the Python folks.

Sunday, January 11, 2009

Another Windows 7 Milestone: Bounce vs. Hibernate vs. VM Suspend Times

Lately I've started running the Windows 7 Beta for some development experiments, using VMware's fantastic dual-screen support. As I've written before, the general experience is great, even under virtualization with 1GB of RAM, and having it wall-to-wall on multiple displays makes the illusion more convincing.

An interesting thing I've noticed is that my old habit, when I want to stop working in a VM and free up the resources, is to suspend the VM. This action is roughly (but not exactly, depending on the VM you're using) equivalent to "hibernating" a laptop (S4 power state) -- memory is mapped out to a file and the device is powered off.

I usually do this because this hibernate/wake is faster than a shutdown/boot-up, not because I'm trying to save my actual work state (open apps, etc.) Especially with Windows server, but even with XP and Linux, this approach is the faster way to hop in and out of a work session. On the laptop, it's a way to save the battery power involved in a longish hard boot.

In Windows 7, VM suspend/resume (==hibernate/wake) seems to be slower than shutdown/boot. That is, even with no user apps running (which could take up an arbitrary amount of memory and thus lengthen the map-out / map-in time), boot seems faster. I say "seems" because I have only 2 machines to play with, and they are not clean images just for this test, so I won't pretend they represent absolute objective truth.

What does this mean?

It would appear that the boot process has been cleverly streamlined so that a cold machine gets to a running, usable state before all of the additional services and apps have fully loaded and gotten running, and that this is orchestrated using knowledge that a white-box VM player doesn't have.

Some folks may point out that having to reboot an OS is itself questionable ... and indeed the boot is optional -- I only reboot my XP desktop every couple of months when some security patch or other requires a restart.

But in the world of laptops and netbooks, things are different: every minute of juice is valuable, so there's always the consideration of the cost of a hibernate/wake vs. sleep vs. leaving it on with the LCD off. And that equation has just gotten a little more interesting: for Win 7 on a laptop, if you're not going to be using the machine for a while, it may turn out to be faster and use less power to do a shutdown, and just reboot later.

While this may seem like a fairly inconsequential gimmick about boot times, it is a step in the right direction as we look at the huge array of gadgets we all use and which eat a ton of phantom power. The Windows PC is kind of the holy grail for a fully-off / instant-on experience, and Win 7 appears to take a measurable step in that direction.

Wednesday, January 07, 2009

Embarrassment of Meetup Riches and a Suggestion for Compelling Talks

In the Bay Area, we are fortunate enough to have so many great tech meetup groups that there are frequent collisions, and it's always a bummer to pass on what promises to be a great presentation.

On January 20, the SF Java group has three sharp guys presenting on Scala.

Meanwhile, down in Mountain View, the Google App Engine group has a "hacking..." talk with a handful of presentations, including an update from Google AppEngine Product Manager Pete Koomen.

What to do? Hmmmm...

Ultimately, I decided on AppEngine.

Principally, it appeared that more new, not-readily-available-today-on-the-net material would be presented at the AppEngine group. Lightning talks give a forum to quirky thinkers, very early startups, and other interesting folks, while representation from the mothership might be able to offer a little detail or timeline on upcoming features, like large BLOB support and the next language.

The Scala talk just seems less likely to include info that I can't get from existing resources.

Which leads to the following conclusion about what makes for more compelling talks, at least for an audience of me:

A focus on information that is not readily available, and which gains from the presence and experience of the speaker.

So, for example, I've seen many talks on "my cool [insert language] library that does [insert function]."

A fine topic. Now in the execution, perhaps best to talk about the problem being solved, how you solved it, what tradeoffs were made, constraints dealt with, any magic foo inside ... rather than a bunch of examples showing how clever/elegant the external API is and what one can do with it.

Not that the latter is unimportant, but the latter is (or should be?!) readily available from the online docs/examples or the presentation notes, whereas the former represents the specialized knowledge and experience of the library's creator.