Wednesday, March 19, 2008

Stored Procedures and Code in the Cloud

For modest-sized Internet applications, the allure of cloud services has two elements.

First, there's simplicity of implementation and maintenance -- the hassle of real-world ops is the sort of problem startup CEOs dream of having, while startup engineers (and engineering budgets) are ill-equipped to deal with it. Second is the promise of easy scalability -- another problem the CEOs dream of, and the engineers secretly hope will become Somebody Else's Problem.

Storage in the cloud is conceptually easy -- especially with ActiveRecord patterns that ignore (at their own peril, but that's an article for another day) 35 years' worth of lessons about data integrity in the relational model. And for those who need to write things more complicated than 37signals' latest masterpiece, true structured data services in the cloud are coming.

What about application logic, though? There's raw EC2, which works at the level of provisioned VM images and makes you design for clustering, manage your instances (while they're up), and keep your dynamic data somewhere else. Fabulous infrastructure, but non-trivial to use.

Folks like Heroku offer value-added application-level services on top of EC2, delivering the elasticity with less hassle.

But what about going even higher level, and defining a unit of work, or a service module, that can be deployed into a scalable container, preferably "near" the data it needs?

Real-world example: in a recent project, I needed to run Dijkstra's algorithm on big (250,000+ node) graphs held in a persistent store. It would be great to use SimpleDB or SSDS (Astoria) for storage, but what about running the algorithm? It's not practical to extract a representation of the graph over the network and then run Dijkstra on it just to find some interesting nodes, each and every time. Changing the algorithm, or using a different one? Maybe ... But what I really wanted was to create a small module that I could ship over to the data and run there. Even better, I'd like to be able to compute on the data "in place" in the storage facility, rather than extract it. Conceptually a bit like a stored procedure.
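To make that concrete, here's a rough sketch of what the shipped module's core might look like in Python. The `store.neighbors()` call is entirely hypothetical -- a stand-in for whatever graph-traversal API a storage service might expose to code running next to the data:

```python
import heapq

def dijkstra(store, source):
    """Single-source shortest paths over a graph held in `store`.

    `store.neighbors(node)` is assumed to yield (neighbor, edge_weight)
    pairs; the point is that this runs next to the data instead of
    shipping a 250,000-node graph over the network first.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale entry left behind by an earlier improvement
        for neighbor, weight in store.neighbors(node):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist
```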

I believe that the solution -- and an easier way to start shipping computing into a cloud facility -- is to create a module definition that one can code to, and then just upload. I think Python or Ruby would be ideal languages: they're popular, truly cross-platform, and not encumbered with IP issues. Plus, the modules would be provided as source, so they could be scanned for, e.g., insecure or computationally intensive uses of stuff like eval. In fact, given Google's investment in Python and in interesting tooling in general, they may already be most of the way there.
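And here, equally hypothetical, is what such an uploadable module contract might look like. Every name below is invented for the sketch, and dijkstra() is the function from above, assumed to be packaged in the same module:

```python
# Purely illustrative module contract: the container imports the source,
# reads the declared metadata, and calls run() with a handle to the data
# the module was deployed next to. None of these names come from a real
# service.

MODULE_NAME = "nearby_nodes"
MAX_RUNTIME_SECONDS = 60   # a hint the container could use for scheduling

def run(store, params):
    """Entry point invoked by the container, next to the data.

    Returns the nodes within params['radius'] of params['source'],
    reusing the dijkstra() sketch above.
    """
    distances = dijkstra(store, params["source"])
    return [node for node, d in distances.items()
            if d <= params["radius"]]
```

Because the module arrives as source, the container could vet it statically -- reject eval, file I/O, and the like -- before it ever runs.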

I just need a better name than "stored procedures" -- that one's not getting any points for cool.

1 comment:

Andy Steingruebl said...

Do none of the commercial grid offerings do what you'd like? Some of them have batch job submission interfaces with the ability to soak up massive amounts of CPU for relatively cheap prices...
