I gave a lightning talk last night at the SF Bay Area App Engine Developers, showing some work I've been doing to represent gnarly legacy records in App Engine so as to maintain source fidelity, minimize upfront analysis, and make them easy to integrate with other systems.
I started with an XML record that I wanted to parse and represent in the datastore -- without knowing in advance which tags and structures would be present, since the format had, ahem, evolved into obscurity over time, as often happens with real-world legacy records.
Before I talk about my approach, here's why I thought this effort might interest the group: a lot of data structures share a tree structure with XML. From C structs and file blocks that include a header telling you which types to cast the next n bytes to (and so on inside of those) ... to mainframe "structured data" records I've encountered, which consist of nested records, parsed recursively, their meanings occasionally opaque, lost to history, or belonging to some partner company.
My approach -- which is simply to create a mapping of how to assemble and disassemble the records -- lets an entire record be stored in a single App Engine datastore entity. But not as an opaque block (or blob) -- rather as fine-grained, addressable fields that are easy to work with through the GAE Datastore API.
In my case, since my original was XML, I created a mechanism similar to a tiny subset of XPath: a key describing the sequence of tags where a data element lived, with the characters changed so that it would be Python- and GAE-friendly. That is, instead of "/foo/bar/baz" I used a key like _Foo_Bar__2_Baz (the __2 distinguishing a repeated sibling tag).
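A minimal sketch of one way to do that flattening, using only the standard library; the function name, the capitalization rule, and the exact ordinal convention here are my own guesses at the scheme, not the author's code:

```python
import xml.etree.ElementTree as ET

def flatten(elem, prefix="", out=None):
    """Flatten an XML tree into {path_key: text} pairs.

    "/foo/bar" becomes "_Foo_Bar"; when a tag repeats among its
    siblings, an ordinal like "__2" is appended to disambiguate.
    """
    if out is None:
        out = {}
    seen = {}  # tag -> how many times we've seen it at this level
    for child in elem:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        key = prefix + "_" + child.tag.capitalize()
        if seen[child.tag] > 1:
            key += "__%d" % seen[child.tag]
        if child.text and child.text.strip():
            out[key] = child.text.strip()
        flatten(child, key, out)
    return out

doc = ET.fromstring(
    "<catalog><product><name>Widget</name>"
    "<price>9.99</price><price>8.99</price></product></catalog>")
pairs = flatten(doc)
# pairs == {'_Product_Name': 'Widget',
#           '_Product_Price': '9.99',
#           '_Product_Price__2': '8.99'}
```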
This let me "flatten" the XML into a set of key-value pairs, while allowing that the XML might contain arbitrary structures injected by others ... and that I might want to inject extra structures of my own. This arrangement is a perfect fit for Expando models in the App Engine Datastore, or any similar store (e.g. Hypertable, which is modeled after BigTable, or Microsoft SQL Data Services, which uses SQL Server 2008's sparse columns to similar effect).
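On App Engine itself, those pairs drop straight onto an Expando model (subclass db.Expando, setattr each key, then put()). The dependency-free stand-in below just illustrates the dynamic-attribute mechanic that makes this work; the class name is mine:

```python
# Stand-in for google.appengine.ext.db.Expando: it accepts arbitrary
# attribute names, which is what lets each flattened key become an
# individually addressable property rather than part of one big blob.
class ExpandoStandIn(object):
    def __init__(self, **properties):
        for name, value in properties.items():
            setattr(self, name, value)

pairs = {"_Product_Name": "Widget", "_Product_Price": "9.99"}
rec = ExpandoStandIn(**pairs)
# Each flattened key is now a normal attribute:
print(rec._Product_Name)
```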
So now I can store and retrieve my records. Any fields or subrecords that I understand and care about are easy to work with from other systems, by mapping to the appropriate "key" in the stored record.
For example, if I'm storing a bunch of catalog data, and another system just cares about enumerating each "Product" with "Name" and "Price," then I can create a facade or wrapper in GAE that maps, say, Price to _Strange_Old_Way_To_Represent_Current_Price, and we're all set.
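A sketch of such a facade; FIELD_MAP, ProductFacade, and the stand-in record class are illustrative names of my own, not anything from a real schema:

```python
# Maps the clean field names a consuming system wants onto the
# flattened legacy keys stored in the record.
FIELD_MAP = {
    "Name": "_Product_Name",
    "Price": "_Strange_Old_Way_To_Represent_Current_Price",
}

class ProductFacade(object):
    """Wraps a stored record so consumers see only Name and Price."""
    def __init__(self, record):
        self._record = record  # an Expando entity, or any object with attrs

    def __getattr__(self, name):
        # Called only for attributes not found normally, i.e. the
        # clean names; translate them to the legacy keys.
        try:
            return getattr(self._record, FIELD_MAP[name])
        except KeyError:
            raise AttributeError(name)

# Usage with a bare stand-in record:
class _Rec(object):
    pass

rec = _Rec()
rec._Product_Name = "Widget"
rec._Strange_Old_Way_To_Represent_Current_Price = "9.99"

product = ProductFacade(rec)
print(product.Name, product.Price)  # -> Widget 9.99
```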
To be sure, there could be performance issues if you tried to use this to run arbitrary queries and reports against the data. But that's not really its purpose, and in my experience, if there are no "shortcuts" for processing these legacy records, then the business folks aren't used to being able to make an OLAP cube out of them either. (They probably have a batch or offline extraction process.)
Nonetheless, it's another tool in our chest when we need to work with systems and data that have been out in enough real-world battles to come home scarred with lots of cruft.