Monday, February 12, 2007

Yipes it's Y! Pipes

Super cool: there is no reason that a human should need to handwrite HTTP/XML/mashup/filtering logic for simple cases. Even with the highest-level toolkit, handwritten code still takes time, introduces bugs, and needs to be hosted somewhere.
Systems like this are about moving toward a declarative specification for extracting semantics from web services (in this case RSS).
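To make "declarative specification" a bit more concrete, here is a minimal Python sketch of the idea, not how Y! Pipes itself is built. The pipeline is plain data (a list of steps), and a tiny interpreter applies it to RSS items. The sample feed, the step names (`filter`, `sort`, `truncate`) and the function names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Pipes launches</title><link>http://example.com/1</link></item>
  <item><title>Weather report</title><link>http://example.com/2</link></item>
  <item><title>More on Pipes and mashups</title><link>http://example.com/3</link></item>
</channel></rss>"""

# The "program" is data: a list of steps rather than handwritten control flow.
# These op names are invented for this sketch.
PIPELINE = [
    {"op": "filter", "field": "title", "contains": "pipes"},
    {"op": "sort", "field": "title"},
    {"op": "truncate", "count": 10},
]


def parse_items(rss_text):
    """Turn an RSS 2.0 document into a list of {tag: text} dicts, one per <item>."""
    root = ET.fromstring(rss_text)
    return [
        {child.tag: (child.text or "") for child in item}
        for item in root.iter("item")
    ]


def run_pipeline(items, steps):
    """Apply each declarative step to the item list, in order."""
    for step in steps:
        if step["op"] == "filter":
            # Case-insensitive substring match on one field.
            needle = step["contains"].lower()
            items = [i for i in items if needle in i.get(step["field"], "").lower()]
        elif step["op"] == "sort":
            field = step["field"]
            items = sorted(items, key=lambda i: i.get(field, ""))
        elif step["op"] == "truncate":
            items = items[: step["count"]]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return items


result = run_pipeline(parse_items(SAMPLE_RSS), PIPELINE)
for item in result:
    print(item["title"], item["link"])
```

Changing what the mashup does means editing `PIPELINE`, not the interpreter, which is roughly the appeal of wiring boxes together in a visual editor.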

This particular implementation is a bit fancy on the graphics, which makes it run slowly, and it seems able to extract data only from RSS. That is, if you try it out, it expects every URL "fetch" result to look like an RSS-formatted collection of "somethings" ... which is nice, but it would be cool if you could also process XML from REST queries, or build SOAP queries as well. My first inclination was to ask for some kind of RegEx widget, but perhaps the Y! Pipes team intentionally doesn't want to let us go down that route ... over time they want more structure in the data, not less. They probably feel that RegEx has already been done in the HTML scraping world, although there is certainly lots more work to do there.

If you are interested in this stuff, check out some other approaches and flavors of this notion too:

- Dapper, which tries to build web services on top of any web page as a data source. These guys have a "virtual browser" that lets you point and click your way through existing pages to build a service

- Kapow and OpenKapow -- enterprise and "free online" design tools for scraping, mixing, mashing and republishing the web

- QL2: an "old-school" enterprise software product used for industrial-strength scraping. It implements a query language so that you can treat the web data sources being used as a virtual database (!). Frighteningly enough for an "unstructured data" query tool, this system is used in some large mission-critical apps

- YubNub: something like a souped-up version of wget, it lets you define "commands" (aka abbreviations) for issuing web queries, can substitute parameters, and can pipe things together. It's arbitrarily extensible since you can always write a servlet/ashx/&c. to provide any data access or transformation you might want. On the other hand, it's more about plaintext (or human-readable output anyway) than XML
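
The YubNub-style "command" idea from the last item can be sketched in a few lines of Python. This is only an illustration of abbreviation plus parameter substitution, not YubNub's actual code, and it leaves out piping. The command table, the `%s` placeholder convention as used here, and the `expand` function are assumptions for the example.

```python
from urllib.parse import quote_plus

# Hypothetical command table: abbreviation -> URL template.
# "%s" marks where the URL-encoded arguments are substituted.
COMMANDS = {
    "g": "http://www.google.com/search?q=%s",
    "wp": "http://en.wikipedia.org/wiki/Special:Search?search=%s",
}


def expand(command_line, commands=COMMANDS):
    """Expand 'abbrev some arguments' into the URL the command stands for."""
    name, _, args = command_line.strip().partition(" ")
    if name not in commands:
        raise KeyError(f"unknown command: {name}")
    return commands[name].replace("%s", quote_plus(args.strip()))


print(expand("g yahoo pipes"))
```

Adding a new command is just adding an entry to the table, and the entry can point at any URL, including a small service you host yourself to do the transformation.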
