Tuesday, February 20, 2007

Grepping in PowerShell

I originally wrote this for the company wiki the other day, and thought it might be useful to a wider audience. The context is parsing and processing an iTunes library.xml file (just a one-off task), which I thought might be a fun and educational opportunity to slice, dice, and ... how does that Ron Popeil commercial go? ... with PowerShell.

PowerShell is the new shell for Windows. It is new and supported, but not "official" in the sense that it doesn't ship with Vista, although I'm guessing it will ship in the Longhorn Server rev.

If you're used to Unix shells, then you'll probably be floored by the power of PowerShell and somewhat annoyed by the syntax, which, despite liberal aliases to familiar things like ls, takes some getting used to.

.NET Framework integration means you can easily access any object in the .NET base class library, and there are some shortcuts, such as the [xml] type cast, that do some of the work for you too. The canonical example seems to be this one, a quickie RSS reader:

$wc = new-object System.Net.WebClient
$rssdata = [xml]$wc.DownloadString('http://foo.bar/rss.xml')
write-host $rssdata.rss.channel.title
$rssdata.rss.channel.item | foreach { write-host $_.title }

Since the source file is XML, I had thought the XML parsing would come in handy, but it turned out that there was no real data model to the XML. Basically, the item list is just a big nested map structure (key-value pairs in blocks). Sort of XML for the "takes-void*-returns-void*" crowd. So grep looked promising instead, because each key and its value (and their tags) sat together on a single line.
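To make that concrete, here is a rough sketch of what such lines look like and why a per-line test is enough. The track values are made up, and the exact layout is an assumption inferred from the regex used further down rather than copied from a real library file.

```powershell
# Hypothetical sample lines, shaped like the key/value pairs in the library file.
$sampleLines = @(
    "`t`t`t<key>Name</key><string>Some Song</string>",
    "`t`t`t<key>Artist</key><string>Some Artist</string>",
    "`t`t`t<key>Album</key><string>Some Album</string>"
)

# Each key shares a line with its value, so testing one line at a time finds the field.
$artistLines = @($sampleLines | Where-Object { $_ -match '<key>Artist</key>' })
$artistLines
```

Only the line carrying the Artist key comes out the other end, value and all.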

Grepping is a little counterintuitive with PowerShell because the pipeline between cmdlets carries full-on objects, not strings. If you just want text, you can use Get-Content, which conveniently emits its output as string objects, one per line. Here's an example I came up with after struggling a little bit to get grep-like functionality. I throw a sort and unique on here for fun:

Get-Content Library.xml | ForEach-Object { if ($_ -match [regex]"(?<=Artist\<.{13}).*(?=\<\/)" ) { $matches[0] }} | Sort-Object | Get-Unique | Out-File lib.txt

Many of these things can be abbreviated too, so if you want your script to read a little tighter, you can use

gc Library.xml | % { if ($_ -match [regex]"(?<=Artist\<.{13}).*(?=\<\/)") { $matches[0] }} | sort | unique

Isn't that sweet?
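Both versions lean on Get-Content handing the pipeline one string per line. A quick way to convince yourself of that is the sketch below, which uses a throwaway temp file with made-up contents instead of the real library.

```powershell
# Write two lines to a temporary file (hypothetical contents, just for the demo).
$tempFile = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tempFile -Value "first line", "second line"

# Get-Content emits one object per line of the file.
$lines = @(Get-Content $tempFile)
$lineCount = $lines.Count
$lineType = $lines[0].GetType().FullName

"$lineCount objects of type $lineType"

# Clean up the temporary file.
Remove-Item $tempFile
```

Because each object is a plain string, -match inside ForEach-Object sees exactly one line of the file at a time.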

Since the regex uses zero-width lookahead and lookbehind assertions instead of extracting a marked subexpression, I'm curious if anyone has input on whether one approach is faster / better / shinier than the other.

My first guess is that they are similar, since my first cut at implementing lookahead + lookbehind would probably be to match the whole outer expression while naming the non-zero-width part in the middle, and then returning that part's value as the match.
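For anyone who wants to poke at the question, here is a minimal sketch that runs both styles against the same line and times them with Measure-Command. The sample line, the group name artist, and the iteration count are all assumptions of mine, and a loop like this is only a rough comparison, not a proper benchmark.

```powershell
# Hypothetical sample line, shaped like the artist lines in the library file.
$line = "`t`t`t<key>Artist</key><string>Some Artist</string>"

# Approach 1: lookbehind and lookahead, so the whole match is the artist name.
$lookaround = [regex]'(?<=Artist<.{13}).*(?=</)'
# Approach 2: match the surrounding text as well and pull out a named group.
$captured = [regex]'Artist<.{13}(?<artist>.*)</'

$fromLookaround = $lookaround.Match($line).Value
$fromGroup = $captured.Match($line).Groups['artist'].Value

# Rough timing over many repetitions; the numbers will vary from machine to machine.
$iterations = 100000
$timeLookaround = Measure-Command { for ($i = 0; $i -lt $iterations; $i++) { [void]$lookaround.Match($line) } }
$timeGroup = Measure-Command { for ($i = 0; $i -lt $iterations; $i++) { [void]$captured.Match($line) } }

"lookaround:  $fromLookaround ($($timeLookaround.TotalMilliseconds) ms)"
"named group: $fromGroup ($($timeGroup.TotalMilliseconds) ms)"
```

Both approaches should hand back the same text for a line like this one; which is faster is the part I'd still like to hear about.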

