samedi 3 octobre 2009

File parsing in Scala with Pattern Matching and Tail-Call optimization

In the process of learning Scala, I had to find a subject for my very first "Non Hello World" program in Scala. It was a good opportunity to code a script I wanted to write a long time ago in order to automatize the sync of the podcasts I download for my Mp3 player (I have very specific use cases that are not addressed by the usual podcast downloader software).

As my player is an MTP player,the script launches some mtp-tools command line utilities of libmtp to drive the player.

In particular the mtp-tracks command lists some information for the tracks stored into the player. The output of the command looks like :

Attempting to connect device(s)
mtp-tracks: Successfully connected
Friendly name: My Zen
Track ID: 40337
Title: Java Posse #273 - Roundup 09 - Managing Technical Debt
Artist: Tor Norbye, Carl Quinn, Joe Nuxoll, Dick Wall
Genre: Podcast
Album: The Java Posse
Origfilename: Tor Norbye, Carl Quinn, Joe Nuxoll, Dick Wall-Java Posse #273 - Roundup 09 - Managing Technical Debt.mp3
Track number: 0
Duration: 3809000 milliseconds
File size 45766104 bytes
Filetype: ISO MPEG-1 Audio Layer 3
Use count: 9 times
Track ID: 41737
Title: Java Posse #272 - Newscast for July 30th 2009
Artist: Tor Norbye, Carl Quinn, Joe Nuxoll, Dick Wall
Genre: Podcast
...
So basically, in order to parse the file I must skip the header (there's a footer too) and find each track information. It looks like a job for Pattern Matching on list with a mix of RegExp matching. And as he goal is also to learn Scala (And improve my Functional Programming skills) I wanted to code the parsing without any mutable object (yes, I'm trying to follow the rules described in Programming in Scala)

So here is the part of the code that parses the mtp-tracks output :

The list of lines of the mtp-tracks command output are stored in the lines list. This recursive function can then be called with collect(lines,Nil), it returns a list of parsed MtpTrack object.

To explain the three pattern matching cases of the collect function, I must first explain how pattern matching works for regexp.

In this code, four matchers are defined for four regexp : TrackId, Title, Album and Filename. For instance, if one line matches on the expression Track(id) the first group of the regexp is captures in the id variable.

So, now, the description of the three cases after lines match
  • case TrackId(id) :: Title(title) :: _ :: _:: Album(album) :: Filename(filename) :: rest
    this one says "match on list if list first element matches the regexp TrackId, then regexp Title, then whatever twice, then regexp Album, then regexp Filename. And also, if it matches extract the rest of the list in the rest variable. Also, capture all regexp groups in the declared variables id,title,album and filename". Rather powerfull.
    Then, with the matched information we create an MtpTrack object that is added to tracks, the already collected list of MtpTrack (actually a new list is created as a List in Scala is immutable).
  • case _ :: rest
    this one is tried when the previous case does not match. It says "if list starts with any element, extract the rest of the list after this element in a rest variable". rest may also by empty.
  • case Nil
    this one is the stop clause for the recursion, it matches when the list is empty.
Thanks to the power of pattern matching composition (RexgExp in list), header and footer are automatically ignored. And as the function ends with a direct recursive call, the Scala compiler has compiled the code with tail-call optimisation, so not stack overflow, even with very large command output.

I must say that I was amazed by how much can be expressed in Scala with so little lines of code. And as I'm a beginner in Scala, there are surely even smaller and idiomatic ways to code the same functionnality.

Aucun commentaire: