Blog

08 Jul / Unit testing Hadoop client code using Scala and SBT

At my company we are developing a Hadoop-based product; we use Scala extensively in combination with Akka, Play and our great open source contribution, Eventsourced.

We are very happy with this choice and everything is going pretty smoothly. We try to be as disciplined as we can about testing our code: we write unit tests both for the back-end components and for the JavaScript-based user interface. Since the product is based on Hadoop, we constantly need to test our Hadoop client code in the most realistic way possible.

By Hadoop client code I mean code that either accesses the HDFS filesystem or builds and runs a Map/Reduce job. Testing this kind of code is tough. Hadoop provides, for example, MRUnit, a unit testing framework for map reduce jobs. Moreover, the HDFS APIs can also run against a local file system, so in general it's possible to reach decent coverage of the code from a functional standpoint.

Unfortunately, this is not enough: Hadoop is a distributed platform, so it becomes very important to test the code simulating an actual distributed set-up. On the other hand, preserving the ability to run the tests in isolation, without any dependency on external systems, is extremely important for keeping the development environment clean and agile.

Fortunately, Hadoop provides full support for creating and running an embedded cluster inside your code. Even though it's not actually distributed, this embedded cluster can be run with multiple data nodes and multiple task trackers, staying as close as possible to an actual distributed configuration.

Hadoop provides two classes for achieving this: org.apache.hadoop.hdfs.MiniDFSCluster and org.apache.hadoop.mapred.MiniMRCluster. If you Google them you can find some examples of their usage, but all the examples I found were based on Maven. Being a Scala shop, I wanted to set up an SBT-based project with ScalaTest as the unit testing framework.

As a starting point I used this blog post and adapted the mini cluster example to run under SBT and ScalaTest. The tricky part was getting the right dependencies in the SBT build file; it took a lot of attempts to find the minimal set-up that does the job, as sometimes SBT can be very tricky. Another strange thing that I didn't have time to investigate further is that the tests run only with the option fork in test := true, otherwise the classpath used for running the embedded map reduce jobs is not propagated properly.

I don't want to bother you with too much writing, so below you can find the SBT build file and the ScalaTest-based example:

 
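The original build file was not preserved here, so what follows is a minimal sketch reconstructed from the description above; the project name and exact artifact versions are assumptions (the hadoop-test artifact of Hadoop 1.x is what ships the MiniDFSCluster and MiniMRCluster classes):

```scala
name := "hadoop-minicluster-test"

scalaVersion := "2.10.4"

// hadoop-test carries the MiniDFSCluster/MiniMRCluster classes
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-core" % "1.2.1",
  "org.apache.hadoop" % "hadoop-test" % "1.2.1" % "test",
  "org.scalatest"    %% "scalatest"   % "1.9.2" % "test"
)

// Without forking, the classpath is not propagated
// properly to the embedded map reduce jobs
fork in test := true
```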

and the ScalaTest code:
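The original spec was also lost; below is a hedged reconstruction of the kind of ScalaTest suite the post describes. The class name, node counts and the sample assertion are illustrative assumptions; the MiniDFSCluster and MiniMRCluster constructors used are those of the old Hadoop 1.x mapred API named above:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.MiniDFSCluster
import org.apache.hadoop.mapred.MiniMRCluster
import org.scalatest.{BeforeAndAfterAll, FlatSpec}
import org.scalatest.matchers.ShouldMatchers

class MiniClusterSpec extends FlatSpec with ShouldMatchers with BeforeAndAfterAll {

  var dfsCluster: MiniDFSCluster = _
  var mrCluster: MiniMRCluster = _
  var fileSystem: FileSystem = _

  override def beforeAll() {
    val conf = new Configuration()
    // two data nodes, to stay close to a real distributed set-up
    dfsCluster = new MiniDFSCluster(conf, 2, true, null)
    fileSystem = dfsCluster.getFileSystem
    // two task trackers running on top of the embedded HDFS
    mrCluster = new MiniMRCluster(2, fileSystem.getUri.toString, 1)
  }

  override def afterAll() {
    mrCluster.shutdown()
    dfsCluster.shutdown()
  }

  "The embedded HDFS" should "store and read back a file" in {
    val path = new Path("/test/hello.txt")
    val out = fileSystem.create(path)
    out.writeUTF("Hello, embedded cluster!")
    out.close()
    val in = fileSystem.open(path)
    in.readUTF() should be("Hello, embedded cluster!")
    in.close()
  }
}
```

With fork in test := true in place, running sbt test spins the embedded cluster up and down around the suite.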


By David Greco in Hadoop, scala
