Monday, August 19, 2013

How to deal with "ClassNotFoundException" in MapReduce Jobs

There are times when you need external or user-defined libraries in MR jobs. Just putting them on your local CLASSPATH does not help, because the map and reduce tasks run in their own JVMs on the cluster nodes.

There are a couple of ways to include them, depending on the use case:

The `hadoop` shell script builds its classpath from the core libraries under the $HADOOP_HOME/lib directory. Putting your external JAR there makes it available to the Hadoop cluster whenever it is needed; this assumes the JAR has been copied to the same pre-defined path on every cluster node. The other way is to add the paths to the HADOOP_CLASSPATH variable in hadoop-env.sh, which lives with the rest of your Hadoop configuration. On Ubuntu it is /etc/hadoop/hadoop-env.sh.

HADOOP_CLASSPATH=<COLON SEPARATED PATHS TO YOUR JARS>

The other variables available in the hadoop-env.sh file give you the option to add JARs to the classpath of specific Hadoop daemons. For example, if you want to add JARs only for the TaskTracker nodes, you can do so with:

HADOOP_TASKTRACKER_OPTS="-classpath <COLON SEPARATED PATHS TO YOUR JARS>"

Use the distributed cache facility provided by Hadoop to distribute the JAR to every TaskTracker node and make it available on the classpath every time a job is executed; it is a rudimentary software distribution mechanism. You can do so by setting up the DistributedCache while setting up the job.

DistributedCache.addFileToClassPath(new Path("<path to jar>"), job.getConfiguration());
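
In a job driver it could look something like this. This is only a minimal sketch against the old Hadoop 1.x API; the class name MyJobDriver, the job name and the HDFS path /libs/sometestlib.jar are placeholders, and the JAR is assumed to have been copied to HDFS beforehand (for example with hadoop fs -put).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my-job");
        job.setJarByClass(MyJobDriver.class);

        // The JAR is expected to already sit on HDFS; the framework copies it
        // to every TaskTracker node and adds it to the task classpath.
        DistributedCache.addFileToClassPath(
                new Path("/libs/sometestlib.jar"), job.getConfiguration());

        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}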

What doesn't work
You can also pass generic options like -libjars to hadoop commands, for example:

hadoop jar -libjars sometestlib.jar <your jar> <main class>

but it has never worked for me. Unlike other hadoop commands, the 'hadoop jar ...' command does not parse generic options itself.
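
From what I have seen, generic options like -libjars are only honored when the driver class parses them itself, typically by implementing Tool and running through ToolRunner, and when -libjars appears after the main class on the command line. A rough sketch of such a driver (MyToolDriver and the job details are placeholders for your own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyToolDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries whatever -libjars and other generic
        // options ToolRunner parsed off the command line.
        Job job = new Job(getConf(), "my-job");
        job.setJarByClass(MyToolDriver.class);
        // ... set mapper, reducer, input and output paths as usual ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyToolDriver(), args));
    }
}

It would then be invoked with -libjars after the class name, for example:

hadoop jar myjob.jar MyToolDriver -libjars sometestlib.jar <input> <output>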

You can also check out my other blogs at http://www.technologywithvineet.com
