Saturday, June 14, 2014

Running hadoop 2.2.0 wordcount example under Windows

Recently I went to Hadoop Summit in San Jose (http://hadoopsummit.org/san-jose/). The conference was quite interesting (excluding a few boring talks). I found out that HortonWorks is trying to push hard Hadoop into enterprise environment with Hadoop 2.x and Yarn. I love this idea, since there seems to be no good standard for distributed containers in Java these days (forget about JEE clustering).

Surprisingly enough, it looks like Hadoop 2.2.0 is supported natively on Windows, which IMO is a great achievement and is a sign of platform getting more mature.
In this article I show how to run a simple WordCount example in Hadoop 2.2.0 under Windows.

First of all, you need to compile hadoop 2.2.0 distribution, which takes a lot of time (and sometimes tweaking pom files). I uploaded a precompiled version here. You need to edit windows environment variables and add path to bin dir and HADOOP_HOME variable pointing to the dir.

Then you need to format node and run example. Following commands do that:

E:\test>hdfs namenode -format

E:\hadoop-2.2.0\sbin>start-dfs

E:\hadoop-2.2.0\sbin>start-yarn
starting yarn daemons

E:\test>hdfs dfs -mkdir /input

E:\test>hdfs dfs -copyFromLocal file1.txt input

E:\test>hdfs dfs -cat /input/words.txt
...

E:\test>yarn jar E:\hadoop-2.2.0\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar wordcount /input/words.txt /output
14/06/14 14:29:47 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/06/14 14:29:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/06/14 14:29:48 INFO input.FileInputFormat: Total input paths to process : 1
...

E:\test>hdfs dfs -cat  /output/part-r-00000
abc1    1
abc2    3
abc3    1

I hope that Hadoop 2.x gains wide adoption in Enterprise Environment, since the industry needs the next gen standard for distributed apps.