Run a simple MapReduce job in a Hadoop pseudo-distributed setup

Posted By : Rohan Jain | 17-Nov-2014

"In this blog I will describe, how you can run a simple map reduce job in a single-node Hadoop cluster. for this i am going to use a WordCountexample which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occured, separated by a tab. - if you don't have hadoop setup, Read -

 

http://www.oodlestechnologies.com/blogs/Install-%26-Configure-Apache-Hadoop-2.x.x-On-Ubuntu-%28Single-Node-Cluster-or-Pseudo-Distributed-Setup%29 "
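
Before touching Hadoop at all, you can get a feel for what WordCount computes using plain Unix tools. This is only a rough local sketch (the bundled WordCount tokenizes lines with Java's StringTokenizer, so punctuation handling may differ slightly):

# rough local equivalent of WordCount (assumes pg20417.txt, downloaded below, is in the current directory)
tr -s '[:space:]' '\n' < pg20417.txt | sort | uniq -c | awk '{printf "%s\t%s\n", $2, $1}' | head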

The first step in running a MapReduce job is to have some example input data (a large text file).

1. You can download one using the following links (choose the plain-text format):

http://www.gutenberg.org/cache/epub/20417/pg20417.txt
http://www.gutenberg.org/cache/epub/5000/pg5000.txt

or you can supply your own input text file.

Store the file in a local directory of your choice (e.g. /home/rohan/hadoopinput, the path used in the commands below).
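
For example, to download the first of the files above into that directory:

mkdir -p /home/rohan/hadoopinput
cd /home/rohan/hadoopinput
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt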

2. Log in as hduser and change to the directory /usr/local/hadoop/bin.

Start your Hadoop cluster if it’s not running already.
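
On a Hadoop 2.x installation like the one from the setup guide above, you can start HDFS and YARN with the sbin scripts and then verify the daemons with jps:

/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh
# jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
jps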

3. Create a directory in HDFS.

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -mkdir /hadoopinput
 

Check for your newly created directory:

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -ls /
 

4. Copy the local example data to HDFS. Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS:

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -copyFromLocal /home/rohan/hadoopinput/* /hadoopinput
 

Check your HDFS directory for the input file:

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -ls /hadoopinput
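
Optionally, confirm the file size as well:

hadoop fs -du -h /hadoopinput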
 

5. Run the MapReduce job.

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /hadoopinput /hdfsoutput
 

This command will read all the files in the HDFS directory /hadoopinput, process them, and store the result in the HDFS output directory you specified (/hdfsoutput).
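
Two things are worth knowing about this command: the examples jar bundles several demo programs besides wordcount (running the jar without arguments prints the list), and the job will fail if the output directory already exists, so remove it first when re-running:

# list the example programs bundled in the jar
hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar
# delete a previous output directory before re-running the job
hadoop fs -rm -r /hdfsoutput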

Output of the previous command:

14/11/17 13:43:17 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/11/17 13:43:18 INFO input.FileInputFormat: Total input paths to process : 1
14/11/17 13:43:18 INFO mapreduce.JobSubmitter: number of splits:1
14/11/17 13:43:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416199619656_0004
14/11/17 13:43:19 INFO impl.YarnClientImpl: Submitted application application_1416199619656_0004
14/11/17 13:43:19 INFO mapreduce.Job: The url to track the job: http://rohan-Vostro-3446:8088/proxy/application_1416199619656_0004/
14/11/17 13:43:19 INFO mapreduce.Job: Running job: job_1416199619656_0004
14/11/17 13:43:26 INFO mapreduce.Job: Job job_1416199619656_0004 running in uber mode : false
14/11/17 13:43:26 INFO mapreduce.Job:  map 0% reduce 0%
14/11/17 13:43:34 INFO mapreduce.Job:  map 100% reduce 0%
14/11/17 13:43:42 INFO mapreduce.Job:  map 100% reduce 100%
14/11/17 13:43:43 INFO mapreduce.Job: Job job_1416199619656_0004 completed successfully
14/11/17 13:43:43 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=267013
        FILE: Number of bytes written=728217
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=661905
        HDFS: Number of bytes written=196183
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=6012
        Total time spent by all reduces in occupied slots (ms)=5116
        Total time spent by all map tasks (ms)=6012
        Total time spent by all reduce tasks (ms)=5116
        Total vcore-seconds taken by all map tasks=6012
        Total vcore-seconds taken by all reduce tasks=5116
        Total megabyte-seconds taken by all map tasks=6156288
        Total megabyte-seconds taken by all reduce tasks=5238784
    Map-Reduce Framework
        Map input records=12760
        Map output records=109844
        Map output bytes=1086544
        Map output materialized bytes=267013
        Input split bytes=98
        Combine input records=109844
        Combine output records=18039
        Reduce input groups=18039
        Reduce shuffle bytes=267013
        Reduce input records=18039
        Reduce output records=18039
        Spilled Records=36078
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=92
        CPU time spent (ms)=7090
        Physical memory (bytes) snapshot=439283712
        Virtual memory (bytes) snapshot=1424547840
        Total committed heap usage (bytes)=355467264
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=661807
    File Output Format Counters 
        Bytes Written=196183
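
While the job is running you can also track it in the ResourceManager web UI at http://rohan-Vostro-3446:8088 (the tracking URL in the log above points there), or list running applications from the command line:

yarn application -list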
 

Check your HDFS output directory for the result files:

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -ls /hdfsoutput
 
Found 2 items
-rw-r--r--   1 hduser supergroup          0 2014-11-06 23:21 /hdfsoutput/_SUCCESS
-rw-r--r--   1 hduser supergroup      40923 2014-11-06 23:21 /hdfsoutput/part-r-00000
 

You can see the result, which is stored in the HDFS directory /hdfsoutput (the empty _SUCCESS file is just a marker indicating that the job finished without errors):

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -cat  /hdfsoutput/part-r-00000
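
The full result file is fairly long, so you may want to look at just the first few lines instead:

hadoop fs -cat /hdfsoutput/part-r-00000 | head -20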
 

I used http://www.gutenberg.org/cache/epub/20417/pg20417.txt as the input file.

The output of the previous command looks like this (only a part of the output is shown):

works--the    1
works.    6
works;    1
world    45
world!    1
world's    1
world,    11
world--even    1
world--weighs    1
world-cloud    1
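
If you want the complete result on the local file system, copy it out of HDFS (the local destination path here is just an example):

# copy the result file from HDFS to the local file system
hadoop fs -copyToLocal /hdfsoutput/part-r-00000 /home/rohan/wordcount-result.txt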

 
