Run a simple MapReduce job in a Hadoop pseudo-distributed setup
Posted By : Rohan Jain | 17-Nov-2014
"In this blog I will describe, how you can run a simple map reduce job in a single-node Hadoop cluster. for this i am going to use a WordCountexample which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occured, separated by a tab. - if you don't have hadoop setup, Read -
The first step in running a MapReduce job is to get some example input data (a large text file).
1. You can download one using the following links (choose the plain-text format):
http://www.gutenberg.org/cache/epub/20417/pg20417.txt
http://www.gutenberg.org/cache/epub/5000/pg5000.txt
Alternatively, you can supply your own input text file.
Store the file in a local directory of your choice (e.g. /home/rohan/hadoopinput, which is used in the commands below).
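For example, assuming wget is available, you could fetch the first file into that directory like this:

mkdir -p /home/rohan/hadoopinput
cd /home/rohan/hadoopinput
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt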
2. Log in as hduser and change to the directory /usr/local/hadoop/bin.
Start your Hadoop cluster if it’s not running already.
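If it is not, it can typically be started with the scripts in Hadoop's sbin directory (paths relative to /usr/local/hadoop/bin as above); jps should then list NameNode, DataNode, ResourceManager and NodeManager among the running Java processes:

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ ../sbin/start-dfs.sh
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ ../sbin/start-yarn.sh
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ jps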
3. Create a directory for the input in HDFS:
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -mkdir /hadoopinput
Check for your newly created directory:
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -ls /
4. Copy the local example data to HDFS. Before we run the actual MapReduce job, we first have to copy the file from our local file system to Hadoop's HDFS:
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -copyFromLocal /home/rohan/hadoopinput/pg20417.txt /hadoopinput
Check your HDFS directory for the input file:
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -ls /hadoopinput
5. Run the MapReduce job:
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /hadoopinput /hdfsoutput
This command reads all the files in the HDFS directory /hadoopinput, processes them, and stores the result in the HDFS output directory you specified (/hdfsoutput).
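Note that the output directory must not already exist, otherwise the job will fail. If you want to re-run the job, remove it first:

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -rm -r /hdfsoutput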
Output of the previous command:
14/11/17 13:43:17 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/11/17 13:43:18 INFO input.FileInputFormat: Total input paths to process : 1
14/11/17 13:43:18 INFO mapreduce.JobSubmitter: number of splits:1
14/11/17 13:43:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416199619656_0004
14/11/17 13:43:19 INFO impl.YarnClientImpl: Submitted application application_1416199619656_0004
14/11/17 13:43:19 INFO mapreduce.Job: The url to track the job: http://rohan-Vostro-3446:8088/proxy/application_1416199619656_0004/
14/11/17 13:43:19 INFO mapreduce.Job: Running job: job_1416199619656_0004
14/11/17 13:43:26 INFO mapreduce.Job: Job job_1416199619656_0004 running in uber mode : false
14/11/17 13:43:26 INFO mapreduce.Job:  map 0% reduce 0%
14/11/17 13:43:34 INFO mapreduce.Job:  map 100% reduce 0%
14/11/17 13:43:42 INFO mapreduce.Job:  map 100% reduce 100%
14/11/17 13:43:43 INFO mapreduce.Job: Job job_1416199619656_0004 completed successfully
14/11/17 13:43:43 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=267013
        FILE: Number of bytes written=728217
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=661905
        HDFS: Number of bytes written=196183
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=6012
        Total time spent by all reduces in occupied slots (ms)=5116
        Total time spent by all map tasks (ms)=6012
        Total time spent by all reduce tasks (ms)=5116
        Total vcore-seconds taken by all map tasks=6012
        Total vcore-seconds taken by all reduce tasks=5116
        Total megabyte-seconds taken by all map tasks=6156288
        Total megabyte-seconds taken by all reduce tasks=5238784
    Map-Reduce Framework
        Map input records=12760
        Map output records=109844
        Map output bytes=1086544
        Map output materialized bytes=267013
        Input split bytes=98
        Combine input records=109844
        Combine output records=18039
        Reduce input groups=18039
        Reduce shuffle bytes=267013
        Reduce input records=18039
        Reduce output records=18039
        Spilled Records=36078
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=92
        CPU time spent (ms)=7090
        Physical memory (bytes) snapshot=439283712
        Virtual memory (bytes) snapshot=1424547840
        Total committed heap usage (bytes)=355467264
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=661807
    File Output Format Counters
        Bytes Written=196183
Check your HDFS output directory:
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -ls /hdfsoutput
Found 2 items
-rw-r--r--   1 hduser supergroup          0 2014-11-06 23:21 /hdfsoutput/_SUCCESS
-rw-r--r--   1 hduser supergroup      40923 2014-11-06 23:21 /hdfsoutput/part-r-00000
You can see the result, which is stored in the HDFS directory /hdfsoutput. The empty _SUCCESS file is just a marker indicating that the job completed, while part-r-00000 contains the actual word counts written by the single reducer:
hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -cat /hdfsoutput/part-r-00000
I used http://www.gutenberg.org/cache/epub/20417/pg20417.txt as the input file.
The output of the previous command would be as follows (only a part of the output is shown):
works--the	1
works.	6
works;	1
world	45
world!	1
world's	1
world,	11
world--even	1
world--weighs	1
world-cloud	1
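If you want the result on the local file system as a single file, you can merge and copy it out of HDFS; the local destination path below is just an example:

hduser@rohan-Vostro-3446:/usr/local/hadoop/bin$ hadoop fs -getmerge /hdfsoutput /home/rohan/wordcount-result.txt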