Installing Hadoop (Pseudo Mode) on Ubuntu and other Linux environments
November 15, 2012 by abhimanyu
Hadoop enables applications to work with thousands of computationally independent computers and petabytes of data. It is used mainly for distributed data processing, but it can also be installed in standalone mode or pseudo-distributed mode. In this post we will install Hadoop in pseudo-distributed mode.
Installation Modes of Hadoop
Standalone Mode - Hadoop is installed and runs on the local system only.
Pseudo-Distributed Mode - All the Hadoop daemons run on a single machine, each in its own JVM, so one system behaves like a small cluster.
Fully-Distributed Mode - In this mode Hadoop is fully utilized: it processes data distributed over a number of systems.
Since we are installing Hadoop in pseudo-distributed mode, let us look at how Hadoop is set up in this mode before going further.
Pre-requisites for Hadoop installation
- Operating-system - GNU/Linux
- Java - Java 1.6 or later must be installed on your operating system for Hadoop to work.
- ssh must be enabled on the machine.
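The prerequisites above can be checked from a shell before starting. The following is only a minimal sketch: the function name have_cmd is made up for illustration, and it tests nothing more than whether the java and ssh commands are on the PATH. It does not check that Java is version 1.6 or later or that the ssh server is running.

```shell
# have_cmd is a hypothetical helper: it succeeds if the named command
# can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report each prerequisite command without stopping at the first miss.
for cmd in java ssh; do
    if have_cmd "$cmd"; then
        echo "$cmd: found"
    else
        echo "$cmd: missing"
    fi
done
```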
Hadoop daemons run on the JVM. The daemons are Namenode, Jobtracker, Secondary Namenode, Tasktracker and Datanode. Let us look at what each daemon does; it gives a rough idea of how Hadoop works.
- Namenode : The Namenode is the master server in Hadoop. It manages the file system namespace and access to the files stored in the cluster.
- Secondary Namenode : It performs periodic checkpoints and housekeeping tasks.
- Datanode : The Datanode manages the storage attached to a node; a cluster can have many such nodes. Each node that stores data runs a Datanode daemon.
- Jobtracker : It is responsible for scheduling work and distributing it across the data nodes.
- Tasktracker : It performs the actual tasks.
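Since each daemon is a separate JVM, the JDK's jps tool can show which of them are up once Hadoop is running. The sketch below assumes the daemons appear in the jps listing under the class names NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker; missing_daemons is a hypothetical helper name, not part of Hadoop.

```shell
# missing_daemons is a hypothetical helper. It reads `jps` output on
# stdin (lines such as "2817 NameNode") and prints every expected
# daemon class that does not appear in it.
missing_daemons() {
    listing=$(cat)
    for name in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare against the whole second field, so that "NameNode"
        # is not satisfied by a "SecondaryNameNode" line.
        if ! printf '%s\n' "$listing" |
            awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            echo "$name"
        fi
    done
}
```

A possible use is "sudo jps | missing_daemons", which prints nothing when all five daemons are listed. jps only shows JVMs that the calling user is allowed to see, which is why sudo may be needed here.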
- Install Java on the machine:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
Set the JAVA_HOME and PATH environment variables for Java in ~/.bashrc.
- Run the command "ssh localhost" to check whether ssh is installed.
- If ssh is not installed, run: sudo apt-get install openssh-server openssh-client
- Create a new file: # vi /etc/apt/sources.list.d/cloudera.list
- Add the repository lines to that file (the commands for this and the remaining steps are listed, in the same order, under INSTALLATION below).
- Add the repository's GPG public key.
- Update the package index.
- Search for Hadoop-related packages.
- Install libzip1.
- Install Hadoop and all its daemons on the machine.
- Go to {hadoop_installation_directory}/conf.
- Start all the daemons on the machine from your terminal.
- Test the Hadoop installation.
- Check the status of the live datanodes.
- Check the status of the finished jobs.
INSTALLATION
(on Ubuntu)
Contents of /etc/apt/sources.list.d/cloudera.list:
deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
where RELEASE is the codename of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 for Ubuntu Lucid, use lucid-cdh3 in the lines above.
I used the release maverick-cdh3.
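The two repository lines can also be generated from the codename instead of being typed by hand. This is a small sketch; cloudera_repo_lines is a hypothetical helper name, and the codename is passed in as an argument so that the output can be looked over before anything is written to disk.

```shell
# cloudera_repo_lines is a hypothetical helper: given a release codename
# (for example the output of `lsb_release -cs`), it prints the two lines
# that go into /etc/apt/sources.list.d/cloudera.list.
cloudera_repo_lines() {
    release=$1
    printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$release"
    printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$release"
}

# Show the lines for the release used in this post.
cloudera_repo_lines maverick
```

To write the file in one go, the output could be piped into sudo tee, for example: cloudera_repo_lines "$(lsb_release -cs)" | sudo tee /etc/apt/sources.list.d/cloudera.list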
# curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
It prints "OK" on successful execution.
sudo apt-get update
# apt-cache search hadoop
# wget http://security.ubuntu.com/ubuntu/pool/main/libz/libzip/libzip1_0.9-3ubuntu0.1_i386.deb
# sudo dpkg -i libzip1_0.9-3ubuntu0.1_i386.deb
# sudo apt-get install hadoop-0.20-conf-pseudo
Edit hadoop-env.sh.
Add export JAVA_HOME= followed by the path of your Java installation.
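If you are not sure which path to use for JAVA_HOME, it can usually be derived from the java binary that is on the PATH. The sketch below assumes GNU readlink (for the -f option) and a JDK laid out with bin/java, or jre/bin/java, under its home directory; java_home_of is a hypothetical helper name.

```shell
# java_home_of is a hypothetical helper: given the resolved path of the
# java binary, it strips the trailing /bin/java (and a /jre directory,
# if present) to get a directory usable as JAVA_HOME.
java_home_of() {
    dir=${1%/bin/java}
    dir=${dir%/jre}
    printf '%s\n' "$dir"
}

# Print a ready-to-paste export line for the java on the PATH, if any.
if command -v java >/dev/null 2>&1; then
    java_bin=$(readlink -f "$(command -v java)")
    echo "export JAVA_HOME=$(java_home_of "$java_bin")"
fi
```

The printed line is only a suggestion; check that the directory exists and contains bin/java before adding it to hadoop-env.sh.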
# /etc/init.d/hadoop-0.20-conf-
where the daemons are datanode, secondarynamenode, jobtracker, tasktracker and namenode.
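Starting five daemons one by one is repetitive, so a loop helps. The sketch below only prints the commands so they can be reviewed first. It assumes that each init script is named hadoop-0.20- followed by the daemon name; verify this against the contents of /etc/init.d on your machine. daemon_commands is a hypothetical helper name.

```shell
# daemon_commands is a hypothetical helper: it prints (rather than runs)
# one init-script command per daemon for the given action. The script
# name pattern hadoop-0.20-<daemon> is an assumption.
daemon_commands() {
    action=$1
    # HDFS daemons first, then the MapReduce daemons.
    for daemon in namenode secondarynamenode datanode jobtracker tasktracker; do
        printf 'sudo /etc/init.d/hadoop-0.20-%s %s\n' "$daemon" "$action"
    done
}

daemon_commands start
```

Once the printed commands look right, they can be executed with: daemon_commands start | sh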
Test 1
Run the command below on the command line:
# hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100000
If the command prints the value of pi, Hadoop has been installed successfully.
Output of the above command:
oodles@oodles-Inspiron-N5050:/usr/lib/hadoop$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100000
Number of Maps = 2
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
12/11/05 10:51:20 INFO mapred.FileInputFormat: Total input paths to process : 2
12/11/05 10:51:21 INFO mapred.JobClient: Running job: job_201211050809_0001
12/11/05 10:51:22 INFO mapred.JobClient: map 0% reduce 0%
12/11/05 10:51:31 INFO mapred.JobClient: map 100% reduce 0%
12/11/05 10:51:41 INFO mapred.JobClient: map 100% reduce 100%
12/11/05 10:51:43 INFO mapred.JobClient: Job complete: job_201211050809_0001
12/11/05 10:51:43 INFO mapred.JobClient: Counters: 27
12/11/05 10:51:43 INFO mapred.JobClient: Job Counters
12/11/05 10:51:43 INFO mapred.JobClient: Launched reduce tasks=1
12/11/05 10:51:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13288
12/11/05 10:51:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/11/05 10:51:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/11/05 10:51:43 INFO mapred.JobClient: Launched map tasks=2
12/11/05 10:51:43 INFO mapred.JobClient: Data-local map tasks=2
12/11/05 10:51:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10035
12/11/05 10:51:43 INFO mapred.JobClient: FileSystemCounters
12/11/05 10:51:43 INFO mapred.JobClient: FILE_BYTES_READ=50
12/11/05 10:51:43 INFO mapred.JobClient: HDFS_BYTES_READ=482
12/11/05 10:51:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=180803
12/11/05 10:51:43 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
12/11/05 10:51:43 INFO mapred.JobClient: Map-Reduce Framework
12/11/05 10:51:43 INFO mapred.JobClient: Map input records=2
12/11/05 10:51:43 INFO mapred.JobClient: Reduce shuffle bytes=56
12/11/05 10:51:43 INFO mapred.JobClient: Spilled Records=8
12/11/05 10:51:43 INFO mapred.JobClient: Map output bytes=36
12/11/05 10:51:43 INFO mapred.JobClient: Total committed heap usage (bytes)=379191296
12/11/05 10:51:43 INFO mapred.JobClient: CPU time spent (ms)=2200
12/11/05 10:51:43 INFO mapred.JobClient: Map input bytes=48
12/11/05 10:51:43 INFO mapred.JobClient: Combine input records=0
12/11/05 10:51:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=246
12/11/05 10:51:43 INFO mapred.JobClient: Reduce input records=4
12/11/05 10:51:43 INFO mapred.JobClient: Reduce input groups=2
12/11/05 10:51:43 INFO mapred.JobClient: Combine output records=0
12/11/05 10:51:43 INFO mapred.JobClient: Physical memory (bytes) snapshot=399249408
12/11/05 10:51:43 INFO mapred.JobClient: Reduce output records=0
12/11/05 10:51:43 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1193230336
12/11/05 10:51:43 INFO mapred.JobClient: Map output records=4
Job Finished in 23.885 seconds
Estimated value of Pi is 3.14118000000000000000
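The job prints a lot of counters, and the only line that matters for this test is the last one. As a small convenience, the estimate can be pulled out of the output with sed. This is just a sketch; pi_estimate is a hypothetical helper name and it relies on the "Estimated value of Pi is" wording shown in the output above.

```shell
# pi_estimate is a hypothetical helper: it reads the job's output on
# stdin and prints only the number from the
# "Estimated value of Pi is ..." line. It prints nothing if that line
# is absent, for example when the job failed.
pi_estimate() {
    sed -n 's/.*Estimated value of Pi is \([0-9.][0-9.]*\).*/\1/p'
}
```

For example: hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100000 2>&1 | pi_estimate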
Test 2
1. # hadoop fs -mkdir /foo
2. # hadoop fs -ls /
Output:
drwxr-xr-x - oodles supergroup 0 2012-11-05 10:55 /foo
drwxr-xr-x - oodles supergroup 0 2012-11-03 21:24 /user
drwxr-xr-x - mapred supergroup 0 2012-11-03 21:24 /var
3. # hadoop fs -rmr /foo
Output:
Deleted hdfs://localhost:8020/foo
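The three commands of Test 2 can be wrapped in one function that stops at the first failure. This is a sketch under assumptions: hdfs_smoke_test is a hypothetical helper name, and HADOOP_CMD is an illustrative variable (defaulting to hadoop) that lets the commands be previewed with echo instead of being run against HDFS.

```shell
# hdfs_smoke_test is a hypothetical helper that repeats Test 2 for a
# given HDFS directory: create it, list the root, then remove it again.
# HADOOP_CMD is an illustrative override; it defaults to "hadoop".
hdfs_smoke_test() {
    target=$1
    cmd=${HADOOP_CMD:-hadoop}
    # Each step runs only if the previous one succeeded.
    "$cmd" fs -mkdir "$target" &&
    "$cmd" fs -ls / &&
    "$cmd" fs -rmr "$target"
}

# Preview the three commands without touching HDFS.
HADOOP_CMD=echo hdfs_smoke_test /foo
```

Running hdfs_smoke_test /foo without the override executes the real commands; a non-zero exit status means one of the steps failed.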
Status of live datanodes: http://localhost:50070/dfshealth.jsp
Status of finished jobs: http://localhost:50030/jobtracker.jsp
If the pi job from the first test ran successfully on your machine, you will see the pi estimator job among the completed jobs.

