Installing Hadoop (Pseudo Mode) on Ubuntu and other Linux environments

Posted By : Abhimanyu Singh | 15-Nov-2012

Hadoop enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop is used mainly for distributed data processing, but it can also be installed in standalone mode and pseudo-distributed mode. In this blog we will look at installing Hadoop in pseudo-distributed mode.

Installation Modes of Hadoop

Standalone Mode - Hadoop is installed on the local system and runs as a single Java process; useful for development and debugging.

Pseudo-Distributed Mode - All Hadoop daemons run on a single machine, each in its own JVM, simulating a small cluster on one system.

Fully-Distributed Mode - In this mode Hadoop is fully utilized: it processes data distributed over a number of systems in a real cluster.

As we are looking at a pseudo-distributed installation of Hadoop, before going further let us have a look at how Hadoop is configured in this mode.

[Figure: Hadoop configuration in pseudo-distributed mode]
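In this mode the configuration points Hadoop at a local HDFS instance and a local Jobtracker, with a block replication factor of 1. The hadoop-0.20-conf-pseudo package installed later in this post ships such a configuration ready-made; the following is only a sketch of the key properties it typically sets (exact values may differ slightly in your CDH3 install):

    <!-- core-site.xml : the default filesystem -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:8020</value>
    </property>

    <!-- hdfs-site.xml : one copy of each block is enough on a single machine -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

    <!-- mapred-site.xml : the local Jobtracker -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:8021</value>
    </property>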

Pre-requisites for Hadoop installation

  • Operating-system - GNU/Linux
  • JAVA - Java must be installed on your system for Hadoop to work; Java 1.6 or later is required.
  • ssh must be enabled on the machine.
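A quick way to verify these prerequisites from a terminal (output will vary by system):

    java -version              # should report version 1.6 or later
    ssh -V                     # confirms the ssh client is present
    dpkg -l openssh-server     # confirms the ssh server package on Ubuntu/Debian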

Hadoop daemons run on the JVM. Namenode, Jobtracker, Secondary Namenode, Tasktracker and Datanode are the daemons. Let us look at the functionality of these daemons; it will give a little idea of how Hadoop works.

  1. Namenode : The Namenode is the master server in Hadoop; it manages the file system namespace and access to the files stored in the cluster.
  2. Secondary Namenode : It provides periodic checkpoints and housekeeping tasks for the Namenode.
  3. Datanode : The Datanode manages the storage attached to a node; there can be many such nodes in a cluster. Each node storing data will have a datanode daemon running.
  4. Jobtracker : It is responsible for scheduling and distributing MapReduce work across the data nodes.
  5. Tasktracker : The Tasktracker runs the actual map and reduce tasks assigned to it by the Jobtracker.

INSTALLATION

    (on Ubuntu)
    1. Install JAVA on the machine.
    2. sudo add-apt-repository ppa:webupd8team/java

      sudo apt-get update

      sudo apt-get install oracle-java7-installer

Set the JAVA_HOME and PATH environment variables for Java in ~/.bashrc, as sketched below.
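      For example, with the Oracle Java 7 package installed above, the lines in ~/.bashrc would look like this (the JDK path is an assumption; adjust it to wherever Java landed on your system):

        export JAVA_HOME=/usr/lib/jvm/java-7-oracle   # assumed install path
        export PATH=$PATH:$JAVA_HOME/bin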

    3. Run the command “# ssh localhost” to check whether ssh is installed and working.
    4. If ssh is not installed, run the command
    5. sudo apt-get install openssh-server openssh-client
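      If “ssh localhost” prompts for a password, you can optionally enable key-based login with the standard OpenSSH commands (a convenience, sketched here under the assumption that no key pair exists yet):

        ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa        # key pair with an empty passphrase
        cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
        chmod 600 ~/.ssh/authorized_keys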
    6. Create a new file
    7. # vi /etc/apt/sources.list.d/cloudera.list
    8. Add the following lines to the file created above
    9.     deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
          deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
          

where <RELEASE> is the codename of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 on Ubuntu Lucid, use lucid-cdh3 in the lines above.

I used the release maverick-cdh3; the finished file is shown below.
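      With that release, /etc/apt/sources.list.d/cloudera.list would read:

        deb http://archive.cloudera.com/debian maverick-cdh3 contrib
        deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib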

    10. Add the Cloudera GPG public key using the following command
    11.     #  curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add - 
          

It prints “OK” on successful execution.

    12. Now update the package index using the following command
    13.     sudo apt-get update
      
    14. Search for Hadoop-related packages using the following command
    15.    # apt-cache search hadoop
         
    16. Install libzip1 (the package below is the 32-bit build; on a 64-bit system install the matching amd64 .deb instead)
    17.     # wget http://security.ubuntu.com/ubuntu/pool/main/libz/libzip/libzip1_0.9-3ubuntu0.1_i386.deb
          # sudo dpkg -i libzip1_0.9-3ubuntu0.1_i386.deb
      
    18. Install Hadoop and all the daemons on the machine
    19.      # sudo apt-get install hadoop-0.20-conf-pseudo
      
    20. Go to {hadoop_installation_directory}/conf - for the CDH3 package this is typically /etc/hadoop-0.20/conf
    21. edit hadoop-env.sh

      add export JAVA_HOME=<path to your Java installation> as the last line, as sketched below.
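      With the Oracle Java 7 package from step 2, the line would typically be (the path is an assumption; check yours with ls /usr/lib/jvm):

        export JAVA_HOME=/usr/lib/jvm/java-7-oracle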

    22. Starting all the daemons on the machine - run the below command in your terminal once for each daemon
    23.
      # sudo /etc/init.d/hadoop-0.20-<daemon> start

      where <daemon> is one of: namenode, secondarynamenode, datanode, jobtracker, tasktracker
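      To start everything in one go and then verify that all five daemons came up, a minimal sketch (assuming the CDH3 init scripts above; jps ships with the Sun/Oracle JDK):

        for daemon in namenode secondarynamenode datanode jobtracker tasktracker; do
            sudo /etc/init.d/hadoop-0.20-$daemon start
        done
        sudo jps    # should list NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker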

    24. Testing the Hadoop installation
    25. run the below command in the CLI

            # hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100000
      

if the above command prints the value of pi successfully, it is a sign of a successful Hadoop installation.

Output of the above command:

      oodles@oodles-Inspiron-N5050:/usr/lib/hadoop$ hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100000
      Number of Maps  = 2
      Samples per Map = 100000
      Wrote input for Map #0
      Wrote input for Map #1
      Starting Job
      12/11/05 10:51:20 INFO mapred.FileInputFormat: Total input paths to process : 2
      12/11/05 10:51:21 INFO mapred.JobClient: Running job: job_201211050809_0001
      12/11/05 10:51:22 INFO mapred.JobClient:  map 0% reduce 0%
      12/11/05 10:51:31 INFO mapred.JobClient:  map 100% reduce 0%
      12/11/05 10:51:41 INFO mapred.JobClient:  map 100% reduce 100%
      12/11/05 10:51:43 INFO mapred.JobClient: Job complete: job_201211050809_0001
      12/11/05 10:51:43 INFO mapred.JobClient: Counters: 27
      12/11/05 10:51:43 INFO mapred.JobClient:   Job Counters 
      12/11/05 10:51:43 INFO mapred.JobClient:     Launched reduce tasks=1
      12/11/05 10:51:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=13288
      12/11/05 10:51:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
      12/11/05 10:51:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
      12/11/05 10:51:43 INFO mapred.JobClient:     Launched map tasks=2
      12/11/05 10:51:43 INFO mapred.JobClient:     Data-local map tasks=2
      12/11/05 10:51:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10035
      12/11/05 10:51:43 INFO mapred.JobClient:   FileSystemCounters
      12/11/05 10:51:43 INFO mapred.JobClient:     FILE_BYTES_READ=50
      12/11/05 10:51:43 INFO mapred.JobClient:     HDFS_BYTES_READ=482
      12/11/05 10:51:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=180803
      12/11/05 10:51:43 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
      12/11/05 10:51:43 INFO mapred.JobClient:   Map-Reduce Framework
      12/11/05 10:51:43 INFO mapred.JobClient:     Map input records=2
      12/11/05 10:51:43 INFO mapred.JobClient:     Reduce shuffle bytes=56
      12/11/05 10:51:43 INFO mapred.JobClient:     Spilled Records=8
      12/11/05 10:51:43 INFO mapred.JobClient:     Map output bytes=36
      12/11/05 10:51:43 INFO mapred.JobClient:     Total committed heap usage (bytes)=379191296
      12/11/05 10:51:43 INFO mapred.JobClient:     CPU time spent (ms)=2200
      12/11/05 10:51:43 INFO mapred.JobClient:     Map input bytes=48
      12/11/05 10:51:43 INFO mapred.JobClient:     Combine input records=0
      12/11/05 10:51:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=246
      12/11/05 10:51:43 INFO mapred.JobClient:     Reduce input records=4
      12/11/05 10:51:43 INFO mapred.JobClient:     Reduce input groups=2
      12/11/05 10:51:43 INFO mapred.JobClient:     Combine output records=0
      12/11/05 10:51:43 INFO mapred.JobClient:     Physical memory (bytes) snapshot=399249408
      12/11/05 10:51:43 INFO mapred.JobClient:     Reduce output records=0
      12/11/05 10:51:43 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1193230336
      12/11/05 10:51:43 INFO mapred.JobClient:     Map output records=4
      Job Finished in 23.885 seconds
      Estimated value of Pi is 3.14118000000000000000
      

      Test 2

        1. # hadoop fs -mkdir /foo
        2. # hadoop fs -ls /

           output:
             drwxr-xr-x   - oodles supergroup          0 2012-11-05 10:55 /foo
             drwxr-xr-x   - oodles supergroup          0 2012-11-03 21:24 /user
             drwxr-xr-x   - mapred supergroup          0 2012-11-03 21:24 /var

        3. # hadoop fs -rmr /foo

           output:
             Deleted hdfs://localhost:8020/foo
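      As a further sanity check, you can round-trip a local file through HDFS (the /demo path is just an example):

        # hadoop fs -mkdir /demo
        # hadoop fs -put /etc/hosts /demo/hosts
        # hadoop fs -cat /demo/hosts
        # hadoop fs -rmr /demo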
      
    26. Checking the status of live datanodes
    27. http://localhost:50070/dfshealth.jsp

    28. Checking the status of finished jobs
    29. http://localhost:50030/jobtracker.jsp

You should see the pi estimator job among the completed jobs if Test 1 ran successfully on your machine.

About Author

Abhimanyu Singh

Abhimanyu is a seasoned technologist. He always keeps himself ahead in embracing and adopting new technologies and frameworks to solve business problems. He specialises in Blockchain technology, Video Content Management and enterprise software.
