Mongo-Hadoop Connector

Posted By : Nishtha Singh | 01-Apr-2014

When to use Hadoop and MongoDB together

 

MongoDB, being a document-oriented database, has its own aggregation functionality, which helps with data analysis. But when complex data analysis is needed, complex data aggregation is required, and that is what Hadoop provides. This is referred to as batch aggregation, one of the most useful features Hadoop offers.
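To make the distinction concrete, the sketch below evaluates a simple `$group`/`$sum` aggregation stage in plain Python over illustrative documents. The collection and field names are hypothetical; the commented pipeline shows how the same aggregation would be expressed to MongoDB itself.

```python
from collections import defaultdict

# Sample documents as they might appear in a MongoDB collection
# (collection name and fields are illustrative, not from the article).
orders = [
    {"_id": 1, "category": "books", "amount": 10},
    {"_id": 2, "category": "books", "amount": 5},
    {"_id": 3, "category": "music", "amount": 7},
]

# The same aggregation expressed as a MongoDB pipeline would be:
#   db.orders.aggregate([{"$group": {"_id": "$category",
#                                    "total": {"$sum": "$amount"}}}])
# Here the $group/$sum stage is evaluated in plain Python to show
# what the server computes.
totals = defaultdict(int)
for doc in orders:
    totals[doc["category"]] += doc["amount"]

result = [{"_id": cat, "total": tot} for cat, tot in sorted(totals.items())]
print(result)  # [{'_id': 'books', 'total': 15}, {'_id': 'music', 'total': 7}]
```

Aggregations of this shape are well within MongoDB's built-in capabilities; it is the multi-stage, heavyweight transformations beyond this that call for Hadoop.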

In batch aggregation, data is fetched from MongoDB, processed via Hadoop, and the results are written back to MongoDB. But if the data is not too large or complex, it is better to avoid Hadoop: HDFS is not native to MongoDB, and MongoDB has its own way of scaling and working with data stored across multiple machines.
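The fetch, process, write-back cycle above can be sketched in miniature. This is a conceptual simulation only: plain Python lists stand in for the MongoDB collections, and a word-count map/reduce stands in for the Hadoop job (in a real deployment the connector handles all three stages).

```python
from collections import Counter

# Stand-ins for MongoDB collections (illustrative data, not a real driver).
source_collection = [
    {"_id": 1, "text": "big data"},
    {"_id": 2, "text": "big analysis"},
]
results_collection = []  # stands in for the output collection

# "Map" phase: emit (word, 1) pairs from each fetched document.
pairs = [(word, 1)
         for doc in source_collection
         for word in doc["text"].split()]

# "Reduce" phase: sum the counts per word.
counts = Counter()
for word, n in pairs:
    counts[word] += n

# Write-back phase: store aggregated results as documents.
for word, total in sorted(counts.items()):
    results_collection.append({"_id": word, "count": total})

print(results_collection)
```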

 

Also, consider a production environment where an application's data is stored in multiple datastores, each with its own functionality and query language. Hadoop can solve this problem by acting as a centralized repository, and MapReduce jobs can be used to load data from MongoDB into Hadoop for processing.

What is Mongo-Hadoop connector

The Mongo-Hadoop connector is an open-source plugin for Hadoop that allows MongoDB to be used, instead of HDFS, as a source and sink of data. Using this connector, the user can specify a query, and the connector breaks the results of that query into input splits for Hadoop. Results are written back to MongoDB by the Hadoop reducer; HDFS plays no role in either of these operations.
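The idea of breaking query results into input splits can be sketched as follows. This is a hypothetical illustration of splitting by `_id` range, similar in spirit to what the connector does internally; the function name and split size are illustrative and not part of the connector's API.

```python
def make_splits(min_id, max_id, docs_per_split):
    """Return (start, end) _id ranges that together cover [min_id, max_id].

    Each range would become one Hadoop input split, processed by one
    map task independently of the others.
    """
    splits = []
    start = min_id
    while start <= max_id:
        end = min(start + docs_per_split - 1, max_id)
        splits.append((start, end))
        start = end + 1
    return splits

# 100 documents with integer _ids 0..99, at most 40 documents per split:
print(make_splits(0, 99, 40))  # [(0, 39), (40, 79), (80, 99)]
```

Because each split is an independent range of the query result, Hadoop can schedule the map tasks in parallel without any coordination through HDFS.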

Why use Mongo-Hadoop connector

There are two alternative approaches to using the Mongo-Hadoop connector:

either running MapReduce in MongoDB directly, or performing a three-stage operation, i.e. loading the data from MongoDB into HDFS, running Hadoop MapReduce, and importing the output back into MongoDB. Both of these approaches have drawbacks for complex operations on large data sets. The problems with the Mongo-MapReduce approach are:

 

(1) the language for MapReduce scripts is JavaScript, which is slow and has poor analytics libraries, and

(2) the SpiderMonkey JavaScript implementation used by MongoDB is not thread-safe, so only one MapReduce job can run at a time.

 

Also, the three-stage approach is inconvenient and requires a large amount of database and HDFS I/O. The Mongo-Hadoop connector, which allows the user to leave the input data in the database, is thus an attractive option to explore. The connector can optionally write the output to HDFS instead, which allows for different combinations of read and write resources.

Steps to use Mongo-Hadoop connector:

1) Set up MongoDB version 2.4.9.

2) Setup Hadoop on your system from one of the following versions:

  • 0.20/0.20.x

  • 1.0/1.0.x

  • 0.21/0.21.x

  • CDH3

  • CDH4

and follow the link Install and setup Hadoop to set up Hadoop on your system.

 

3) The next step is to build the Mongo-Hadoop adapter. The prerequisite is that Hadoop should be up and running. Git and JDK 1.6 should also be installed.

 

4) The link below is a step-by-step guide to running MapReduce on some of the examples from Git using the Mongo-Hadoop connector. Just follow the steps and you will be able to run MapReduce using the Mongo-Hadoop connector.

 

Mongo-Hadoop Connector

About Author

Nishtha Singh

Nishtha is a bright Groovy and Grails developer and has worked on the development of various SaaS applications using Grails technologies. Nishtha's hobbies are poetry and glass painting.
