Blog

  • In this blog I am going to share some conceptual aspects of Sharding in MongoDB. If you want to jump to implementation of sharding please follow next blog Scaling MongoDB and setting up sharding for MongoDB on Ubuntu

    But before you should understand some basic facts about sharding.

    Question: What is sharding?

    Answer:
    Sharding is splitting the data of a database collection on more than one server.MongoDB handles all the following complexities automatically :

    • Distributing data over all database servers
    • Balancing data load at all servers
    • Cope up if a DB server gets down
    • Fetching the data for a query from a server in an optimized way



    Question: What is the logic behind splitting collection data over more than one server?
    Answer:
    For implementing sharding over a collection, you have to select a shard key. Shard key can be a field or set of fields in collection. Every server (shard) has a unique range in shard key value.
    I can explain the logic with an example:
    let us say you have a collection called userList.
    You choose the shard key as username.
    Lets assume that you have 3 servers for distributing collection data (Server1, Server2, Server3).

    So initially you have all the data on single server(shard) i.e. Server1.
    range for shard key for server1  is [ -∞ , ∞)
    [ a, b )  means a =< x < b

    As your data increases enormously, data starts dividing on multiple servers (Mongodb handles this dividing and load balancing automatically).
    Now the shard key set divides itself into 3 subsets.For example it can be like this:
    server 1 : [ -∞ , f )
    server 2 : [ f , p )
    server 3 : [ p , ∞ )

    Shard key splits itself into ranges of shard key value.That is why this method is known as “Range Based Splitting”.
    All subsets of a shard key set are mutually exclusive.

    So now all user names starting with character value less than “f” are stored on server1.
    While searching for the data it searches with a query “get document be username” and depending on the result it goes to the proper server.


    Question: What are the important facts about shard key?
    Answer:
    A value in shard key cannot be null.
    You cannot update the value of shard key.
    Selecting a proper shard key is one of the most important aspect of sharding a collection.Shard key effects on two main aspects of clustering data:

    • balancing data over the servers
    • fetching data over queries on a collection

    That is why schema design for database in MongoDB becomes very important.


    Question: How MongoDB sorts data for a field?
    Answer:
    The way of sorting data in a field is:
    null < numbers < strings < objects < arrays < binary data < ObjectIds < booleans < dates < regular expression


    Question: What is a Shard in MongoDB?
    Answer:
    A shard is one or more servers in a cluster that are responsible for some subset of the data.For example if we had a cluster that have 5,000,000 documents than one shard might contain 1,000,000 documents.


    Question: What are chunks in MongoDB?
    Answer:
    Each Shard is divided into one more chunks of data.Following points will elaborate information about chunks:

    • If data increases on a shard it automatically creates a new chunk and starts filling it up.
    • If the difference between number of chunks of two shard increases a limit, it automatically transfers some chunks to the shard having less chunks to make a balance of data between two shards.
    • Usually the size of a chunk is about 64 MB. If difference in number of chunks between two shards is 9 or more than chunks start transferring from one shard to another.(Note: These values are just estimates. I am not 100% sure about this as mongodb handles all these complexities automatically.)
    • If you add a new shard to the cluster than some chunks from all the previously existing shards are transferred to the new chunk to make a balance in cluster.
    • The reason of choosing chunk size as 64 MB is that, this value is neither too small nor too large.If chunk size become smaller than it will reduce performance by transferring chunks all the time.If chunk size becomes larger than all data will start piling up on single shard and make the data disbalanced.

    NOTE: In mongodb you can set the size of a chunk to a value as your wish.But be sure that it can effect on balancing the load of data. So try not to do it.


    Question: What is mongos ?
    Answer:
    mongos is the interaction point between users and the cluster.Its job is to have a single server interface with the user.
    When you use a cluster, you connect to mongos and issue all reads and writes to that mongos.You should never access the shards directly (although you can if you want).
    mongos forwards all user requests to the appropriate shards. If a user inserts a document,
    mongos looks at the document’s shard key, looks at the chunks, and sends the
    document to the shard holding the correct chunk.
    For example, say we insert {"foo" : "bar"} and we’re sharding on foo. mongos looks
    at the chunks available and sees that there is a chunk with the range ["a”, “c”), which
    is the chunk that should contain “bar”.
     





    Question: What is difference between shard server and config server?
    Answer:
    Shard server is used to store the actual data.
    Config server are special mongod processes that are used to store information about cluster.It stores all the information related to shards , mongos process and system administrator.
    For a successful migration of data to shard server all config servers must be up.If any config server is down, while chunk migration from one shard to another, whole process is revert back.


    Anatomy of Cluster
    A MongoDB cluster basically consists of three types of processes: the shards for actually
    storing data, the mongos processes for routing requests to the correct data, and the
    config servers, for keeping track of the cluster’s state.

     

    Each of the components above is not “a machine.” mongos processes are usually run
    on appservers. Config servers, for all their importance, are pretty lightweight and can
    basically be run on any machine available. Each shard usually consists of multiple machines,as that’s where the data actually lives.

     

     

    Akash Sharma

    akash.sharma@oodlestechnologies.com

Tags: nosql , bigdata , mongodb , sharding , clustering