Node and Disk Balancer in Hadoop

Node and disk balancing in Hadoop is an important concept that cluster admins use to keep all nodes, and the volumes (disks) within those nodes, in an equilibrium state. Node balancing is different from disk balancing: node balancing ensures equal storage utilization across the datanodes (an inter-node concept), whereas disk balancing ensures that the disks of a single datanode are used proportionately (an intra-node concept). To understand the two, refer to Figure A below, which shows the current and ideal state of a cluster.

Figure A: Ideal Situation (Node and Disk Balancer in Hadoop)

As the figure above shows, all disks within a node are almost equally utilized, and the same holds when we compare the datanodes with each other. But imagine a scenario in which you do either of these two things:

    1. Delete a huge file from HDFS.
    2. Add a new volume/disk to the datanode.

Either action creates an imbalance in HDFS storage utilization within the node: the disks are no longer utilized proportionately. Instead, a few disks are heavily loaded while another is almost empty. Refer to Figure B below, which shows what happens when a new volume is added to a node and the problem it creates:

 

Figure B: Storage Imbalance (Node and Disk Balancer in Hadoop)
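To make the imbalance concrete, here is a small worked example in Python. The disk sizes and usage figures are hypothetical (three 1000 GB disks at 80% plus one newly added empty disk) and are not taken from the figure. It computes the per-disk utilization, the utilization every disk would have if the node were perfectly balanced, and how much data each disk holds above or below that ideal.

```python
def disk_utilization(used_gb, capacity_gb):
    """Return the used fraction of each disk."""
    return [u / c for u, c in zip(used_gb, capacity_gb)]


def ideal_utilization(used_gb, capacity_gb):
    """Used fraction every disk would have if the node were balanced."""
    return sum(used_gb) / sum(capacity_gb)


def surplus_gb(used_gb, capacity_gb):
    """Data held above (+) or below (-) the balanced share, per disk."""
    total_used, total_cap = sum(used_gb), sum(capacity_gb)
    return [u - c * total_used / total_cap
            for u, c in zip(used_gb, capacity_gb)]


# Hypothetical datanode: three disks at 80% and one freshly added empty disk.
capacity = [1000, 1000, 1000, 1000]
used = [800, 800, 800, 0]

per_disk = disk_utilization(used, capacity)
ideal = ideal_utilization(used, capacity)
surplus = surplus_gb(used, capacity)

print(per_disk)  # [0.8, 0.8, 0.8, 0.0]
print(ideal)     # 0.6
print(surplus)   # [200.0, 200.0, 200.0, -600.0]
```

In this example each old disk would need to give up about 200 GB to the new disk for the node to be balanced again.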

Hadoop provides two options for placing new blocks on the disks of a datanode:

  1. Fair or round-robin allocation (RoundRobinVolumeChoosingPolicy, the default): when the namenode decides to store the next set of blocks on a datanode, all disks get an equal share in terms of the number of blocks stored on them. Do you see any drawback here? Yes, the problem with this scheme is that an existing utilization gap between the disks of a datanode will always remain.
  2. Available-space-based allocation (AvailableSpaceVolumeChoosingPolicy): the datanode service tries to minimize the intra-node disk utilization gap by preferring the disk with the greatest available space for new blocks. Do you see any problem here as well? Of course: this creates a hotspot, where most of the writes for that datanode are handled by only one disk, which reduces parallelism (for write requests as well as read requests).
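The trade-off between the two options can be illustrated with a toy simulation. This is a deliberately simplified sketch of the two strategies as described above, not Hadoop's actual implementation; the function names, the starting free space, and the fixed block size of 1 unit are all assumptions made for illustration.

```python
def round_robin(free, num_blocks, block_size=1):
    """Place blocks on the disks in turn.

    Returns (free space per disk, blocks written per disk).
    Assumes every disk has room for its share of the blocks.
    """
    free = list(free)
    writes = [0] * len(free)
    for i in range(num_blocks):
        disk = i % len(free)
        free[disk] -= block_size
        writes[disk] += 1
    return free, writes


def most_available(free, num_blocks, block_size=1):
    """Always place the next block on the disk with the most free space."""
    free = list(free)
    writes = [0] * len(free)
    for _ in range(num_blocks):
        disk = max(range(len(free)), key=lambda i: free[i])
        free[disk] -= block_size
        writes[disk] += 1
    return free, writes


# Hypothetical node: three nearly full disks and one newly added disk.
free_start = [200, 200, 200, 1000]

rr_free, rr_writes = round_robin(free_start, 400)
as_free, as_writes = most_available(free_start, 400)

print(rr_writes, rr_free)  # [100, 100, 100, 100] [100, 100, 100, 900]
print(as_writes, as_free)  # [0, 0, 0, 400] [200, 200, 200, 600]
```

Round robin spreads the writes evenly but leaves the 800-unit gap untouched, while the available-space strategy halves the gap at the cost of sending every write to a single disk.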

Considering the pros and cons of the two options, and given that one of the main benefits of Hadoop is parallelism, the recommended approach is to use round-robin allocation and, at the same time, proactively perform intra-node disk balancing. The solution is simple: check and rebalance the disks within each datanode. Hadoop provides functionality to rebalance the disks of datanodes without taking them offline. The steps are:

    1. Enable disk rebalancing in your cluster by setting the dfs.disk.balancer.enabled property to true in hdfs-site.xml on all datanodes.
    2. Check that it is enabled by running the hdfs command in a terminal. The diskbalancer subcommand must appear in the output; if it doesn't, redeploy the latest configuration to all the datanodes.
       Figure C: Disk Balancer Utility (Node and Disk Balancer in Hadoop)
    3. Execute the script below from any host in your cluster to trigger intra-node disk rebalancing:

 

The script works as follows:

  1. Using the hdfs dfsadmin -report command, it extracts a list of all the live datanodes in the cluster.
  2. For each datanode in the cluster, it checks whether disk balancing is required and, if so, creates a plan. The plan is a JSON file that maps how many bytes have to be moved from an identified disk to another disk in the same datanode. The plan file looks like:

  3. Finally, it submits the plan using the hdfs diskbalancer functionality to start the balancing job, which runs in the background.
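The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a tested production script: the helper names and the plan_dir argument are hypothetical, the report parser assumes each live datanode appears on a "Hostname:" line of the dfsadmin report, and the plan location assumes that hdfs diskbalancer -plan with -out writes a file named <host>.plan.json into the given HDFS directory. Verify all three against your Hadoop version before relying on it.

```python
import subprocess


def parse_live_datanodes(report_text):
    """Collect hostnames from the 'Hostname:' lines of a dfsadmin report."""
    hosts = []
    for line in report_text.splitlines():
        line = line.strip()
        if line.startswith("Hostname:"):
            hosts.append(line.split(":", 1)[1].strip())
    return hosts


def plan_path(host, plan_dir):
    # Assumption: the plan is written as <plan_dir>/<host>.plan.json in HDFS.
    return f"{plan_dir}/{host}.plan.json"


def plan_command(host, plan_dir):
    return ["hdfs", "diskbalancer", "-plan", host, "-out", plan_dir]


def execute_command(host, plan_dir):
    return ["hdfs", "diskbalancer", "-execute", plan_path(host, plan_dir)]


def rebalance_cluster(plan_dir, run=subprocess.run):
    """Plan and submit disk balancing for every live datanode."""
    # Step 1: list the live datanodes.
    report = run(["hdfs", "dfsadmin", "-report", "-live"],
                 capture_output=True, text=True, check=True).stdout
    for host in parse_live_datanodes(report):
        # Step 2: ask the disk balancer to create a plan for this node.
        if run(plan_command(host, plan_dir)).returncode != 0:
            print(f"{host}: planning failed, skipping")
            continue
        # No plan file is expected when the node is already balanced.
        exists = run(["hdfs", "dfs", "-test", "-e", plan_path(host, plan_dir)])
        if exists.returncode != 0:
            print(f"{host}: no plan generated, nothing to do")
            continue
        # Step 3: submit the plan; the datanode balances in the background.
        run(execute_command(host, plan_dir), check=True)
        print(f"{host}: plan submitted")
```

Calling rebalance_cluster("/system/diskbalancer/manual") from any host with the HDFS client configured would walk through all live datanodes; the directory name here is only an example.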

 

I hope this article has helped you understand the difference between inter-node and intra-node data balancing (usually referred to as node and disk balancing in Hadoop) and how to implement it generically across all your clusters.

 

3 thoughts on “Node and Disk Balancer in hadoop”

  1. Excellent blog Ritwick. Specific and crisp solution strategies for a complex topic like Node Balancing.
    I like the no-nonsense way of getting to the point and the screen captures add a very real feel that many technical blogs lack.

    Looking forward to more blogs from you on Big Data & Advanced Analytics!
