HDFS - Block Replication

About

HDFS stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Blockreport from each of the DataNodes in the cluster. A Blockreport contains a list of all blocks on a DataNode.

DataNode death may cause the replication factor of some blocks to fall below their specified value.

The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary.

Articles Related

State

A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode through a blockreport.

Algorithm

The section below describes an idealistic situation.

The algorithm may be influenced by the Storage Types and Storage Policies. The NameNode will take the policy into account for replica placement in addition to the rack awareness.

Replication of data blocks does not occur when the NameNode is in the Safemode state.

Replication factor = 3

When the replication factor is three, HDFS’s placement policy is to put:

one replica:
- on the local machine if the writer is on a datanode,
  - otherwise on a random datanode,
another replica on a node in a different (remote) rack,
and the last on a different node in the same remote rack.

This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure.

Replication factor > 3

If the replication factor is greater than 3, the placement of the 4th and following replicas are determined randomly while keeping the number of replicas per rack below the upper limit (which is basically (replicas - 1) / racks + 2).

Management

Block Size

The block size is configurable per file.

Replication Factor

The common case is 3.

An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later.

Because the NameNode does not allow DataNodes to have multiple replicas of the same block, maximum number of replicas created is the total number of DataNodes at that time.

Default value is in HDFS - Configuration (hdfs-site.xml)

<property>
	<name>dfs.replication</name>
	<value>1</value>
</property>

Decrease

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.

Move

See HDFS - Block

Under-replicated Block

See Under-replicated Block

Pipeline

During a replication, the data is pipelined from one DataNode to the next.

Process:

A client is writing data to an HDFS file with a replication factor of three
The NameNode retrieves the list of DataNodes using a replication target choosing algorithm. This list contains the DataNodes that will host a replica of that block.
The client writes data to the first DataNode.
The first DataNode starts receiving the data in portions, writes each portion to its local repository and transfers that portion to the second DataNode in the list.
The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode.
Finally, the third DataNode writes the data to its local repository.

A DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline.