HDFS Data Organization

Staging:

  • A client request to create a file does not reach the NameNode immediately. Initially, the HDFS client caches the file data in a temporary local file. Application writes are transparently redirected to this temporary local file.
  • Once the local file accumulates more than one HDFS block's worth of data, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block.
  • Then the client flushes the block of data from the local temporary file to the specified DataNode.
  • When a file is closed, the remaining unflushed data in the temporary local file is transferred to the DataNode.
  • The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

This approach suits HDFS applications, which need streaming writes to files. If a client wrote directly to a remote file without any client-side buffering, network speed and network congestion would considerably impact throughput. This approach is not POSIX-compliant; the POSIX requirements have been relaxed to achieve higher I/O performance.
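
From the application's point of view, all of this staging happens behind an ordinary output stream. The following sketch, assuming the standard Hadoop FileSystem Java API and a hypothetical path /user/demo/sample.txt, shows a client write; the client-side buffering, the deferred contact with the NameNode and the final flush on close() are handled inside the client library.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;

    public class StagedWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // HDFS client for the default file system

            Path file = new Path("/user/demo/sample.txt"); // hypothetical path
            try (FSDataOutputStream out = fs.create(file)) {
                // Each write() lands in the client-side buffer first; the NameNode is
                // contacted and a block is allocated only as whole blocks accumulate.
                for (int i = 0; i < 1000; i++) {
                    out.write(("record " + i + "\n").getBytes(StandardCharsets.UTF_8));
                }
            } // close() flushes the remaining buffered data and asks the NameNode
              // to commit the file; only then is the file creation durable.
        }
    }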

Replication Pipeline:

  • Assume the HDFS file has a replication factor of N. If the client wrote the data to all N DataNodes simultaneously, the synchronization overhead would be high. Hence, a replication pipelining mechanism is used.
  • When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block.
  • The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list.
  • The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode.
  • Finally, the third DataNode writes the data to its local repository. A DataNode can thus be receiving data from the previous node in the pipeline and, at the same time, forwarding data to the next node in the pipeline. In this way, the data is pipelined from one DataNode to the next (a simplified sketch of this receive-store-forward loop follows the list).
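
The sketch below illustrates the forwarding loop each DataNode in the pipeline performs. It is not the actual DataNode code (the real block-transfer protocol adds packet headers, checksums and acknowledgements), and the stream names are placeholders; it only shows how ~4 KB portions are stored locally and forwarded downstream before the whole block has arrived.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class PipelineRelay {

        // Receive a block from the upstream node in ~4 KB portions, persist each
        // portion locally, and forward it to the downstream node (if any) before
        // the whole block has arrived. A null downstream means this is the last
        // DataNode in the pipeline.
        static void relay(InputStream fromUpstream,
                          OutputStream toLocalStorage,
                          OutputStream toDownstream) throws IOException {
            byte[] portion = new byte[4 * 1024];
            int n;
            while ((n = fromUpstream.read(portion)) != -1) {
                toLocalStorage.write(portion, 0, n);      // write to the local repository
                if (toDownstream != null) {
                    toDownstream.write(portion, 0, n);    // forward to the next DataNode
                }
            }
            toLocalStorage.flush();
            if (toDownstream != null) {
                toDownstream.flush();
            }
        }
    }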

Data Blocks:

  • HDFS is suited to data that is written once and rarely updated.
  • Applications will write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds.
  • A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into chunks of 64 MB.
  • Ideally each chunk should be placed on a different DataNode.
  • The block size can be configured, as illustrated in the sketch below.
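
For illustration, the following sketch requests a specific block size when creating files and then lists where the blocks of an existing file landed. It assumes the standard Hadoop FileSystem Java API; the property name dfs.blocksize is the Hadoop 2.x+ form (older 1.x releases used dfs.block.size), and the 128 MB value and the path /user/demo/big.dat are examples only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.util.Arrays;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default block size for files created by this client.
            // "dfs.blocksize" is the Hadoop 2.x+ property name; 1.x used "dfs.block.size".
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB, an example value
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/big.dat");          // hypothetical existing file
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());

            // Each entry describes one chunk of the file and the DataNodes holding its replicas.
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + Arrays.toString(b.getHosts()));
            }
        }
    }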