HDFS Metadata Management
Before looking at how HDFS manages its metadata, we need to understand four terms: EditLog, FsImage, BlockReport and CheckPoint.
EditLog: The NameNode uses a transaction log called the EditLog to record every change made to the file system metadata. For example, when a new file is created in HDFS, the NameNode appends a record of that change to the EditLog. Likewise, when a user changes the replication factor of a file, a new record is appended to the EditLog. The EditLog is stored as a file in the local file system of the NameNode machine.
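To make the idea of a transaction log concrete, here is a minimal sketch of an append-only log. It is only an illustration: the class name, the JSON-lines record layout and the operation names ("create", "set_replication") are assumptions made for this example and are not the real HDFS on-disk format.

```python
import json
import os
import tempfile


class EditLog:
    """Toy append-only transaction log (hypothetical, not the HDFS format)."""

    def __init__(self, path):
        self.path = path
        self.next_txid = 1

    def append(self, op, **fields):
        # Every metadata change becomes one record with an increasing txid.
        record = {"txid": self.next_txid, "op": op, **fields}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the record to disk before returning
        self.next_txid += 1
        return record["txid"]

    def read_all(self):
        # Return every logged record in the order it was written.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


# Usage: the two changes mentioned above, recorded in order.
log_dir = tempfile.mkdtemp()
edit_log = EditLog(os.path.join(log_dir, "edits.log"))
edit_log.append("create", path="/data/a.txt", replication=3)
edit_log.append("set_replication", path="/data/a.txt", replication=2)
records = edit_log.read_all()
```

Records are only ever appended, never rewritten, so the log preserves the exact order in which the metadata changes happened.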
FsImage: The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the FsImage. Like the EditLog, the FsImage is kept in the NameNode's local file system.
BlockReport: When a DataNode starts up, it scans its local file system, generates a list of all the HDFS data blocks that correspond to its local files, and sends this list to the NameNode. This list is the BlockReport.
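The scan described above can be sketched as follows. This is a simplified illustration, not the DataNode's actual code: the function name is hypothetical, and the sketch assumes each block is a local file named "blk_<id>" and that any other file (for example a ".meta" checksum file) should be skipped.

```python
import os
import tempfile


def build_block_report(data_dir):
    """Scan data_dir and return a sorted list of (block_id, length) pairs.

    Assumption: each block is a local file named "blk_<id>"; other files
    (such as ".meta" checksum files) are ignored.
    """
    report = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith("blk_") and not name.endswith(".meta"):
                block_id = int(name[len("blk_"):])
                length = os.path.getsize(os.path.join(root, name))
                report.append((block_id, length))
    return sorted(report)


# Usage: a fake data directory with two block files and two unrelated files.
data_dir = tempfile.mkdtemp()
for name, payload in [
    ("blk_1001", b"hello"),
    ("blk_1002", b"hdfs!!"),
    ("blk_1001_1.meta", b"x"),
    ("notes.txt", b"ignored"),
]:
    with open(os.path.join(data_dir, name), "wb") as f:
        f.write(payload)

block_report = build_block_report(data_dir)
```

In a real cluster the resulting list would be sent to the NameNode over the network; here it is simply returned to the caller.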
CheckPoint: The NameNode keeps an image of the entire file system namespace and the file Blockmap in memory. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this updated version out to a new FsImage on disk. It can then delete the old EditLog, because its transactions have already been applied to the persistent FsImage. This process is called a checkpoint. In the implementation described here, a checkpoint occurs only when the NameNode starts up; later Hadoop releases can also checkpoint periodically, for example through a Secondary NameNode.
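The checkpoint steps above (load the image, replay the log, write a new image, discard the old log) can be sketched as below. This is a toy model under stated assumptions: the namespace is a plain dictionary, both files use JSON, and the operation names and function names are hypothetical. The real FsImage and EditLog use their own binary formats.

```python
import json
import os
import tempfile


def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace (a dict)."""
    if edit["op"] == "create":
        namespace[edit["path"]] = {"replication": edit["replication"]}
    elif edit["op"] == "set_replication":
        namespace[edit["path"]]["replication"] = edit["replication"]
    elif edit["op"] == "delete":
        namespace.pop(edit["path"], None)
    else:
        raise ValueError(f"unknown op: {edit['op']}")


def checkpoint(fsimage_path, editlog_path):
    # 1. Load the last persisted image (empty namespace if none exists yet).
    namespace = {}
    if os.path.exists(fsimage_path):
        with open(fsimage_path, encoding="utf-8") as f:
            namespace = json.load(f)
    # 2. Replay every logged transaction in order.
    if os.path.exists(editlog_path):
        with open(editlog_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    apply_edit(namespace, json.loads(line))
    # 3. Write the merged image to a temp file, then swap it into place.
    tmp_path = fsimage_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(namespace, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, fsimage_path)
    # 4. The old edits are now in the image, so discard them
    #    (this sketch truncates the log file to zero length).
    open(editlog_path, "w").close()
    return namespace


# Usage: an existing image with one file, plus three logged edits.
work_dir = tempfile.mkdtemp()
fsimage_path = os.path.join(work_dir, "fsimage.json")
editlog_path = os.path.join(work_dir, "edits.log")
with open(fsimage_path, "w", encoding="utf-8") as f:
    json.dump({"/old.txt": {"replication": 3}}, f)
edits = [
    {"op": "create", "path": "/a.txt", "replication": 3},
    {"op": "set_replication", "path": "/a.txt", "replication": 2},
    {"op": "delete", "path": "/old.txt"},
]
with open(editlog_path, "w", encoding="utf-8") as f:
    for edit in edits:
        f.write(json.dumps(edit) + "\n")

namespace = checkpoint(fsimage_path, editlog_path)
```

Writing the new image to a temporary file and renaming it is a design choice of this sketch: if the process dies partway through, the previous image is still intact and the log can be replayed again.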
The Communication Protocols
All HDFS communication protocols are layered on top of TCP/IP.
Client-NameNode: A client establishes a connection to a TCP port on the NameNode machine and talks to the NameNode using the Client Protocol.
DataNode-NameNode: The DataNodes talk to the NameNode using the DataNode Protocol.
RPC: A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.
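The request/response direction described above can be illustrated with a small in-process model. Everything here is hypothetical: the class, the method names and the rpc_call helper are stand-ins for this example, not the Hadoop RPC API, and no real network is involved. The sketch also assumes that instructions for a DataNode are returned in the reply to that DataNode's own request, which is one way a server that never initiates calls can still direct its callers.

```python
class ToyNameNode:
    """Passive server: it only answers calls and never dials out."""

    def __init__(self):
        self.block_locations = {}   # block_id -> set of DataNode ids
        self.pending_commands = {}  # datanode_id -> list of queued commands

    # --- handlers reached through the DataNode Protocol ---
    def block_report(self, datanode_id, block_ids):
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode_id)
        return "ok"

    def heartbeat(self, datanode_id):
        # Queued instructions travel back in the reply to the caller's request.
        return self.pending_commands.pop(datanode_id, [])

    # --- handler reached through the Client Protocol ---
    def get_block_locations(self, block_id):
        return sorted(self.block_locations.get(block_id, set()))


def rpc_call(server, method, *args):
    """Stand-in for the RPC layer: dispatch a named request to the server."""
    return getattr(server, method)(*args)


# Usage: DataNodes and a client issue requests; the NameNode only replies.
namenode = ToyNameNode()
rpc_call(namenode, "block_report", "dn1", [1001, 1002])
rpc_call(namenode, "block_report", "dn2", [1001])
namenode.pending_commands["dn2"] = ["replicate 1002"]

locations = rpc_call(namenode, "get_block_locations", 1001)
commands = rpc_call(namenode, "heartbeat", "dn2")
no_more_commands = rpc_call(namenode, "heartbeat", "dn2")
```

Note that ToyNameNode holds no reference to any DataNode or client object, so it has no way to start a conversation; it can only answer the calls it receives.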