NameNode And DataNode

What is a NameNode?

The NameNode is the central service where HDFS keeps the metadata for all the data stored in the file system. It stores:

  • A directory tree of all the files in the file system.
  • The logical location of blocks across the cluster.
  • The mapping that records which blocks belong to which file.

A cluster typically has one NameNode. Having a single NameNode in a cluster simplifies the architecture. The NameNode serves as the arbitrator and repository for all HDFS metadata, but it never operates on the actual file data.
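The three kinds of metadata listed above can be pictured as three in-memory maps. The following is a minimal Python sketch for illustration only; the class name `NameNodeMetadata`, its fields, and the `add_file` helper are hypothetical and do not correspond to the real Hadoop classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NameNodeMetadata:
    """Toy in-memory model of NameNode metadata (hypothetical, not Hadoop's API)."""
    # Directory tree: directory path -> names of its children.
    tree: Dict[str, Set[str]] = field(default_factory=lambda: {"/": set()})
    # File path -> ordered list of block IDs that make up the file.
    file_blocks: Dict[str, List[int]] = field(default_factory=dict)
    # Block ID -> names of the DataNodes holding a copy of that block.
    block_locations: Dict[int, Set[str]] = field(default_factory=dict)

    def add_file(self, path: str, blocks: Dict[int, Set[str]]) -> None:
        """Register a file, its blocks, and where each block lives."""
        parent = "/"
        parts = path.strip("/").split("/")
        # Create any missing intermediate directories.
        for name in parts[:-1]:
            self.tree.setdefault(parent, set()).add(name)
            parent = parent.rstrip("/") + "/" + name
            self.tree.setdefault(parent, set())
        self.tree.setdefault(parent, set()).add(parts[-1])
        self.file_blocks[path] = list(blocks)
        for block_id, nodes in blocks.items():
            self.block_locations[block_id] = set(nodes)


meta = NameNodeMetadata()
meta.add_file("/logs/app/2024.log", {1: {"dn1", "dn2"}, 2: {"dn2", "dn3"}})
```

Note that nothing in this structure holds file contents: the maps only describe where the data can be found.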

What does a NameNode process do?

  1. Manage the file system metadata:
    The NameNode performs file system namespace operations and continuously updates its records with the latest state of the files.
  2. Provide information for file reads:
    Client applications must request file information from the NameNode whenever they need to locate a file. The NameNode sends them a list of the DataNodes that hold the file's blocks.
  3. Arbitrate write access:
    When multiple users intend to update a file, the NameNode decides which of them is granted write access. We will revisit later how HDFS processes write/update requests.
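The read lookup and the write arbitration described above can be sketched as follows. This is a simplified model, not the Hadoop API: the class `ToyNameNode` and its methods are hypothetical, and the first-come, first-served policy for granting write access is an assumption made for illustration.

```python
class ToyNameNode:
    """Hypothetical stand-in for the NameNode; not the Hadoop API."""

    def __init__(self):
        self.file_blocks = {}      # file path -> list of block IDs
        self.block_locations = {}  # block ID -> list of DataNode names
        self.write_holders = {}    # file path -> client holding write access

    def get_block_locations(self, path):
        """Read path: return (block ID, DataNodes) pairs, never file data."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

    def request_write(self, path, client):
        """Grant write access to at most one client per file at a time."""
        holder = self.write_holders.setdefault(path, client)
        return holder == client

    def release_write(self, path, client):
        """Give up write access so that another client can obtain it."""
        if self.write_holders.get(path) == client:
            del self.write_holders[path]


nn = ToyNameNode()
nn.file_blocks["/data/a.txt"] = [1, 2]
nn.block_locations = {1: ["dn1", "dn2"], 2: ["dn2", "dn3"]}

locations = nn.get_block_locations("/data/a.txt")
first = nn.request_write("/data/a.txt", "alice")   # granted
second = nn.request_write("/data/a.txt", "bob")    # refused while alice holds access
nn.release_write("/data/a.txt", "alice")
third = nn.request_write("/data/a.txt", "bob")     # granted after the release
```

The client then contacts the returned DataNodes directly for the block contents; the NameNode only answers the "where" question.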

What is a DataNode?

A DataNode stores the actual HDFS data. Internally, a file is split into one or more blocks, and these blocks are stored across a set of DataNodes. The DataNode has no knowledge of HDFS files; it stores each block of HDFS data as a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates sub-directories as needed, because the local file system might not efficiently support a huge number of files in a single directory.
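To make the idea of spreading block files over sub-directories concrete, here is a small Python sketch. The two-level layout, the `FANOUT` value, and the `blk_` and `subdir` naming are assumptions chosen for illustration; the source does not specify the heuristic HDFS actually uses.

```python
from pathlib import PurePosixPath

FANOUT = 32  # assumed number of sub-directories per level


def block_file_path(root: str, block_id: int) -> PurePosixPath:
    """Map a block ID to a local file path under a two-level directory tree."""
    level1 = (block_id // FANOUT) % FANOUT
    level2 = block_id % FANOUT
    return (PurePosixPath(root) / f"subdir{level1}" / f"subdir{level2}"
            / f"blk_{block_id}")


path_small = block_file_path("/data/dn", 5)
path_large = block_file_path("/data/dn", 1000)
```

With this scheme consecutive block IDs land in different leaf directories, so no single directory accumulates all the block files.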

What does a DataNode process do?

  1. Keeps track of the blocks stored on the node and reports them to the NameNode.
  2. Serves read and write requests from file system clients. It also creates, deletes, and replicates blocks on instruction from the NameNode.
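The DataNode's bookkeeping of its own blocks can be sketched as a scan of its local storage directory. This is an illustrative model only: the function `build_block_report`, the `blk_<id>` file naming, and the report format (block ID to size in bytes) are assumptions, not the actual HDFS wire format.

```python
import os
import tempfile


def build_block_report(storage_dir: str) -> dict:
    """Walk the local storage directory and list every block file found."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(storage_dir):
        for name in filenames:
            if name.startswith("blk_"):
                block_id = int(name[len("blk_"):])
                report[block_id] = os.path.getsize(os.path.join(dirpath, name))
    return report


# Usage: create two fake block files in a temporary directory and report them.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "subdir0"))
    for block_id, payload in [(1, b"hello"), (2, b"hdfs!!")]:
        with open(os.path.join(root, "subdir0", f"blk_{block_id}"), "wb") as f:
            f.write(payload)
    report = build_block_report(root)
```

The report contains only block IDs and sizes, which matches the point made earlier that a DataNode knows about blocks but not about the files they belong to.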

Limitations of NameNode:

  1. Single point of failure:
    The NameNode is a single point of failure in an HDFS cluster, which threatens the high-availability goal of the HDFS design. When the NameNode goes down, the metadata about files becomes unavailable, so in practice the file system goes offline. To alleviate the problem, Hadoop allows a backup NameNode to be kept, which should be hosted on a separate machine. It regularly creates checkpoints of the current namespace state by merging the edits file into the file system image. When the primary NameNode goes offline, the backup NameNode can take over metadata management and access requests, provided the cluster is configured for failover.
  2. Bottleneck for parallel write requests:
    Multiple clients can safely read files in parallel. When parallel write requests arrive, however, the NameNode synchronizes the access requests. In a highly scalable system, the NameNode may become overloaded while serving read and write requests, which can significantly affect system performance.
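The checkpointing mentioned under the first limitation, merging the edits file into the file system image, amounts to replaying a log of operations on a snapshot. Below is a minimal Python sketch under stated assumptions: the image is modeled as a dictionary from file path to block IDs, and the operation names `create`, `delete`, and `rename` are hypothetical simplifications of what a real edit log records.

```python
def apply_edits(image: dict, edits: list) -> dict:
    """Return a new namespace image with the logged operations replayed."""
    merged = dict(image)  # leave the previous image untouched
    for op, path, *args in edits:
        if op == "create":
            merged[path] = list(args[0])        # args[0]: block IDs of the new file
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)  # args[0]: new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


image = {"/a.txt": [1, 2]}
edits = [
    ("create", "/b.txt", [3]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
checkpoint = apply_edits(image, edits)
```

After a checkpoint the replayed edits are no longer needed for recovery, which keeps the log short and speeds up a restart.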

Best practices for NameNode:

  • Allocate adequate memory to the machine that hosts the NameNode, as this can significantly affect system performance. Where memory is limited, use compressed object pointers to keep the JVM heap size lower.
  • Store the NameNode metadata in multiple directories.
  • Keep a copy of the transaction logs on a separate disk from the image.
  • Monitor the disk space available to the NameNode and add more storage before free space runs low. Otherwise the file system may become unavailable.
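The disk space monitoring recommended above can be approximated with a small check run periodically on the NameNode host. This is a sketch, not an official Hadoop tool: the function names and the 20% threshold are assumptions, and the paths passed in would be whatever directories your NameNode is configured to use.

```python
import shutil


def free_space_ok(path: str, min_free_fraction: float = 0.2) -> bool:
    """Return True if the volume holding `path` has enough free space."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False  # treat an unreadable or empty volume as unhealthy
    return usage.free / usage.total >= min_free_fraction


def check_metadata_dirs(paths, min_free_fraction=0.2):
    """Return the subset of paths whose volumes are running low on space."""
    return [p for p in paths if not free_space_ok(p, min_free_fraction)]


# Usage: check the current directory against the assumed 20% threshold.
low_on_space = check_metadata_dirs(["."])
```

A non-empty result would be the signal to add storage before the NameNode runs out of room for its image and transaction logs.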