File Management in HDFS

Deleting a file:

  • When a user or an application deletes a file, HDFS does not remove it immediately. Instead, HDFS first renames the file into the /trash directory.
  • The file can be restored as long as it remains in /trash, where it stays for a configurable retention period. When that period expires, the NameNode removes the file's entry from the HDFS namespace.
  • Removing the entry frees the blocks associated with the file. By design, there is a delay between the time a file is deleted and the time the corresponding space becomes free in HDFS.
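
The deletion flow above can be sketched as a small in-memory model. This is an illustration only, not the HDFS API: the class ToyNamespace, its methods, and the 3600-second retention value are all hypothetical names and numbers chosen for the example.

```python
class ToyNamespace:
    """Toy in-memory model of delete-to-trash; NOT the real HDFS API."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds  # assumed retention period
        self.files = {}         # path -> list of block ids
        self.trash = {}         # original path -> (blocks, deletion time)
        self.freed_blocks = []  # blocks whose space has been released

    def create(self, path, blocks):
        self.files[path] = list(blocks)

    def delete(self, path, now):
        # Deleting is only a rename into trash; no blocks are freed yet.
        self.trash[path] = (self.files.pop(path), now)

    def expire_trash(self, now):
        # Periodic cleanup: drop entries older than the retention period
        # and only then release their blocks.
        for path, (blocks, deleted_at) in list(self.trash.items()):
            if now - deleted_at >= self.retention_seconds:
                del self.trash[path]
                self.freed_blocks.extend(blocks)


ns = ToyNamespace(retention_seconds=3600)
ns.create("/data/report.csv", ["blk_1", "blk_2"])
ns.delete("/data/report.csv", now=0)

ns.expire_trash(now=1800)            # still within retention
freed_early = list(ns.freed_blocks)  # nothing has been freed

ns.expire_trash(now=3600)            # retention has elapsed
freed_late = list(ns.freed_blocks)   # both blocks are now released
```

The gap between delete() and the second expire_trash() call is the model's version of the delay between deleting a file and seeing the free space.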

Undeleting a file:

  • A deleted file can be restored as long as it remains in the /trash directory. To undelete a file, the user navigates to /trash and retrieves the required file.
  • The /trash directory contains only the latest copy of a deleted file.
  • The /trash directory is like any other directory, except that HDFS applies specific policies that automatically delete files from it.
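
A minimal sketch of the two rules above, assuming trash is modelled as a plain dictionary keyed by the original path. The helpers move_to_trash and restore are hypothetical and exist only for this example. Restoring is a rename back out of trash, and deleting the same path twice leaves only the most recent copy.

```python
def move_to_trash(files, trash, path):
    """Rename path into trash, replacing any earlier deleted copy."""
    trash[path] = files.pop(path)


def restore(files, trash, path):
    """Rename path back out of trash; fails once the entry is gone."""
    if path not in trash:
        raise FileNotFoundError(path + " is no longer in trash")
    files[path] = trash.pop(path)


files = {"/logs/app.log": "version 1"}
trash = {}

move_to_trash(files, trash, "/logs/app.log")
files["/logs/app.log"] = "version 2"          # same path is created again
move_to_trash(files, trash, "/logs/app.log")  # replaces version 1 in trash

restore(files, trash, "/logs/app.log")        # brings back version 2 only
```

A second restore() call for the same path raises FileNotFoundError, which mirrors a file that has already left /trash.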

Modifying the replication factor:

  • The setReplication() API call sets the replication factor of a file in HDFS.
  • When the replication factor of a file is reduced, the NameNode selects the excess replicas that can be deleted.
  • When the replication factor is increased, the NameNode identifies additional DataNodes where the data blocks can be stored.
  • The NameNode passes this information to the affected DataNodes with the next Heartbeat message.
  • Each DataNode then removes or adds the corresponding blocks, and the available space in the cluster is updated. There is a configurable delay between the replication request and the actual data operation in the cluster.
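
The sequence above can be illustrated with a toy model. It is a simplification under stated assumptions, not how the NameNode is implemented: ToyNameNode, its methods, and the rule for picking replicas (alphabetical order) are hypothetical. The point it shows is that a replication change is only queued when requested and takes effect when the DataNode next sends a heartbeat.

```python
class ToyNameNode:
    """Toy model of replication changes; NOT the real NameNode logic."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.replicas = {}  # block id -> set of DataNode names holding it
        self.pending = {dn: [] for dn in self.datanodes}  # queued commands

    def set_replication(self, block, target):
        holders = sorted(self.replicas[block])
        if target < len(holders):
            # Too many replicas: mark the excess ones for deletion.
            for dn in holders[target:]:
                self.pending[dn].append(("delete", block))
        else:
            # Too few replicas: pick DataNodes that do not hold the block.
            spare = [dn for dn in self.datanodes if dn not in holders]
            for dn in spare[: target - len(holders)]:
                self.pending[dn].append(("store", block))

    def heartbeat(self, dn):
        # Queued commands reach the DataNode only on its next heartbeat.
        commands, self.pending[dn] = self.pending[dn], []
        for action, block in commands:
            if action == "delete":
                self.replicas[block].discard(dn)
            else:
                self.replicas[block].add(dn)
        return commands


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.replicas["blk_7"] = {"dn1", "dn2", "dn3"}

nn.set_replication("blk_7", 2)
before_heartbeat = set(nn.replicas["blk_7"])  # unchanged until a heartbeat
nn.heartbeat("dn3")
after_reduce = set(nn.replicas["blk_7"])

nn.set_replication("blk_7", 3)
nn.heartbeat("dn3")
after_increase = set(nn.replicas["blk_7"])
```

In this model the delay between the request and the data operation is simply the time until the affected DataNode's next heartbeat.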