File Management In HDFS
Deleting a file:
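- When a file is deleted by a user or an application, it is not removed immediately. Instead, HDFS first renames it into the /trash directory (in current Hadoop releases, with trash enabled, this is a per-user .Trash directory under the user's home directory).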
- The file can be restored as long as it remains in /trash, where it is kept for a configurable retention period. Once that period expires, the NameNode removes the file entry from the HDFS namespace.
- Removing the entry frees the blocks associated with the file. By design, there is a delay between the time a file is deleted and the time the corresponding space is freed in HDFS.
Undeleting a file:
- A deleted file can be restored as long as it remains in the /trash directory. To undelete a file, the user navigates to the /trash directory and moves the required file back out of it.
- The /trash directory contains only the latest version of the deleted file.
- The /trash directory is similar to any other directory with some specific policies to automatically delete files from this directory.
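
The delete, restore, and expiry behaviour described above can be sketched with a small toy model. This is only an illustration under stated assumptions, not the Hadoop API: the class ToyNamespace, its method names, and the retention_seconds setting are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyNamespace:
    """Toy stand-in for the NameNode namespace (hypothetical, not Hadoop code)."""
    retention_seconds: float  # hypothetical name for the trash retention period
    files: Dict[str, List[str]] = field(default_factory=dict)  # path -> block IDs
    trash: Dict[str, Tuple[List[str], float]] = field(default_factory=dict)
    freed_blocks: List[str] = field(default_factory=list)

    def delete(self, path: str, now: float) -> None:
        # Deleting only moves the entry into trash; its blocks stay allocated.
        blocks = self.files.pop(path)
        # Model assumption: trash keeps only the latest deleted version of a
        # path, so an older trashed copy is dropped and its blocks are freed.
        old = self.trash.pop(path, None)
        if old is not None:
            self.freed_blocks.extend(old[0])
        self.trash[path] = (blocks, now)

    def expire(self, now: float) -> None:
        # Entries past the retention period leave the namespace; only then
        # are their blocks freed.
        for path, (blocks, deleted_at) in list(self.trash.items()):
            if now - deleted_at >= self.retention_seconds:
                del self.trash[path]
                self.freed_blocks.extend(blocks)

    def restore(self, path: str, now: float) -> bool:
        # Restoring works only while the entry is still in trash.
        self.expire(now)
        entry = self.trash.pop(path, None)
        if entry is None:
            return False
        self.files[path] = entry[0]
        return True


ns = ToyNamespace(retention_seconds=3600)
ns.files["/data/a.log"] = ["blk_1", "blk_2"]
ns.files["/data/b.log"] = ["blk_3"]

ns.delete("/data/a.log", now=0)
ns.delete("/data/b.log", now=0)
freed_right_after_delete = list(ns.freed_blocks)    # nothing freed yet

restored_in_time = ns.restore("/data/a.log", now=1800)  # inside the retention period
restored_too_late = ns.restore("/data/b.log", now=4000)  # after expiry
```

In this sketch the space for /data/b.log is reclaimed only when expire runs after the retention period, which mirrors the delay between deleting a file and seeing free space.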
Modifying the replication factor:
- The setReplication() API call is used to set the replication factor of a file in HDFS.
- When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted.
- When the replication factor is increased, the NameNode identifies additional DataNodes where the data blocks can be stored.
- This information is passed to each affected DataNode in the reply to its next Heartbeat message.
- The DataNode then removes or adds the corresponding blocks, and the available space in the cluster is updated. There is a configurable time delay between the replication request and the actual data operation in the cluster.
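
The replication steps above can be sketched as a toy model in which the NameNode only plans the change and each DataNode acts when its next heartbeat is answered. Everything here is an assumption made for illustration: the function names plan_replication_change and apply_on_heartbeat are hypothetical, replicas are chosen in plain alphabetical order rather than by HDFS's real placement policy, and a DataNode "adds" a block directly instead of copying it from another node.

```python
from typing import Dict, List, Set, Tuple

Command = Tuple[str, str]  # (action, block ID), where action is "add" or "delete"


def plan_replication_change(
    replicas: Dict[str, Set[str]],  # block ID -> DataNodes currently holding it
    all_nodes: List[str],
    target: int,
) -> Dict[str, List[Command]]:
    """NameNode side: queue commands per DataNode without touching any data."""
    commands: Dict[str, List[Command]] = {node: [] for node in all_nodes}
    for block, holders in sorted(replicas.items()):
        current = sorted(holders)
        if len(current) > target:
            # Excess replicas: pick the surplus holders for deletion.
            for node in current[target:]:
                commands[node].append(("delete", block))
        elif len(current) < target:
            # Too few replicas: pick nodes that do not hold the block yet.
            candidates = [n for n in all_nodes if n not in holders]
            for node in candidates[: target - len(current)]:
                commands[node].append(("add", block))
    return commands


def apply_on_heartbeat(
    node: str,
    commands: Dict[str, List[Command]],
    replicas: Dict[str, Set[str]],
) -> None:
    """DataNode side: carry out the commands received with the heartbeat reply."""
    for action, block in commands.pop(node, []):
        if action == "delete":
            replicas[block].discard(node)
        else:
            replicas[block].add(node)


nodes = ["dn1", "dn2", "dn3", "dn4"]
replicas = {"blk_1": {"dn1", "dn2", "dn3"}}

# Reduce the replication factor from 3 to 2.
pending = plan_replication_change(replicas, nodes, target=2)
before_heartbeat = set(replicas["blk_1"])  # unchanged until heartbeats arrive
for node in nodes:
    apply_on_heartbeat(node, pending, replicas)
after_reduce = set(replicas["blk_1"])

# Increase the replication factor from 2 to 4.
pending = plan_replication_change(replicas, nodes, target=4)
for node in nodes:
    apply_on_heartbeat(node, pending, replicas)
after_increase = set(replicas["blk_1"])
```

The gap between before_heartbeat and after_reduce is the point of the sketch: the request is recorded at once, but the block data changes only when each DataNode next checks in.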