HDFS Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

Historically, HDFS was built as infrastructure for the Apache Nutch web search engine project, and it has since become the most widely used distributed file system in the Apache ecosystem. It has many similarities with existing distributed file systems, but also some significant differences. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large data sets. To enable streaming access to file system data, HDFS relaxes a few POSIX requirements.

Features/Goals of HDFS:

1. Fault Tolerance : Assume your computer uses one hard disk and that, in your experience, a disk may crash at some point, perhaps once a year. If you have 365 such computers, you can therefore expect roughly one disk failure per day.

In multi-node data platforms, hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. It is therefore highly probable that some component of HDFS is non-functional on any given day. Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
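The failure estimate above is a simple expected-value calculation; the sketch below just restates it in code, using the illustrative figures from the example (365 machines, one disk failure per disk per year), not measured failure rates.

```java
public class FailureEstimate {
    public static void main(String[] args) {
        int machines = 365;                  // cluster size from the example above
        double failuresPerDiskPerYear = 1.0; // assumed: each disk fails about once a year
        double expectedFailuresPerDay = machines * failuresPerDiskPerYear / 365.0;
        // With 365 machines and one failure per disk per year,
        // the cluster sees roughly one disk failure every day.
        System.out.printf("Expected disk failures per day: %.2f%n", expectedFailuresPerDay);
    }
}
```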

2. Streaming Data Access : HDFS applications are not general-purpose applications that run on general-purpose file systems; they need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use, so the emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that are not needed for bulk data processing applications, and POSIX semantics have been relaxed in a few key areas to increase data throughput rates.
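A minimal sketch of streaming access using the Hadoop Java client (org.apache.hadoop.fs): the application opens a file and reads it sequentially from start to finish, with no random updates. The NameNode URI and file path are hypothetical.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamRead {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/data/events.log"; // hypothetical file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // Open the file and stream its contents sequentially to stdout.
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```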

3. Petabyte-Scale Data Sets : Applications that work with unstructured data usually have huge data sets; a typical file in HDFS is gigabytes to terabytes in size. HDFS is therefore tuned to support large files: it should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support millions of files.

4. Move Your Computation Closer To Data : In a large distributed system, the data may not reside on the node that requests a computation on it. Moving the data to the computation engine costs time and network bandwidth.

A more efficient approach is to move the computation to the machine where the data is stored. This minimizes network congestion and increases the overall throughput of the system. HDFS provides interfaces for applications to move themselves closer to where the data is located.
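One such interface is block-location metadata: a client (or a scheduler such as MapReduce or YARN) can ask the NameNode which DataNodes hold each block of a file and then place the computation on, or near, one of those hosts. A sketch using the public FileSystem API; the path is hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereIsMyData {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/data/events.log"; // hypothetical file
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        FileStatus status = fs.getFileStatus(new Path(uri));

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // A locality-aware scheduler would try to run the task on one of these hosts.
            System.out.printf("offset %d, length %d, hosts %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```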

5. Portability Across Multiple Platforms : HDFS provides an abstraction over the underlying machine's file system and has been designed to be easily portable from one platform to another.

6. Simple Data Management Model: Updating a file stored on disk in place involves a lot of housekeeping work, such as estimating the new size, deleting the old data, creating the new data in place or elsewhere on disk, and allocating the required chunks of storage. HDFS simplifies these operations with a write-once-read-many access model for files: a file, once created, written, and closed, need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access.
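A minimal sketch of the write-once pattern with the Java API: the file is created, written in one pass, and closed; from then on it is only opened for streaming reads rather than updated in place. The path is hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnce {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/reports/2024-01-01.csv"; // hypothetical file
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());

        // Create the file and write its contents in a single pass.
        try (FSDataOutputStream out = fs.create(new Path(uri))) {
            out.writeBytes("id,value\n1,42\n");
        }
        // The file is now closed; readers open it for sequential reads.
        // There is no API for rewriting bytes in the middle of an existing file.
    }
}
```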