Unstructured data processing: Relational databases vs. Hadoop
Let's discuss what unstructured data is and why Hadoop is preferred for it. Data that does not have a fixed schema is generally classified as unstructured or semi-structured. Typical candidates are textual documents, image files, and audio and video files.
What do relational databases support?
Relational database management systems are optimized for structured data only. They rely extensively on the schema when storing, planning, and retrieving data, and pre-collected statistics play a major role in generating the access plan for user queries. Relational databases try to support unstructured data in the form of LOBs (Large Objects), which can hold files a few GB in size. However, most relational databases do not support statistics collection, indexing, or access-plan optimization for LOBs. If the DBMS supports UDFs (User Defined Functions), the only way to process unstructured data is to write UDFs that read and analyze the LOB data. This results in limited capabilities and a performance hit, so Hadoop, as a generalized framework, looks like a cheaper and more efficient alternative.
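To make the limitation concrete, here is a minimal sketch of the UDF approach using Python's built-in sqlite3 module (the table and function names are illustrative, not from any real system). Because the optimizer knows nothing about the LOB contents, every query must invoke the function on each full blob, i.e. a full scan each time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body BLOB)")
conn.executemany("INSERT INTO docs (body) VALUES (?)",
                 [(b"hadoop handles unstructured data",),
                  (b"relational schemas need structure",)])

def contains_word(blob, word):
    """User-defined function: naive substring search inside a LOB.

    The database cannot index this predicate, so it runs the function
    against every row's blob on every query.
    """
    return 1 if word.encode() in blob else 0

# Register the Python function so SQL queries can call it.
conn.create_function("contains_word", 2, contains_word)

rows = conn.execute(
    "SELECT id FROM docs WHERE contains_word(body, 'hadoop') = 1").fetchall()
# rows -> [(1,)]: only the first document matches
```

The query works, but no index or statistics can help it, which is exactly the bottleneck described above.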
How does Hadoop help us?
The basic MapReduce framework does not offer any built-in advantage when dealing with unstructured data: we have to write our own specialized business logic for each data type. For text files, say, we need a parser or summarization algorithm to extract knowledge from the raw data, and we can index textual data with simple structures like inverted indexes or tries. These operations are CPU intensive, but since a Hadoop installation is cheap, we can have multiple nodes processing the data in parallel.
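The inverted-index idea mentioned above maps naturally onto MapReduce. A minimal sketch in plain Python (the document IDs and sample texts are made up for illustration): the map step emits (word, doc_id) pairs, and the reduce step groups document IDs by word.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # "Map" step: emit a (word, doc_id) pair for every word in the document.
    for word in text.lower().split():
        yield word, doc_id

def reduce_phase(pairs):
    # "Reduce" step: group document ids by word to form the inverted index.
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return index

docs = {1: "Hadoop stores unstructured data",
        2: "Hadoop processes data in parallel"}

pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
# index["hadoop"] -> {1, 2}; index["parallel"] -> {2}
```

In a real Hadoop job the map and reduce functions would run on separate nodes over file splits, but the per-record logic is this simple; the framework only supplies the distribution and shuffling.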
Similarly, for image files we need advanced pattern recognition algorithms for graphical data; here, processing the data may mean sophisticated transformation logic to manipulate pixels. For search operations there are two approaches: we can generate a binary pattern from the query image and compare it with stored binary data, or we can create graphical pixel patterns from the query image and compare them against a pre-processed image repository.
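A toy sketch of the binary-pattern approach, using an average-hash-style signature (the 3x3 "thumbnails" below are invented grayscale values, not real image data): each pixel becomes 1 if it is brighter than the image's mean, and two images are compared by the Hamming distance between their patterns.

```python
def average_hash(pixels):
    # Build a binary pattern: 1 where the pixel is above the mean brightness.
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming_distance(h1, h2):
    # Count positions where the two binary patterns disagree.
    return sum(a != b for a, b in zip(h1, h2))

# Hypothetical 3x3 grayscale thumbnails (values 0-255), flattened row-major.
query  = [200, 200, 10, 200, 200, 10, 10, 10, 10]
stored = [190, 210,  5, 195, 205, 15, 20,  5, 10]

distance = hamming_distance(average_hash(query), average_hash(stored))
# distance -> 0: the two thumbnails have the same bright/dark pattern,
# so they are likely the same image despite small pixel differences
```

Production systems would first decode and downscale each image and use far more robust signatures, but the comparison step each mapper runs is essentially this.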
For audio and video files, any analysis operation is difficult. We first need decoders to expand the specific audio/video format into a comprehensible form, and different decoders are required for mp3, rm, mpeg, avi, mkv, and mp4. Splitting and comparing the audio/video is another big task; for video files, we need sophisticated logic for extracting frames and analyzing them. Without a proper algorithm, these operations can become very expensive in terms of machine resources. Moreover, most software available in the market for processing audio and video is licensed and needs customization before it can be used by Hadoop applications.
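Once frames have been decoded, the per-frame analysis can be quite simple. A hedged sketch of one such task, scene-cut detection, on synthetic 4-pixel "frames" (real frames would come from a decoder and hold millions of pixels): flag any point where consecutive frames differ sharply.

```python
def frame_difference(a, b):
    # Mean absolute pixel difference between two equally sized frames.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def detect_cuts(frames, threshold=50):
    # Flag indices where a frame differs sharply from its predecessor,
    # suggesting a scene cut.
    return [i for i in range(1, len(frames))
            if frame_difference(frames[i - 1], frames[i]) > threshold]

# Synthetic frames: a steady dark scene, then an abrupt change to a bright one.
frames = [[10, 10, 10, 10],
          [12, 11, 10, 9],
          [200, 210, 205, 198],
          [201, 209, 204, 199]]

cuts = detect_cuts(frames)
# cuts -> [2]: the jump happens at the third frame
```

The hard part in practice is not this comparison but the decoding before it, which is exactly where the licensing and format issues above bite.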
Apache Lucene can be readily used for indexing and analyzing textual unstructured data, and it is already in use by some commercial data processing systems. A moderate level of image processing can be expected from LIRE, which is built on top of Lucene.
We do not yet see efficient open source projects that provide audio/video processing capabilities for Hadoop; these are still under research in many organizations. For now we may be bound to build our own implementations or use licensed products that support only a few formats.