How to create high-performance database systems

When creating a high-performance database system, we need to understand the relevant architecture. Since the inception of DBMS engines, data processor architectures have changed significantly. The change was mostly triggered by:

  1. Exponentially increasing volume of data.
  2. Nature of data, e.g., the shift from structured to unstructured and semi-structured data.
  3. Storage devices and the intended use of the data.

Let’s discuss the impact of the above changes so that we can predict future trends in databases.

Exponentially increasing volume of data:

The systems of the 1980s never anticipated processing terabytes of data. A few megabytes of data and a moderate number of transactions were considered a good achievement, so query optimization and file system architectures were simple. When data volumes increased, the designs had to change to support bulk data processing. When data started coming from social networking sites, a full redesign of data systems became the preferred approach, and massively parallel processing engines started becoming popular.

So DBMS engines should implement user-defined customization to support both low-volume transactional data and high-volume analytical data.
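One way to picture such customization is a per-workload configuration profile. Below is a minimal sketch; the profile fields, values, and the `engine.apply` hook are illustrative assumptions, not the API of any particular DBMS.

```python
# A sketch of user-defined workload customization: one engine, two profiles.
# All fields and values here are invented for illustration.
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    block_size: int        # bytes per I/O block
    plan_cache: bool       # cache plans for repetitive queries
    parallel_degree: int   # worker tasks per query

OLTP = WorkloadProfile(block_size=4_096, plan_cache=True, parallel_degree=1)
OLAP = WorkloadProfile(block_size=1_048_576, plan_cache=False, parallel_degree=32)

def configure(engine, workload: str) -> None:
    # engine.apply is a hypothetical hook for pushing the profile down.
    engine.apply(OLTP if workload == "transactional" else OLAP)
```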

Nature of data:

Three decades back, the information generated by data applications was atomic in nature. Applications stuck to a fixed schema to make data organization simpler, and analyzing the schema gave us the advantage of optimizing the data access plan. Gradually, applications started generating composite data that could be decomposed into simpler types and stored. But when web-scale unstructured data needed processing, we realized that such data may have no fixed schema, nor can it be decomposed into simpler types. Data processing systems adopted new designs to support efficient processing of unstructured data.

Storage devices and intended use of the data:

Some data systems intend to give real-time update/insert/delete capabilities, e.g., banking applications. There, analyzing the records is not needed, so instead of magnetic-disk-based storage they adopted SSDs or main-memory-based storage. But when processing bulk data, it is difficult to support real-time updates, so limited updates with bulk read support are characteristic of analytical engines. Archive systems and parallel processing systems like Hadoop try to store data in a read-only format to give better data retrieval with less maintenance overhead. Some vendors like Nutanix have created hybrid models where RAM-, SSD-, and disk-drive-based storage coexist to give optimal performance.
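The hybrid model is essentially a tiered read path: look in the fastest tier first and promote hot pages upward. Here is a minimal sketch under that assumption; the tier layout and promotion rule are invented for illustration.

```python
# A toy tiered read path: RAM, then SSD, then disk, fastest to slowest.
ram: dict[str, bytes] = {}
ssd: dict[str, bytes] = {}
disk: dict[str, bytes] = {}

def read_page(page_id: str) -> bytes | None:
    for tier in (ram, ssd, disk):
        if page_id in tier:
            page = tier[page_id]
            ram[page_id] = page  # promote the hot page to the fastest tier
            return page
    return None  # page not stored in any tier
```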

Considerations for creating a high-performance data processing engine

1. File system: Creating a dedicated file system for the database is not mandatory, but it gives you huge performance benefits. Irrespective of the file system, optimal data organization depends on whether typical queries access a few records or many. If only a few records are accessed, a B-tree should be sufficient, and the block size should be small so that fetching one row does not read a big block. If queries access many rows, the block size can be a bit larger, and a segment tree could be more performant.
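The block-size trade-off can be made concrete with a little arithmetic. The row size and block sizes below are assumed values, and the formula is a crude proxy for I/O effort, not a real engine's model.

```python
# Rough read-amplification math for the block-size trade-off above.
def blocks_read(rows: int, row_size: int, block_size: int) -> int:
    """Blocks touched when reading `rows` contiguous rows."""
    rows_per_block = max(1, block_size // row_size)
    return -(-rows // rows_per_block)  # ceiling division

ROW_SIZE = 200  # bytes, an assumed average row width

for block_size in (4_096, 65_536, 1_048_576):
    point_io = blocks_read(1, ROW_SIZE, block_size)       # point lookup
    scan_io = blocks_read(100_000, ROW_SIZE, block_size)  # 100k-row scan
    print(f"{block_size:>9} B blocks: {point_io} block per point lookup, "
          f"{scan_io} blocks per scan")
```

Small blocks keep point lookups cheap (little unused data per read), while large blocks cut the number of I/Os a wide scan must issue.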

If the data processor handles unstructured or semi-structured data, it is not recommended to use any tree structure. Rather, the design should be like a key-value store, where the value is the unstructured data file. The block size can be chosen on the scale of megabytes to give better performance.
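A minimal sketch of that layout follows: each value is stored as an opaque blob, written in MB-scale chunks, with no tree index in between. The class name, path, and block size are hypothetical.

```python
# A toy key-value blob store: key -> file, values written in large chunks.
import os

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, an assumed block size

class BlobStore:
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        # No tree structure: the key maps directly to a file on disk.
        with open(os.path.join(self.root, key), "wb") as f:
            for offset in range(0, len(data), BLOCK_SIZE):
                f.write(data[offset:offset + BLOCK_SIZE])

    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

store = BlobStore("/tmp/blobstore")
store.put("doc-42", b"<semi-structured payload>")
assert store.get("doc-42") == b"<semi-structured payload>"
```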

2. Plan generation: The query plan directly affects query performance. For OLTP systems, plan generation time may be comparable to execution time. Furthermore, most queries are similar and differ only in data values, so such systems use extensive plan caching to save planning time.
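The usual trick is to normalize a query into a parameterized template so that similar queries share one cached plan. Below is a minimal sketch; `plan()` is a stand-in stub for an expensive planner, and the regex normalization is deliberately simplistic.

```python
# Plan caching keyed by a normalized query template.
import re
from functools import lru_cache

def normalize(sql: str) -> str:
    """Replace literals with '?' so similar queries share a cache entry."""
    sql = re.sub(r"'[^']*'", "?", sql)  # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)  # numeric literals
    return " ".join(sql.lower().split())

def plan(template: str):
    # Stand-in for a hypothetical, expensive planner call.
    return ("PLAN", template)

@lru_cache(maxsize=1024)
def cached_plan(template: str):
    return plan(template)

def get_plan(sql: str):
    # "... WHERE id = 7" and "... WHERE id = 9" hit the same cache entry.
    return cached_plan(normalize(sql))
```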

For analytical data processing systems, the planning phase may include join order planning, join mechanism planning, access path planning, data geography planning, etc. Much importance is given to planning, because a bad plan may make a query run ten times longer than necessary.
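Join order alone shows why planning pays off. The toy example below enumerates every order of three tables and scores each by the size of its intermediate results; the cardinalities, selectivity, and cost formula are invented for illustration.

```python
# Exhaustive join order search over three tables with a crude cost model.
from itertools import permutations

rows = {"orders": 1_000_000, "customers": 50_000, "regions": 10}
SELECTIVITY = 0.001  # assumed uniform join selectivity

def cost(order) -> float:
    """Sum of intermediate result sizes, a rough proxy for total work."""
    total, intermediate = 0.0, float(rows[order[0]])
    for table in order[1:]:
        intermediate *= rows[table] * SELECTIVITY
        total += intermediate
    return total

best = min(permutations(rows), key=cost)
worst = max(permutations(rows), key=cost)
print("cheapest order:", " JOIN ".join(best), "cost:", cost(best))
print("worst order cost:", cost(worst))
```

Starting from the smallest table keeps the intermediate results small, which is exactly the kind of difference that separates a good plan from one that runs ten times longer.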

3. Cost optimization: If statistics are available, we can estimate how much data a query will process, and from that approximate the read/write effort. Most databases use a cost profile to convert the execution steps into a single unit: time. If the data processor uses parallel processing, the parallel tasks should be accounted for when computing the final elapsed time.
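A cost profile can be as simple as a per-row price for each kind of work. The constants below are assumed values, and dividing by the number of parallel tasks is a simplification that ignores skew and coordination overhead.

```python
# A toy cost profile: convert estimated row counts into estimated seconds.
SEQ_READ_COST = 1e-6  # seconds per row scanned (assumed)
CPU_COST = 2e-7       # seconds per row processed (assumed)

def estimated_seconds(rows_scanned: int, rows_processed: int,
                      parallel_tasks: int = 1) -> float:
    work = rows_scanned * SEQ_READ_COST + rows_processed * CPU_COST
    # Simplification: perfect parallel speedup, no skew or coordination cost.
    return work / max(1, parallel_tasks)

print(estimated_seconds(100_000_000, 100_000_000))                     # serial
print(estimated_seconds(100_000_000, 100_000_000, parallel_tasks=16))  # parallel
```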

4. Scalability: Any centralized component in your data engine can become a bottleneck. The execution engine should be distributed, with workers picking up jobs via messages.
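The pull model looks roughly like this: workers block on a shared queue instead of a central scheduler pushing work to them. An in-process queue stands in here for a real message broker.

```python
# Workers pulling query fragments from a shared job queue.
import queue
import threading

jobs: "queue.Queue[str | None]" = queue.Queue()

def worker(worker_id: int) -> None:
    while True:
        job = jobs.get()   # block until a message arrives
        if job is None:    # sentinel value: shut down
            break
        print(f"worker {worker_id} executing {job}")
        jobs.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for fragment in ("scan:part-1", "scan:part-2", "join:part-1", "agg:part-1"):
    jobs.put(fragment)
jobs.join()        # wait for all fragments to finish
for _ in threads:
    jobs.put(None)  # tell each worker to exit
```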

A customizable memory manager and caching logic can also improve performance. A data processor cannot be maximally efficient with a fully generic design; it should try to solve a particular problem and optimize for that process.
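As one example of customization over genericity, a cache can expose a workload-specific knob, such as pinning hot metadata pages, instead of a one-size-fits-all eviction policy. The pinning rule below is an invented illustration.

```python
# An LRU cache with a pinning knob, sketching workload-specific caching.
from collections import OrderedDict

class PinnableLRU:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.pinned: set = set()

    def get(self, key: str):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None

    def put(self, key: str, value: bytes, pin: bool = False) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        if pin:
            self.pinned.add(key)  # e.g., hot metadata pages never evicted
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                break  # everything pinned; tolerate the overflow
            del self.entries[victim]
```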