Normalization In DBMS
Database normalization or data normalization is a technique for decomposing record sets to eliminate data redundancy based on the relationship between record attributes. Normalization concept became popular with the adoption of relational database management systems(RDBMS). Entity Relationship model also describes advantages of normalization. This model expects that the data is properly understood and their relationships are analyzed.
Normalization In Relational Databases :
In typical relational data store, some operationally related record sets may contain common attributes. If the record sets have a tightly bound relationship, those common attributes may seem redundant. They occupy unnecessary storage space. Also, when we make any change to one common attribute in a record set, we must immediately update the attribute in other record sets. Failing to do so will create integrity issues, if the relationship is ever checked.
For the dynamically updated data, it is not advisable to keep the common attributes in more than one record set. Rather keep the attribute in one set where it is the most accessed and create references with other sets where it is less frequently accessed. Referential integrity constraints can be defined for this purpose.
When you normalize a database, you have five goals:
- Arrange data into logical groupings such that each group describes a smaller part of the whole
- Minimize the amount of redundant data stored in a database.
- Organize the data in such a way that, a change to one entity is automatically propagated to other entities.
- Make the data manipulation fast and efficiently without compromising the data integrity.
- Make that addition of new database entities easier.
We need to do enough research, before beginning to normalize a database. We need to perform requirements analysis, identify policies and relevant business rules. There are different normal forms. A table is called normalized if it is in third normal form.
1NF : First Normal Form is achieved when each data value is atomic.
2NF : Second Normal Form is achieved when data is already in 1NF and every non-prime attribute of the table is dependent on the whole of every candidate key.
3NF : Third Normal Form means the entity is in second normal form, and all the attributes in the table are determined only by the candidate keys of that table and not by the non-prime attributes. 3NF tables are free of insertion, update, and deletion anomalies.
Denormalized databases have a performance advantage. When the user tries to select a number attributes, if any attribute is accessed from another related record set, the resource used in retrieving the other set hampers performance. On the other hand, if all the attributes can be accessed from the same record set, the query will execute quickly. This is beneficial for analytical systems.
Denormalized database systems may become space intensive, compromise integrity, and become hard to maintain when data volume grows.
Data normalization is primarily important in the transaction processing databases, in which data modifications (inserts, updates, deletes) occur frequently throughout the stored data. In contrast, a data warehouse may require large amount of denormalized and summarized data to help ad hoc queries perform better. Also data warehouse end users do not modify the data. Updates rarely happen in controlled manner. So the data warehouses may apply normalization with selective denormalization to benefit the adhoc queries.
Now a days, anything that is not relational is being called NoSQL.
Normalization In NoSQL Databases :
If there comes a need to fit NoSQL data into normalized databases we have to understand each unit of the NoSQL data, its meaning, where it came from, its quality, its integrity before creating normalization. Organizing the data into related groups may be undesirable, from operational convenience perspective. Consider the e-commerce/social networking sites that generate huge amount of data from user interactions. These data do not have a proper schema or relationship or a valid business rule. Understanding their meaning for normalizing is a tough task.
NoSQL databases tell a different story – Data need not be analyzed or understood before storing. Store them in the form as they are generated. NoSQL databases are designed to capture and process semi-structured/unstructured data, from data sources over which we have little control. Data volume is huge and the data meaning may not remain same over time. This gives us a conclusion that Normalization is not practical in the NoSQL databases. Denormalization is desirable and suited by the design.
But again, over the past years we have come forward to use the log data in different ways. We never imagined these use cases, when creating the data logging applications. In future, when data logging will scale to multiple sources in multiple dimensions, probably the unstructured logs would be treated as subjective definition for atomic data values. Then we may find a use case for applying normalization concept to NoSQL data. We will also need new mechanisms to detect semantic relationships for creating normalization.