In one of our blog post we covered on a high level what is a database and we gave you examples of databases that were normalized and are frequently used in operational databases (every day use databases).
In a separate article we will go into details of normalization process but in this post we will focus on three reasons for normalization in a database.
Before we go into reasons for normalization let's make it very clear who is interested in normalization. This is important because there are different types of databases and two main kind of databases are:
- Operational Databases (Online Transaction Processing - OLTP) - These databases are used every day by employees or customers and generally they are updated on row by row basis for example: An Employee may retrieve one customer's details (one row) and update one customer details (one row). These databases are generally used by large number of users that may potentially want to update the same records. OLTP databases are the most widely used databases but bear in mind that the reason behind it, is it's purpose (operational activities) and therefore they will be often many purpose specific databases. OLTP Databases are often critical from operations perspective (selling an air ticket) and are available for 99.9%-100% of the time.
- Data Warehouses (Supports Online Analytical Processing - OLAP) - This kind of database is actually the opposite of OLTP in many aspects. Data Warehouses integrate data from multiple sources (usually many OLTP databases but also excel files etc) into one centralized location. During integration process (commonly known as ETL) the data is cleansed (remove duplicates, typos etc) and business rules are applied (reject/accept, calculations etc). Data Warehouses are usually used by decision makers and analysts and they access set of rows (rarely single row) and they typically ask questions like how much did the company (or department) sale last month and the same time a year before. Most companies should have only one data warehouse and the database is critical for supporting tactical and strategic decisions (sometimes operational)
NOTE: When we talk about Data warehouses we mean Kimball Data Warehousing methodology which is the most common one. There are different ones where our arguments may not be applicable.
IMPORTANT: Normalization supports OLTP characteristics and objectives and you should always bear this in mind because Normalization almost never is used in Data Warehouse databases. You may see normalized Data warehouses but that is only because someone was not aware of negative consequences of doing that.
Below are three main reasons for normalization (from my point of view).
Reminder: These reasons are not applicable in Data Warehouses because they are handled differently and usage is different.
- Data Integrity
- Data in each table should contain unique instances of that entity (table) with valid attribute (column) values. That means in Customer Entity (table) you should have only one customer (not the same customer with misspelled surname, or changed surname and so on) also certain customer attributes should be valid for instance DOB should not be in the future or sometimes it should be enforce rule of at least 18 years. UK customer Post Code should also be valid and you can enforce UK post code pattern (that get's complicated with multiple countries)
- You can achieve that using table constraints like: Primary Key, Unique Key, Check constraint, default constraint.
- Related tables should contain valid values.This is also known as Referential Integrity which means that two related tables should have related keys. In order words when we break it down to Parent (Primary Key field) and Child table (Foreign Key) we should not have Child table Foreign Key values that do not exist in Parent table.
- We can achieve that using Foreign Key Constraints that will ensure you cannot delete (or insert) value into Foreign Key field that does not exist as Primary Key field value.
- Controlled Data Manipulation
- In OLTP many users may want to access and modify the same rows. After Normalization process you have many tables that contain usually one values in one table (one instance in one table to be more precise). This means that if you need to modify Product Category you don't lock product information and just lock one particular table which is used in many other tables (using Foreign Key). This means users conflicts (trying to modify the same row) is minimized to minimum.
- Security of data is also important and with many tables you get greater control over data access.
- If you break down table into multiple smaller tables than that means you will save space (note: this is not really a good argument in data warehouses due to different usage). When you save space that means you get better performance because accessing particular row requires less effort.
- In databases you often improve retrieval performance by adding indexes and in OLTP choosing the right indexes in particularly important as there are large amount of request that usually affects individual rows or small set of rows and you want to ensure extract of one or small set of rows is very fast but at the same time you don't want to slow down updates, inserts and deletes and the more indexes you have to slower updates/inserts/delete are therefore choosing the right indexes is particularly important and usually have a big affect on users experience.
The above are our three reasons for normalization in OLTP databases. If you have any other reasons that you believe are key than fell free to add it and we will update our blog post with your points.