So far, we have discussed the top-down approach, which focuses on decomposing raw information of different forms into attributes on data tables and setting criteria that we believe constitute a good database design. We now dive deeper into the process of analyzing and minimizing redundancies and anomalies in data. This process is called normalization. Normalization includes techniques to decompose data while still preserving functional dependencies of attributes, maintaining the original relationships and constraints of the data in the decomposed tables.
Normalization evaluates relation schema through a series of up to 6 tests. Each test has a criteria and a corresponding normal form. The relation schema must meet each criteria to achieve each level of normalisation. The first 4 tests are the most important ones basing on functional dependencies among attributes. The last 2 tests can be optional considering their certain disadvantages, they are based on multivalued dependencies and join dependencies. In the coming part, we will see how the first 4 test are performed to purify and improve the database design .
Assumptions
While performing normalisation process, it is important to note that we are assuming functional dependencies exist within given database tables (relations). This assumption is made based on the fact that database designer follows top-down approach and performs an appropriate design process in previous conceptual stage to produce entity-relationship model.
First Normal Form(1NF)
Criteria : Only atomic/indivisible values are permitted
Multiple values or composite value ( such as array, list, etc.,) are not allowed in a table cell. This is because of the constrain we set for database table that each cell of table contain only one single value.
Three main techniques to achieve the first normal form:
- Remove the non-atomic attribute from the table and place it in a separate table. The new table also include the primary key or the original table for reference. This results in two 1NF relations.
For example, we have table Persons and each person can have multiple address locations:
ID | Name | Locations |
1 | John | California, Texas, Florida |
2 | Mary | New York, Pennsylvania, Georgia |
3 | Peter | Michigan , Illinois, Ohio |
New table and original table after removing the non-atomic attribute:
Person_ID | Location |
1 | California |
1 | Texas |
1 | Florida |
2 | New York |
2 | Pennsylvania |
2 | Georgia |
3 | Michigan |
3 | Illinois |
3 | Ohio |
ID | Name |
1 | John |
2 | Mary |
3 | Peter |
2. Put each location in to a separate rows, other attributes are the same. This has disadvantages of introducing redundancy, hence it is not recommended
ID | Name | Location |
1 | John | California |
1 | John | Texas |
1 | John | Florida |
2 | Mary | New York |
2 | Mary | Pennsylvania |
2 | Mary | Georgia |
3 | Peter | Michigan |
3 | Peter | Illinois |
3 | Peter | Ohio |
3. The third approach is creating more columns and place each location in one column : Location1, location2, ect. This has disadvantage of introducing NULL in case that not all people have three address location. For example, we add to the table another person name Katie with only 1 location, our table will look like this:
ID | Name | Location1 | Location2 | Location3 |
3 | Peter | Michigan | Illinois | Ohio |
4 | Katie | New York | NULL | NULL |
… | … | … | … | … |
Among three solutions proposed above, the first one is considered the best because it have less redundancy compared to other two.
It is also noted that sometimes atomic data can include the entire object or file stored as a binary large object (BLOB) or character large object (CLOB).
Second Normal Form (2NF)
Criteria: Only fully functional dependency exists in the table
The goal of second normalization step is to achieve a relation in which all attributes are fully and functionally dependent on the primary key. In other words, this process removes all possible partial dependencies in the table to make all non-prime attribute fully functionally dependent only on the primary key.
An attribute is non-prime if it is not a part of any key ( either primary key or candidate key). A partial dependency happens when a group of non-prime attributes only functionally depends on a part of the primary key. For example, we have table Person_Project with 3 attributes: PersonID, ProjectID, Project_Location. If Project_Location ony functional depends on ProjectID but not PersonID, we have a partial dependency in this table and we need to remove that to achieve the second normal form.
It is noted that we only consider this test if primary key contains more than 1 attribute.
To achieve the second normal form, we will move the partial dependency out to a separate table (e.g., Project). Only the reference attribute( e.g., ProjectID) remains in the original table.
Third Normal Form(3NF)
Criteria: No transitive dependency exists in the table
Another scenario is that we have certain non-prime attributes, though functional depends on the primary key, but indirectly through one or more other non-prime attributes.
For example, we have 3NF-violated table below:
Table: EmployeeDetails
EmployeeID | EmployeeName | DepartmentID | DepartmentName | DepartmentLocation |
---|---|---|---|---|
1 | John Doe | D01 | HR | New York |
2 | Jane Smith | D02 | Finance | San Francisco |
3 | Alice Johnson | D01 | HR | New York |
4 | Bob Brown | D03 | IT | Chicago |
The table violates 3NF because there are transitive dependencies. Specifically:
- EmployeeID → DepartmentID
- DepartmentID → DepartmentName, DepartmentLocation
This means that DepartmentName and DepartmentLocation are indirectly dependent on EmployeeID through DepartmentID.
To achieve 3NF, we can decompose the above table into two tables: EmployeeDetails which have only 3 attributes: EmployeeID, EmployeeName, and DepartmentID, while DepartmentName and DepartmentLocation are placed in a new table with Primary Key is DepartmentID.
Corrected Tables in 3NF
To correct this and achieve 3NF, we should decompose the table into two separate tables.
Table 1: Employees
EmployeeID | EmployeeName | DepartmentID |
---|---|---|
1 | John Doe | D01 |
2 | Jane Smith | D02 |
3 | Alice Johnson | D01 |
4 | Bob Brown | D03 |
Table 2: Departments
DepartmentID | DepartmentName | DepartmentLocation |
---|---|---|
D01 | HR | New York |
D02 | Finance | San Francisco |
D03 | IT | Chicago |
Now, the Employees table and the Departments table are in 3NF because there are no transitive dependencies. Each non-key attribute is fully functionally dependent on the primary key of its respective table.
Boyce-Codd Normal Form
Criteria: Every functional dependency in a relation, the determinant (left-hand side) always is a candidate key
The third normal form (3NF) can still be achieved if a group of non-prime attributes functionally depends on a prime attribute. This prime attribute cannot be part of primary key ( because it will violate 2NF), but it can be a part of a candidate key. In other case, there is a chance that a prime attribute depends on a non-prime attribute as well. Hence 3NF can achieve if in every dependency A ->B ( indicating that the value of A determines the value of B) either:
- A is a prime attribute
- B is a prime attribute
The critical difference is that BCNF does not allow the exceptions that 3NF allows for functional dependencies where B is a prime attribute.
BCNF enforces stricter rules than 3NF by ensuring that every functional dependency’s determinant is a super key, thereby eliminating potential anomalies and redundancies that 3NF might still allow.
Normal forms do not guarantee a good database design. It must confirm additional properties: guarantees that the spurious tuple generation does not occur after decomposition(critical, must be achieved at any cost), and each functional dependency is preserve after decomposition ( sometime sacrificed)