Normalization
Normalization is a database design technique that reduces data reduandancy and anomalies for insertion, update and deletion through decomposing database relation to have only relevant attributes. Depending on the level of normalization, there are several normal forms. Largely, there are First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF).
Anomalies
Problems that occurs when we try to cram too much information into a single relation.
Redundancy
: Information that may be repeated unnecessarilyInsertion Anomalies
: We may not insert data in a table that refers to other table if we don’t insert the referenced attribute together.Update Anomalies
: We may update information in one tuple but forgot to update the same information in other tuples.Deletion Anomalies
: If a set of values get removed, we may lose other information as a side effect.
Decomposition
Decomposing relations is the accepted way of eliminate anomalies.
Functional Dependency (FD)
- Goal: “equality of the symbols in a given column”
- Some rules for simplification of projected relations:
- “not necessary to check the trivial FD’s”
- “can restrict to looking for FD’s with a singleton right side because of the the combining rule for FD’s”
- If an FD’s left side “does not contain the left side of any given dependency surely cannot hold, since there is no way for its chase test to get started.”
First Normal Form (1NF)
A relation belongs to 1NF if all of its attributes are composed of atomic values.
Composite attributes that have multiple values do not satisfy the conditions for 1NF.
Relation that does not satisfy 1NF
customer_id event_id
membership discount_rate participate c1 e1, e2 regular 10% Y, Y c2 e2, e3 premium 20% Y, N c3 e4 super premium 30% N Relation that satisfies 1NF
customer_id event_id
membership discount_rate participate c1 e1 regular 10% Y c1 e2 regular 10% Y c2 e2 premium 20% Y c2 e3 premium 20% N c3 e4 super premium 30% N - Primary key: (customer_id, event_id)
Anomalies still remaining for 1NF
- Insertion Anomaly: Cannot insert a data ‘(c1, ‘super premium’, 30%)’ without including event_id.
- Update Anomaly: Redundant data (ex: ‘membership’, ‘discount_rate’) for the same customers. If the discount_rate changes for regular membership, we need to update discount_rate attribute for both rows. If we fail to update both rows, data is no longer consistent.
- Deletion Anomaly: We have one customer c3 who has ‘super premium’ membership. If we delete the row of c1, we lose data about the ‘super premium’ membership and its discount_rate.
Second Normal Form (2NF)
A relation belongs to 2NF if it also belongs to 1NF and all attributes except primary key is fully functionally dependent on the primary key.
The table from above is not fully functionalty dependent on its primary key (customer_id, event_id) since membership and discount_rate are not related to event_id. Therefore, if we decompose the table into two relations, we can have attributes fully functionally dependent on the primary key.
The following relations satisfy 2NF:
Customer Relation
customer_id membership discount_rate c1 regular 10% c2 premium 20% c3 super premium 30% c4 regular 10% - Primary key: customer_id
Event Participation Relation
customer_id event_id participate c1 e1 Y c1 e2 Y c2 e2 Y c2 e3 N c3 e4 N c4 e3 Y - Primary key: (customer_id, event_id)
Anomalies still remaining at 2NF
There are multiple functional dependencies in the Customer Relation and this can cause anomalies:
- Insertion Anomaly: We may not insert data that has only (NULL, ‘star’, 35%) since its primary key is customer_id.
- Update Anomaly: If the discount_rate changes for the regular membership, we need to update discount_rate in all the rows that has regular membership. Otherwise, we cannot maintain data consistency.
- Deletion Anomaly: We may lose data about membership and discount_rate due to a customer’s deactivation. If customer c2 deletes their account, then we lose data about premium membership and its discount_rate.
Third Normal Form (3NF)
A relation belongs to 3NF if it also belongs to 2NF and if all non-primary-key attributes do not have transitive dependencies.
- A transitive functional dependency is that if
X -> Y
andY -> Z
, then it satisfiesX -> Z
. We say that Z is transitively dependent on X. - A 3NF relation must satisfy that all attributes are not transitively dependent on the primary key.
Let’s take a look at Customer relation that satisfies 2NF but is not 3NF.
Customer Relation
customer_id membership discount_rate c1 regular 10% c2 premium 20% c3 super premium 30% c4 regular 10% - Primary key: customer_id
- membership depends on customer_id and discount_rate depends on membership.
customer_id -> membership
andmembership -> discount_rate
. Therefore,customer_id -> discount_rate
transitively dependent.
We can decompose this Customer relation into two relations (Customer and Membership) and they can satisfy 3NF.
Customer Relation
customer_id membership c1 regular c2 premium c3 super premium c4 regular - Functional dependency:
customer_id -> membership
- Functional dependency:
Membership Relation
membership discount_rate regular 10% premium 20% super premium 30% - Functional dependency:
membership -> discount_rate
- Functional dependency:
Anomalies still remaining at 3NF
When there are multiple candidate keys, there can be anomalies.
Below is an exmaple relation that can display such anomalies.
Course Registration Relation
customer_id lecture_id instructor_id c1 lec001 inst001 c2 lec002 inst002 c3 lec001 inst001 c3 lec002 inst004 c4 lec001 inst003 - Primary key: (customer_id, lecture_id)
- Conditions:
- A customer can register for multiple lectures and cannot register for the same lecture more than once.
- An instructor can teach only one lecture
- A lecture can be conducted by multiple instructors
Functional dependencies:
(customer_id, lecture_id) -> instructor_id
instructor_id -> lecture_id
(since an instructor can teach only one lecture)- lecture_id is dependent on instructor_id although insturctor_id is not a candidate key
Anomalies:
- Insertion Anomaly: We may not insert data (NULL, lec003, inst005) if there is no customer who registers for the lecture because (customer_id, lecture_id) is the primary key. We cannot assign
NULL
to the primary key. - Update Anomaly: If the lecture_id for the instructor (inst001) changes, we need to update lecture_id in all the rows that have the instructor. If we only update one, it violates the relation condition that an instructor can only teach one lecture.
- Deletion Anomaly: If we remove the row (c3, lec002, inst004), then we lose the only data about inst004 who teaches lec001.
- Insertion Anomaly: We may not insert data (NULL, lec003, inst005) if there is no customer who registers for the lecture because (customer_id, lecture_id) is the primary key. We cannot assign
Boyce-Codd Normal Form (BCNF)
A relation belongs to BCNF if all the determinants in functional dependencies are candidate keys and if the relation belongs to 3NF.
In order to address the anomalies associated with the above 3NF relation, we can decompose the 3NF relation further. This will mainly ensure that there is no functional dependecy’s determinant not being a candidate key.
Customer-Instructor Relation
customer_id instructor_id c1 inst001 c2 inst002 c3 inst001 c3 inst004 c4 inst003 - Primary key: (customer_id, instructor_id)
- No functional dependency
Instructor-Lecture Relation
instructor_id lecture_id inst001 lec001 inst002 lec002 inst003 lec001 inst004 lec002 - Functioanl dependency: instructor_id -> lecture_id
With this decomposition, it resolves the issue where some determinants in functional dependencies are not candidate keys and the relations now satisfy BCNF.
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
BCNF Stages for Relational Database Schema
- Find problems associated with poorly designed database schemas
- Decomposition: break a relation into two smaller schemas
Boyce-Codd Normal Form (BCNF)
: a condition that eliminates the problems found in Stage 1.- Assure BCNF condition by decomposing relational schemas.
Boyce-Codd Normal Form (BCNF)
Goal of Decomposition
To replace a relation by several ones that do not exhibit anomalies. Boyce-Codd Normal Form
is the condition that guarantees that anomalies do not exist.
A relation R is in BCNF if and only if every nontrivial FD A1A2...An -> B1B2...Bm
, {A1, A2, …, An} is a superkey for R.
- A superkey does not need to be minimal.
- The left side of every nontrivial FD must contain a key.
What are superkeys?
A set of attributes that contains a key (“Superset of a key”).
Decomposition into BCNF
Any relation schema can be broken into a collection of subsets of its attributes by repeatedly choosing suitable decompositions, such that:
- The subsets are the schemas of relations in BCNF
- The original relation instance can be re-constructed from its decomposed relation instances.
The decomposition strategy is to look for a nontrivial FD that violates BCNF.
- All the attributes involved in violating FD
- The left side of the FD plus all the attributes not involved in the FD
In general, we must keep applying decompositions rule as many times as needed, until all our relations are in BCNF.
BCNF Decomposition Algorithm
- Input: A relation R0 with a set of FDs S0.
- Output: A decomposition of R0 into a collection of relations, all of which are in BCNF.
- Method: Initially, apply them with
R = R0
andS = S0
.- Check if R is in BCNF. If so, return R.
- If there are BCNF violation, let one be
X -> Y
. ComputeX+
. ChooseR1 = X+
as one relation schema and letR2
to have attributesX
and those attributes of R that are not inX+
. - Compute the sets of FDs for
R1
andR2
. Let them beS1
andS2
. - Recursively decompose
R1
andR2
using this algorithm. Return the union of the results of these decompositions.
Properties of Decomposition
- Elimination of Anomalies
- Recoverability of Information (Can we recover the original relation?)
- Preservation of Dependencies (Can we preserve the original FDs?)
It is not possible to achieve all three properties at once.
Third Normal Form (3NF)
A relation R is in third normal form (3NF) if: whenever A1A2…An -> B1B2…Bm is a nontrivial FD, either:
- {A1, A2, …, An} is a superkey
- Or those of B1, B2, …, Bm that are not among the A’s, are each a member of some key (not necessarily the same key).
- Prime attribute: attribute that is a member of some key
Notes
- First normal form: the condition that every component of every tuple is an atomic value
- Second normal form: a less restrictive version of 3NF
The Synthesis Algorithm for 3NF Schemas
To decompose a relation R into a set of relations such that the decomposed relations are:
- all in 3NF
- a lossless join
- dependency-preserving
Algorithm
Find a minimal basis (G) for F
- How to check a minimal basis FD
- Verify that any of the dependencies in the FD cannot be eliminated
- Verify that any attributes from the left side cannot be eliminated
- How to check a minimal basis FD
For each FD X -> A in G, use XA as the schema of one of the relations in the decomposition
- We take the attributes of each FD as a relation schema. If a relation schema is a proper subset of another schema, we can drop that subset-schema.
If none of the sets of relations from #2 is a superkey for R, add another relation whose schema is a key for R
Multi-Valued Dependency (MVD)
- Goal: “to produce a row of the tableau that has unsubscripted letters in all the attributes of one of the relations of the decomposition; that row may have any letters in the other attributes”
- Some rules for simplification of projected relations:
- “not necessary to check the trivial MVD’s”
- If an MVD’s left side “does not contain the left side of any given dependency surely cannot hold, since there is no way for its chase test to get started.”
References
- A First Course in Database Systems
- 데이터베이스 개론 9장: 정규화