Normalization for Relational Database

created:

updated:

tags: database book

Normalization

Normalization is a database design technique that reduces data reduandancy and anomalies for insertion, update and deletion through decomposing database relation to have only relevant attributes. Depending on the level of normalization, there are several normal forms. Largely, there are First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF).

Anomalies

Problems that occurs when we try to cram too much information into a single relation.

  • Redundancy: Information that may be repeated unnecessarily
  • Insertion Anomalies: We may not insert data in a table that refers to other table if we don’t insert the referenced attribute together.
  • Update Anomalies: We may update information in one tuple but forgot to update the same information in other tuples.
  • Deletion Anomalies: If a set of values get removed, we may lose other information as a side effect.

Decomposition

Decomposing relations is the accepted way of eliminate anomalies.

Functional Dependency (FD)

  • Goal: “equality of the symbols in a given column”
  • Some rules for simplification of projected relations:
    • “not necessary to check the trivial FD’s”
    • “can restrict to looking for FD’s with a singleton right side because of the the combining rule for FD’s”
    • If an FD’s left side “does not contain the left side of any given dependency surely cannot hold, since there is no way for its chase test to get started.”

First Normal Form (1NF)

A relation belongs to 1NF if all of its attributes are composed of atomic values.

Composite attributes that have multiple values do not satisfy the conditions for 1NF.

  • Relation that does not satisfy 1NF

    customer_idevent_idmembershipdiscount_rateparticipate
    c1e1, e2regular10%Y, Y
    c2e2, e3premium20%Y, N
    c3e4super premium30%N
  • Relation that satisfies 1NF

    customer_idevent_idmembershipdiscount_rateparticipate
    c1e1regular10%Y
    c1e2regular10%Y
    c2e2premium20%Y
    c2e3premium20%N
    c3e4super premium30%N
    • Primary key: (customer_id, event_id)

Anomalies still remaining for 1NF

  • Insertion Anomaly: Cannot insert a data ‘(c1, ‘super premium’, 30%)’ without including event_id.
  • Update Anomaly: Redundant data (ex: ‘membership’, ‘discount_rate’) for the same customers. If the discount_rate changes for regular membership, we need to update discount_rate attribute for both rows. If we fail to update both rows, data is no longer consistent.
  • Deletion Anomaly: We have one customer c3 who has ‘super premium’ membership. If we delete the row of c1, we lose data about the ‘super premium’ membership and its discount_rate.

Second Normal Form (2NF)

A relation belongs to 2NF if it also belongs to 1NF and all attributes except primary key is fully functionally dependent on the primary key.

The table from above is not fully functionalty dependent on its primary key (customer_id, event_id) since membership and discount_rate are not related to event_id. Therefore, if we decompose the table into two relations, we can have attributes fully functionally dependent on the primary key.

The following relations satisfy 2NF:

  • Customer Relation

    customer_idmembershipdiscount_rate
    c1regular10%
    c2premium20%
    c3super premium30%
    c4regular10%
    • Primary key: customer_id
  • Event Participation Relation

    customer_idevent_idparticipate
    c1e1Y
    c1e2Y
    c2e2Y
    c2e3N
    c3e4N
    c4e3Y
    • Primary key: (customer_id, event_id)

Anomalies still remaining at 2NF

There are multiple functional dependencies in the Customer Relation and this can cause anomalies:

  • Insertion Anomaly: We may not insert data that has only (NULL, ‘star’, 35%) since its primary key is customer_id.
  • Update Anomaly: If the discount_rate changes for the regular membership, we need to update discount_rate in all the rows that has regular membership. Otherwise, we cannot maintain data consistency.
  • Deletion Anomaly: We may lose data about membership and discount_rate due to a customer’s deactivation. If customer c2 deletes their account, then we lose data about premium membership and its discount_rate.

Third Normal Form (3NF)

A relation belongs to 3NF if it also belongs to 2NF and if all non-primary-key attributes do not have transitive dependencies.

  • A transitive functional dependency is that if X -> Y and Y -> Z, then it satisfies X -> Z. We say that Z is transitively dependent on X.
  • A 3NF relation must satisfy that all attributes are not transitively dependent on the primary key.

Let’s take a look at Customer relation that satisfies 2NF but is not 3NF.

  • Customer Relation

    customer_idmembershipdiscount_rate
    c1regular10%
    c2premium20%
    c3super premium30%
    c4regular10%
    • Primary key: customer_id
    • membership depends on customer_id and discount_rate depends on membership.
      • customer_id -> membership and membership -> discount_rate. Therefore, customer_id -> discount_rate transitively dependent.

We can decompose this Customer relation into two relations (Customer and Membership) and they can satisfy 3NF.

  • Customer Relation

    customer_idmembership
    c1regular
    c2premium
    c3super premium
    c4regular
    • Functional dependency: customer_id -> membership
  • Membership Relation

    membershipdiscount_rate
    regular10%
    premium20%
    super premium30%
    • Functional dependency: membership -> discount_rate

Anomalies still remaining at 3NF

When there are multiple candidate keys, there can be anomalies.

Below is an exmaple relation that can display such anomalies.

  • Course Registration Relation

    customer_idlecture_idinstructor_id
    c1lec001inst001
    c2lec002inst002
    c3lec001inst001
    c3lec002inst004
    c4lec001inst003
    • Primary key: (customer_id, lecture_id)
    • Conditions:
      • A customer can register for multiple lectures and cannot register for the same lecture more than once.
      • An instructor can teach only one lecture
      • A lecture can be conducted by multiple instructors
  • Functional dependencies:

    • (customer_id, lecture_id) -> instructor_id
    • instructor_id -> lecture_id (since an instructor can teach only one lecture)
    • lecture_id is dependent on instructor_id although insturctor_id is not a candidate key
  • Anomalies:

    • Insertion Anomaly: We may not insert data (NULL, lec003, inst005) if there is no customer who registers for the lecture because (customer_id, lecture_id) is the primary key. We cannot assign NULL to the primary key.
    • Update Anomaly: If the lecture_id for the instructor (inst001) changes, we need to update lecture_id in all the rows that have the instructor. If we only update one, it violates the relation condition that an instructor can only teach one lecture.
    • Deletion Anomaly: If we remove the row (c3, lec002, inst004), then we lose the only data about inst004 who teaches lec001.

Boyce-Codd Normal Form (BCNF)

A relation belongs to BCNF if all the determinants in functional dependencies are candidate keys and if the relation belongs to 3NF.

In order to address the anomalies associated with the above 3NF relation, we can decompose the 3NF relation further. This will mainly ensure that there is no functional dependecy’s determinant not being a candidate key.

  • Customer-Instructor Relation

    customer_idinstructor_id
    c1inst001
    c2inst002
    c3inst001
    c3inst004
    c4inst003
    • Primary key: (customer_id, instructor_id)
    • No functional dependency
  • Instructor-Lecture Relation

    instructor_idlecture_id
    inst001lec001
    inst002lec002
    inst003lec001
    inst004lec002
    • Functioanl dependency: instructor_id -> lecture_id

With this decomposition, it resolves the issue where some determinants in functional dependencies are not candidate keys and the relations now satisfy BCNF.

Fourth Normal Form (4NF)

Fifth Normal Form (5NF)


BCNF Stages for Relational Database Schema

  1. Find problems associated with poorly designed database schemas
  2. Decomposition: break a relation into two smaller schemas
  3. Boyce-Codd Normal Form (BCNF): a condition that eliminates the problems found in Stage 1.
  4. Assure BCNF condition by decomposing relational schemas.

Boyce-Codd Normal Form (BCNF)

Goal of Decomposition

To replace a relation by several ones that do not exhibit anomalies. Boyce-Codd Normal Form is the condition that guarantees that anomalies do not exist.

A relation R is in BCNF if and only if every nontrivial FD A1A2...An -> B1B2...Bm, {A1, A2, …, An} is a superkey for R.

  • A superkey does not need to be minimal.
  • The left side of every nontrivial FD must contain a key.

What are superkeys?

A set of attributes that contains a key (“Superset of a key”).

Decomposition into BCNF

Any relation schema can be broken into a collection of subsets of its attributes by repeatedly choosing suitable decompositions, such that:

  • The subsets are the schemas of relations in BCNF
  • The original relation instance can be re-constructed from its decomposed relation instances.

The decomposition strategy is to look for a nontrivial FD that violates BCNF.

  • All the attributes involved in violating FD
  • The left side of the FD plus all the attributes not involved in the FD

In general, we must keep applying decompositions rule as many times as needed, until all our relations are in BCNF.

BCNF Decomposition Algorithm

  • Input: A relation R0 with a set of FDs S0.
  • Output: A decomposition of R0 into a collection of relations, all of which are in BCNF.
  • Method: Initially, apply them with R = R0 and S = S0.
    1. Check if R is in BCNF. If so, return R.
    2. If there are BCNF violation, let one be X -> Y. Compute X+. Choose R1 = X+ as one relation schema and let R2 to have attributes X and those attributes of R that are not in X+.
    3. Compute the sets of FDs for R1 and R2. Let them be S1 and S2.
    4. Recursively decompose R1 and R2 using this algorithm. Return the union of the results of these decompositions.

Properties of Decomposition

  1. Elimination of Anomalies
  2. Recoverability of Information (Can we recover the original relation?)
  3. Preservation of Dependencies (Can we preserve the original FDs?)

It is not possible to achieve all three properties at once.

Third Normal Form (3NF)

A relation R is in third normal form (3NF) if: whenever A1A2…An -> B1B2…Bm is a nontrivial FD, either:

  • {A1, A2, …, An} is a superkey
  • Or those of B1, B2, …, Bm that are not among the A’s, are each a member of some key (not necessarily the same key).
  • Prime attribute: attribute that is a member of some key

Notes

  • First normal form: the condition that every component of every tuple is an atomic value
  • Second normal form: a less restrictive version of 3NF

The Synthesis Algorithm for 3NF Schemas

To decompose a relation R into a set of relations such that the decomposed relations are:

  • all in 3NF
  • a lossless join
  • dependency-preserving

Algorithm

  1. Find a minimal basis (G) for F

    • How to check a minimal basis FD
      • Verify that any of the dependencies in the FD cannot be eliminated
      • Verify that any attributes from the left side cannot be eliminated
  2. For each FD X -> A in G, use XA as the schema of one of the relations in the decomposition

    • We take the attributes of each FD as a relation schema. If a relation schema is a proper subset of another schema, we can drop that subset-schema.
  3. If none of the sets of relations from #2 is a superkey for R, add another relation whose schema is a key for R

Multi-Valued Dependency (MVD)

  • Goal: “to produce a row of the tableau that has unsubscripted letters in all the attributes of one of the relations of the decomposition; that row may have any letters in the other attributes”
  • Some rules for simplification of projected relations:
    • “not necessary to check the trivial MVD’s”
    • If an MVD’s left side “does not contain the left side of any given dependency surely cannot hold, since there is no way for its chase test to get started.”

References