Database Management Systems - Prof. Holowczak
Zicklin School of Business - Baruch College
City University of New York
Database Management Systems
Normalization
What You'll Learn
| Pratt/Adamski | Rob/Coronel (5th ed) | Elmasri/Navathe (3rd) ed. | Kroenke (7th ed.) | Hoffer, Prescott & McFadden (6th ed) | Mata-Toledo / Cushman
|
|---|
| Chapter 5 | Chapter 4 | Chapter 14 and 15 | Chapter 5 | Chapter 5 and Appendix B | Shaum's Outlines Ch. 4 and 5
|
The Relational Model
- Recall, the Relational Model
consists of the
elements: relations, which are made up of
attributes.
- A relation is a set of columns (attributes)
with values for each attribute such that:
- Each column (attribute) value must be a single
value only.
- All values for a given column (attribute) must
be of the same type.
- Each column (attribute) name must be unique.
- The order of columns is insignificant
- No two rows (tuples) in a relation can be
identical.
- The order of the rows (tuples) is
insignificant.
- From our discussion of E-R Modelling, we know
that an Entity typically corresponds to a
relation and that the Entity's attributes
become attributes of the relation.
- We also discussed how, depending on the
relationships between entities, copies of
attributes (the identifiers) were placed in
related relations.
The process we are following is:
- Gather user/business requirements.
- Develop the E-R Model (shown as an E-R Diagram)
based on the user/business requirements.
- Convert the E-R Model to a set of relations
in the relational model
- Normalize the relations to remove any anomalies (***).
- Implement the database by creating a table for
each normalized relation.
Functional Dependencies
- A Functional Dependency describes a
relationship between attributes in a
single relation.
- An attribute is functionally dependant
on another if we can use the value of one
attribute to
determine the value of another.
- Example: Employee_Name is functionally
dependant on Social_Security_Number because
Social_Security_Number can be used to
determine the value of Employee_Name.
- We use the symbol -> to indicate a
functional dependency.
-> is read functionally determines
Student_ID -> Student_Major
Student_ID, Course#, Semester# -> Grade
SKU -> Compact_Disk_Title, Artist
Model, Options, Tax -> Car_Price
Course_Number, Section -> Professor, Classroom, Number of Students
- The attributes listed on the left hand
side of the -> are called
determinants.
One can read A -> B as, "A determines B".
Keys and Uniqueness
- Key: One or more attributes that
uniquely identify a tuple (row) in a relation.
- The selection of keys will depend on the
particular application being considered.
- Users can offer some guidance as to what would
make an appropriate key. Also this is pretty
much an art as opposed to an exact
science.
- Recall that no two relations should have
exactly the same values, thus a candidate key
would consist of all of the attributes in a
relation.
- A key functionally determines a tuple (row).
- Not all determinants are keys.
Modification Anomalies
- Once our E-R model has been converted into
relations, we may find that some relations are
not properly specified. There can be a number
of problems:
- Deletion Anomaly: Deleting a
relation results in some related
information (from another entity) being
lost.
- Insertion Anomaly: Inserting a
relation requires we have information
from two or more entities - this
situation might not be feasible.
- Here is a quick example: A company has a
Purchase order form:
- Our dutiful consultant creates the E-R Model:
LINE_ITEMS (PO_Number, ItemNum, PartNum, Description, Price, Qty)
PO_HEADER (PO_Number, PODate, Vendor, Ship_To, ...)
Consider some sample data for the LINE_ITEMS relation:
| PO_Number | ItemNum | PartNum | Description | Price | Qty
|
|---|
| O101 | I01 | P99 | Plate | $3.00 | 7
|
| O101 | I02 | P98 | Cup | $1.00 | 11
|
| O101 | I03 | P77 | Bowl | $2.00 | 6
|
| O102 | I01 | P99 | Plate | $3.00 | 5
|
| O102 | I02 | P77 | Bowl | $2.00 | 5
|
| O103 | I01 | P33 | Fork | $2.50 | 8
|
- What are some of the problems with this relation ?
What happens when we delete item 2 from Order O101 ?
- These problems occur because the relation
in question contains data about 2 or more
themes.
- Typical way to solve these anomalies is to
split the relation in to two or more
relations - Process called Normalization.
- Consider the performance impact.
Normalization
- Relations can fall into one or more categories
(or classes) called Normal Forms
- Normal Form: A class of relations free
from a certain set of modification anomalies.
- Normal forms are given name such as:
- First normal form (1NF)
- Second normal form (2NF)
- Third normal form (3NF)
- Boyce-Codd normal form (BCNF)
- Fourth normal form (4NF)
- Fifth normal form (5NF)
- Domain-Key normal form (DK/NF)
- These forms are cumulative. A relation in
Third normal form is also in 2NF and 1NF.
First Normal Form (1NF)
- A relation is in first normal form if it meets
the definition of a
relation:
- Each column (attribute) value must be a single
value only.
- All values for a given column (attribute) must
be of the same type.
- Each column (attribute) name must be unique.
- The order of columns is insignificant.
- No two rows (tuples) in a relation can be
identical.
- The order of the rows (tuples) is
insignificant.
- If you have a key defined for the relation,
then you can meet the unique row
requirement.
- Example relation in 1NF:
STOCKS (Company, Symbol, Date, Close_Price)
| Company | Symbol | Date | Close Price
|
|---|
| IBM | IBM | 01/05/94 | 101.00
|
| IBM | IBM | 01/06/94 | 100.50
|
| IBM | IBM | 01/07/94 | 102.00
|
| Netscape | NETS | 01/05/94 | 33.00
|
| Netscape | NETS | 01/06/94 | 112.00
|
Second Normal Form (2NF)
- A relation is in second normal form (2NF) if
all of its non-key attributes are dependent on
all of the key.
- Relations that have a single attribute for a
key are automatically in 2NF.
- This is one reason why we often use artificial
identifiers as keys.
- In the example below, Close Price
is dependent on Company, Date and
Symbol, Date
- The following example relation is not in 2NF:
STOCKS (Company, Symbol, Headquarters, Date, Close_Price)
| Company | Symbol | Headquarters | Date | Close Price
|
|---|
| IBM | IBM | Armonk, NY | 01/05/94 | 101.00
|
| IBM | IBM | Armonk, NY | 01/06/94 | 100.50
|
| IBM | IBM | Armonk, NY | 01/07/94 | 102.00
|
| Netscape | NETS | Sunyvale, CA | 01/05/94 | 33.00
|
| Netscape | NETS | Sunyvale, CA | 01/06/94 | 112.00
|
Company, Date -> Close Price
Symbol, Date -> Close Price
Company -> Symbol, Headquarters
Symbol -> Company, Headquarters
- Consider that Company, Date -> Close Price.
So we might use Company, Date as our key.
However: Company -> Headquarters
This violates the rule for 2NF. Also, consider the
insertion and deletion anomalies.
- One Solution: Split this up into
two relations:
COMPANY (Company, Symbol, Headquarters)
STOCKS (Symbol, Date, Close_Price)
| Company | Symbol | Headquarters
|
|---|
| IBM | IBM | Armonk, NY
|
| Netscape | NETS | Sunnyvale, CA
|
Company -> Symbol, Headquarters
Symbol -> Company, Headquarters
| Symbol | Date | Close Price
|
|---|
| IBM | 01/05/94 | 101.00
|
| IBM | 01/06/94 | 100.50
|
| IBM | 01/07/94 | 102.00
|
| NETS | 01/05/94 | 33.00
|
| NETS | 01/06/94 | 112.00
|
Symbol, Date -> Close Price
Third Normal Form (3NF)
- A relation is in third normal form (3NF) if it
is in second normal form
and it contains no
transitive dependencies.
- Consider relation R containing attributes A, B
and C.
If A -> B and B -> C then
A -> C
- Transitive Dependency: Three attributes
with the above dependencies.
- Example: At CUNY:
Course_Code -> Course_Num, Section
Course_Num, Section -> Classroom, Professor
- Example: At Rutgers:
Course_Index_Num -> Course_Num, Section
Course_Num, Section -> Classroom, Professor
- Example:
| Company | County | Tax Rate
|
|---|
| IBM | Putnam | 28%
|
| AT&T | Bergen | 26%
|
Company -> County
and
County -> Tax Rate
thus
Company -> Tax Rate
- What happens if we remove AT&T ?
We loose information about 2 different themes.
- Split this up into two relations:
| Company | County
|
|---|
| IBM | Putnam
|
| AT&T | Bergen
|
Company -> County
| County | Tax Rate
|
|---|
| Putnam | 28%
|
| Bergen | 26%
|
County -> Tax Rate
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
- There are certain conditions under which
after decomposing a relation, it cannot be
reassembled back into its original form.
- We don't consider these issues here.
Domain Key Normal Form (DK/NF)
De-Normalization
- Consider the following relation:
CUSTOMER (CustomerID, Name, Address,
City, State, Zip)
- This relation is not in DK/NF because it
contains a functional dependency not implied
by the key.
Zip -> City, State
- We can normalize this into DK/NF by splitting
the CUSTOMER relation into two:
CUSTOMER (CustomerID, Name, Address,
Zip)
CODES (Zip, City, State)
- We may pay a performance penalty - each
customer address lookup requires we look in
two relations (tables).
- In such cases, we may de-normalize the
relations to achieve a performance
improvement.
All-in-One Example
Many of you asked for a "complete" example that would run
through all of the normal forms from beginning to end
using the same tables. This is tough to do, but here is
an attempt:
Example relation:
EMPLOYEE ( Name, Project, Task, Office, Phone )
Note: Keys are underlined.
Example Data:
| Name | Project | Task | Office | Floor | Phone
|
|---|
| Bill | 100X | T1 | 400 | 4 | 1400
|
| Bill | 100X | T2 | 400 | 4 | 1400
|
| Bill | 200Y | T1 | 400 | 4 | 1400
|
| Bill | 200Y | T2 | 400 | 4 | 1400
|
| Sue | 100X | T33 | 442 | 4 | 1442
|
| Sue | 200Y | T33 | 442 | 4 | 1442
|
| Sue | 300Z | T33 | 442 | 4 | 1442
|
| Ed | 100X | T2 | 588 | 5 | 1588
|
- Name is the employee's name
- Project is the project they are working
on. Bill is working on two different projects,
Sue is working on 3.
- Task is the current task being worked
on. Bill is now working on
Tasks T1 and T2. Note that Tasks are independent
of the project. Examples of a task might
be faxing a memo or holding a meeting.
- Office is the office number for the
employee. Bill works in office number 400.
- Floor is the floor on which the office
is located.
- Phone is the phone extension. Note this
is associated with the phone in the given
office.
First Normal Form
- Assume the key is Name,
Project, Task.
- Is EMPLOYEE in 1NF ?
Second Normal Form
Third Normal Form
- Assume each office has exactly one phone
number.
- Are there any transitive dependencies ?
- Where are the modification anomalies in EMPLOYEE_OFFICE_PHONE ?
- Split EMPLOYEE_OFFICE_PHONE.
EMPLOYEE_PROJECT_TASK (Name, Project, Task)
| Name | Project | Task
|
|---|
| Bill | 100X | T1
|
| Bill | 100X | T2
|
| Bill | 200Y | T1
|
| Bill | 200Y | T2
|
| Sue | 100X | T33
|
| Sue | 200Y | T33
|
| Sue | 300Z | T33
|
| Ed | 100X | T2
|
EMPLOYEE_OFFICE (Name, Office, Floor)
| Name | Office | Floor
|
|---|
| Bill | 400 | 4
|
| Sue | 442 | 4
|
| Ed | 588 | 5
|
EMPLOYEE_PHONE (Office, Phone)
| Office | Phone
|
|---|
| 400 | 1400
|
| 442 | 1442
|
| 588 | 1588
|
Boyce-Codd Normal Form
- List all of the functional dependencies for
EMPLOYEE_PROJECT_TASK,
EMPLOYEE_OFFICE and EMPLOYEE_PHONE.
Look at the determinants.
- Are all determinants candidate keys ?
Forth Normal Form
- Are there any multivalued dependencies ?
- What are the modification anomalies ?
- Split EMPLOYEE_PROJECT_TASK.
EMPLOYEE_PROJECT (Name, Project )
| Name | Project
|
|---|
| Bill | 100X
|
| Bill | 200Y
|
| Sue | 100X
|
| Sue | 200Y
|
| Sue | 300Z
|
| Ed | 100X
|
EMPLOYEE_TASK (Name, Task )
| Name | Task
|
|---|
| Bill | T1
|
| Bill | T2
|
| Sue | T33
|
| Ed | T2
|
EMPLOYEE_OFFICE (Name, Office, Floor)
| Name | Office | Floor
|
|---|
| Bill | 400 | 4
|
| Sue | 442 | 4
|
| Ed | 588 | 5
|
R4 (Office, Phone)
| Office | Phone
|
|---|
| 400 | 1400
|
| 442 | 1442
|
| 588 | 1588
|
At each step of the process, we did the following:
- Write out the relation
- (optionally) Write out some example data.
- Write out all of the functional dependencies
- Starting with 1NF, go through each normal
form and state why the relation is in the
given normal form.
Another short example
Consider the following example of normalization for
a CUSTOMER relation.
Relation Name
CUSTOMER (CustomerID, Name, Street, City, State, Zip, Phone)
Example Data
| CustomerID | Name | Street | City | State | Zip | Phone
|
|---|
| C101 | Bill Smith | 123 First St. | New Brunswick | NJ | 07101 | 732-555-1212
|
| C102 | Mary Green | 11 Birch St. | Old Bridge | NJ | 07066 | 908-555-1212
|
Functional Dependencies
CustomerID -> Name, Street, City, State, Zip, Phone
Zip -> City, State
Normalization
As a final step, consider de-normalization.
File: week5.html Date: 1:20 PM 9/7/2005
All materials Copyright, 1997-2005 Richard Holowczak