When you read about normalization you usually get the set of conditions that a database in the nᵗʰ normal form should satisfy and the set of rules, a sort of a cook-book, for obtaining that normal form. But why these rules are safe to apply to your denormalized relations is a skip material. Here, I would like to present some elementary concepts on how we decompose relations and what can go wrong.

The concepts I want to present may seem a bit complicated but, on the other hand, they are so basic that you rarely see it written. Still, they are essential for understanding how the normalization process works. One may prepare a perfect meal just by following a recipe but no one can master the art of cooking without understanding what's going on behind-the-scenes. The same holds true with databases.

I will talk about relations **with functional dependencies only**. Some time ago, Agnieszka did an excellent exposition on functional dependencies. Here, I only briefly review a basic fact. In a relation R, **an attribute B is functionally dependent on attributes A _{1}, …, A_{n.}**(denoted as A1, …, An → B), if for every possible scenario there cannot be two tuples in R that are equal on all A's but differ on B.

I will also need a concept of a natural join. Well, I bet you know what a natural join is but let me say something, just in case. If you have two relations, say **Dogs (dog_id, dog_name, owner_id)** and **Owners(owner_id, owner_name, address)** then a natural join of **Dogs** and **Owners** consists of all tuples from **Dogs** merged with tuples from **Owners** that have the same value on the columns with the same name. In our case this column is **owner_id** and we get a relation with all information about a dog and its owner in one row. A notation for a natural join is **Dogs⋈Owners** and it has columns **dog_id**, **dog_name**, **owner_id**, **owner_name**, **address** (a common column is not repeated). But be careful, if instead of **dog_name** and **owner_name** we had in both tables a column **name**, then a natural join would give us only dogs and their owners that have the same name! Usually this is not what we want.

### How Do You Decompose a Relation?

Now, let’s go back to the subject at hand. Our working example will be the relation **LazyAuthors**.

Author | Title | Year |
---|---|---|

John Doe | The story of my life | 2011 |

John Doe | The story of my uncle's life | 2012 |

Mary Major | Quantum gravity for dummies | 2012 |

Let us assume that no author publishes two books in the same year (they are indeed lazy authors) and that no two authors publish books under the same title. We allow that there are two editions of the same book in different years. We can write these dependencies as:

**author**, **year** → **title** and **title** → **author**.

What happens when you decompose a relation? Let’s decompose **LazyAuthors** into two relations, **AuthorYear** and **TitleYear**. We choose for the first relation the attributes **author** and **year** and perform a projection of **LazyAuthors** onto these columns. Thus, **AuthorYear** may be understood as the result of the following query: `select author, year from LazyAuthors`

. Similarly, **TitleYear** may be understood as the result of: `select title, year from LazyAuthors`

. And here we have our new relations!

AuthorYear | TitleYear | |||

Author | Year | Title | Year | |
---|---|---|---|---|

John Doe | 2011 | The story of my life | 2011 | |

John Doe | 2012 | The story of my uncle's life | 2012 | |

Mary Major | 2012 | Quantum gravity for dummies | 2012 |

You may often meet a notation for projection with π and the set of attributes on which we project. Then, we could write **AuthorYear** as **π _{{author, year}}(LazyAuthors)** and

**TitleYear**could be written as

**π**. In what follows I will use this notation.

_{{title, year}}(LazyAuthors)### Why can’t I just split my relation into two?

Splitting is just one side of a normalization process. Once you split an immediate question arises: **how could I retrieve my original relation from the two split ones?** After all, you do not want to lose some information by decomposing, do you? A usual answer for this is: **use natural join**! It is a good answer but **it works only if we split our original relation neatly**. Do you remember our two relations **AuthorYear** and **TitleYear**? If we perform a natural join **AuthorYear**⋈**TitleYear** we get something like

Author | Title | Year |
---|---|---|

John Doe | The story of my life | 2011 |

John Doe | The story of my uncle's life | 2012 |

John Doe | Quantum gravity for dummies | 2012 |

Mary Major | Quantum gravity for dummies | 2012 |

Mary Major | The story of my uncle's life | 2012 |

**Look, we got some new tuples! Oops!** Since two authors wrote a book in the year 2012 they were both joined with two titles by the equality of the year column. What’s worse, if we only have relations **AuthorYear** and **TitleYear** there is no way to reconstruct the original **LazyAuthors**. We will never know who wrote *Quantum gravity for dummies*.

Thus, if I decompose a relation R as π_{X}(R) and π_{Y}(R), how could I secure the equality R = π_{X}(R)⋈π_{Y}(R)? It would be enough that the intersection A of X and Y, A = X ∩ Y, implies all attributes in Y (or that A implies values of all attributes in X, for that matter). If it holds then any tuple from R generates exactly one tuple in π_{Y}(R) and any tuple in π_{X}(R) will be joined with exactly one tuple from π_{Y}(R). This suffices for the equality to hold.

What does it mean for our example? Let's decompose **LazyAuthors** into **TitleYear** and **AuthorTitle** (remember **Title** → **Author**!). Now, after joining we get **LazyAuthors** again. Hurray! So, is such a decomposition correct? Not so easy.

### Lazy Authors Shall Never Be Busy!

After we decomposed a relation we work only with the relations we obtained, the original one is lost and may be reconstructed only with a join. Problems may arise when **AuthorYear** and **TitleYear** change over time. Assume that we add to **TitleYear** a tuple (**The story of my aunt’s life, 2011**). This is legal since there are no constraints on **TitleYear**. After the join we get something like this.

Author | Title | Year |
---|---|---|

John Doe | The story of my life | 2011 |

John Doe | The story of my aunt's life | 2011 |

John Doe | The story of my uncle's life | 2012 |

Mary Major | Quantum gravity for dummies | 2012 |

Our LazyAuthors are no longer lazy! But why did this happen? After decomposing we lost one functional dependency. The attributes of **author, year → title** were split into two tables and we lost this constraint in our database. A lesson to remember is that when we decompose a relation, we should think about all possible data our database may acquire. Therefore, we distribute functional dependencies of our original table as well. If we lost a functional dependency then it is possible that sometime in the future we insert data to decomposed tables that will not satisfy our original constraints after a join.

### Two Conclusions

We ended with two rules. When decomposing a relation R into π_{x}(R) and π_{y}(R), with A= X ∩Y, we should have:

**A**implies all attributes in**X**or in**Y**- All functional dependencies of
**R**may be reconstructed from functional dependencies of**π**and_{X}(R)**π**_{Y}(R)

Now, if all constraints of R are functional dependencies and if you meet the above requirements then you will have R=π_{X}(R)⋈π_{Y}(R) (point 1) and you will always get a legal relation R after a join (point 2).

Of course, there are more constraints than functional ones and, of course, there is a whole theory behind the second point but this is already a long post. At last, let me mention that the functional dependencies of our **LazyAuthors** are based on an example by Beeri and Bernstein from 1979^{[1]}. With these examples they showed that you cannot always get a Boyce-Codd normal form. But this is a topic for another story.

1 Beeri, Catriel and Bernstein, Philip A. Computational problems related to the design of normal form relational schemas. ACM Transactions on Database Systems 4(1979), p. 30-59.