Normalization in SQL: 1NF, 2NF, 3NF, and BCNF

Overview

Normalization in SQL is the process to eliminate data redundancy and enhance data integrity in the table. Normalization also helps to organize the data in the database. It is a multi-step process that sets the data into tabular form and removes the duplicated data from the relational tables.

Note: The article about normalization in SQL requires prior knowledge of important related topics like keys in DBMS, and functional dependencies. I’ve discussed a few of them at the bottom of this article.

What is Normalization in SQL?

Let’s begin with why normalization in SQL was introduced originally by discussing the problem it solves. As we know, we’ve relations (or tables) where we store our data, and if you remember one part of DBMS talks about management, we need to manage this data efficiently. Let’s see the wrong way first.

student_id	student_name	student_sub	teacher_id	teacher_name	teacher_dept
1	A	science	1	X	science
2	B	math	2	Y	mathematics
3	C	science	1	X	science

Can you identify what’s wrong with the table? (Hint: find the number of entities)

If we see clearly, there are 2 entities, one is a student and the second is a teacher. It can be taken in 2 ways,

They both are added to a single table.
One entity is merged with another. In the above example, in the student table, we stuffed the teacher’s data or vice-versa.

The problem with the first assumption is clear. We don’t have a proper way to query the data. Because we need a key to get the data but if we make Student_id a key then we get the teacher’s data too which we don’t want or vice-versa. So, it’s kinda like a heap of data with no organization.

But what’s wrong with another scenario? Actually, to solve this kinda problem normalization is introduced originally. So please have a nice look at this Section.

The three problems or anomalies are:

Insertion Anomaly
Deletion Anomaly
Modification Anomaly

Insertion anomaly means cases where you can’t insert data unless you have another dependent piece of information. To see this from the above example, we can’t insert the teacher’s data unless we have student data (assuming the key is student_id).
Deletion anomaly means cases where deleting a record can mistakenly delete other prominent information which is in reality independent in nature. In the above example, deleting a student record will also delete the corresponding teacher’s record and that’s not what we want. That’s because let’s say if a student leaves the school and we’ve to delete their data then doing it would also delete the corresponding teacher’s data and that’s not what we want because it doesn’t mean the teacher also left the school.
Modification anomaly means cases when instead of updating a single piece of information at one place we’ve to update it at multiple places. For example, student 1 and student 3 have the same subject, and that’s why they are allotted the same teacher. If we ever want to change this teacher's name (for some reason) we’ve to do it in 2 places. Assume if they’re duplicated at tons of places then we’ve to do it at every place to provide consistent data. and this you can guess is really difficult here.

So let’s see now what normalization really is and how it solves these anomalies.

Normal Form

Please note the last line from the above paragraph “It divides a big table into smaller tables until it doesn’t follow a set of rules ”. What are these rules? Actually, we divide the normalization process into a set of normal forms. A normal form is nothing but a form of a table that follows some norms (rules) that prevent the above anomalies to some extent. We have a list of normal forms,

1NF
2NF
3NF
BCNF
4NF
5NF
6NF

The complexity to divide tables into a particular normal form increases as we go down. Also, as we go down, all tables should also follow all the normal forms mentioned above. Please read the above line once more. It’s important!

Also, in real-life databases that contain multiple tables, they all must be normalized and all the tables should follow the norms, not just one.

Types of Normal Forms

Let’s discuss each normal form one by one along with an example.

Note: Article contains the usage of functional dependency occasionally when required. Discussing it here will divert us from the main topic but in short, it’s a way to represent the relationship between the set of attributes of a table.

For a table attributes (A,B,C,D), functional dependency could be like,

A->BC

It says from attribute A we can get the values of A, B, and C.

1st Normal Form (1NF)

It’s the first normal form and almost all tables already follow rules specified in it.

Rules are,

Each cell of a row must have a single value.
Each record should be unique.
Each column should have the same type of data.

Let’s understand them.

The first rule is easy. It says, a cell should have a single value or you can say it as a cell shouldn’t contain composite values. For example, if an attribute asks for the name of a student, we shouldn’t add 2 or more names to it. Just a single value.

The second rule is related to the concept of keys that identifies each record uniquely. Basically, each record inside a table must have something unique in it because if it’s not then it’s nothing but redundant data (because of duplicity).

The third rule is quite intuitive. See this for an example.

Name (string)	Age (number)	Portfolio_link (string)

There are 3 attributes in the above table i.e. name, age, and portfolio_link. And we define data type string, number, and a string for each corresponding attribute. Now, according to the rule, each value inside a name (string type) attribute must be a string. It can’t have numbers. Similarly, for age (number type), each value must be numerical.

Let’s see a table that doesn’t follow 1NF (though it’s very rare in real-life scenarios).

student_id	Name (string)	Age (number)	portfolio_url (string)
1	A	ABC	www.url.com
2	B	12	www.url1.com www.url2.com
1	A	ABC	www.url.com

Why is It not in 1NF?

The first row has defined ABC (string value) inside an age attribute (number type) – breaking rule 3
The second row defines multiple values in the portfolio_link attribute. – breaking rule 1
The third row is identical to the first row. – breaking rule 2

Let’s see a table that follows 1NF through an example.

Attribute1	Attribute2	Attribute3
1	A	value1
2	B	value2
3	C	value3

This table does follow each rule of 1NF since each cell has a single (atomic) value, as well as each record, is unique so can be uniquely identified.

2nd Normal Form (2NF)

A table is in 2NF when,

It’s in 1NF.
It shouldn’t have a partial dependency.

What is Partial Dependency?

Suppose a relation R, having attributes A, B, C, D. To represent it,

R (A,B,C,D)

functional dependency defined on it is,

AB->D

B->C

From given functional dependencies, we can conclude that AB is a primary key because through AB we can find the value of each attribute.

Therefore, the closure of R (A, B, C, D) is AB.

Now, all the attributes which are part of the candidate key, i.e., A and B are called prime attributes. And all other attributes that are not part of the candidate key are called non-prime attributes.

So, according to 2NF, non-prime attributes should depend on all prime attributes, not on part of the prime attribute. From the above example,

AB->D

B->C

A and B are prime attributes. C and D are non-prime attributes.

D follows the rule because though it’s a non-prime it depends on all prime attributes.

But, C doesn’t follow the rule since it depends on B only (part of the prime attribute) and this is called partial dependency.

So, according to 2NF, a table shouldn’t have a partial dependency.

To make this relation into 2NF,

Make a relation that has prime attributes as well as those non-prime attributes that are totally dependent on prime attributes.
Make a separate relation for attributes that are in partial dependency.

So, to make the above example into 2NF, we do,

R (A,B,D) AB->D
R (B,C) B->C

3rd Normal Form (3NF)

A table is in 3NF,

It must be in 2NF.
It shouldn’t have a transitive dependency.

What is Transitive Dependency?

Let’s take a relation, R (A,B,C) with functional dependencies defined as,

A->B

B->C

From the given dependencies, we can conclude that A is the primary key because from A we can get both B and C attributes (from A we get B and from B we get C). Therefore, the closure of R (A, B, C) for given dependencies is A.

Now, A is a prime attribute, and B and C are non-primes.

It’s in 2NF because there is no non-prime attribute that is partially dependent on part of the primary key since B (non-prime) is dependent on A (primary key) and C (non-prime) is dependent on B (non-prime). Also, if a given relation has only a single attribute as a primary key then partial dependency is not even possible because it’s single. Got the point?

But here, a non-prime attribute C is dependent on non-prime attribute B. And, this is called transitive dependency when a non-prime attribute is dependent on another non-prime.

To make this relation into 3NF,

Make a separate table of prime attributes and all other attributes directly dependent on it.
Make a separate table for non-prime attributes that are dependent on other non-primes and make determining non-prime the primary key in this new table.

So, to make above example into 3NF, we do,

R (A,B) A->B
R (B,C) B->C

Let’s see this relation, R (A,B,C,D)

A->B

B->C

C->D

Closure of this relation is A. Why?

Now, is it in a 2NF? Yes, because there is no partial dependency.

Is it in a 3NF? No, because there are 2 non-primes that are dependent on other non-primes.

To make it into 3NF,

R (A,B)
R (B,C)
R (C,D

How?

Boyce CoddNormal Form (BCNF)

Boyce-Codd's normal form comes into the picture because of some anomalies that are not handled by above mentioned normal forms.

If we notice, 2NF and 3NF both deal with non-prime attributes. If there is a functional dependency from A->B then whatever the A is B is always non-prime. They both didn’t deal with situations when B is prime.

So, that’s why BCNF.

A table is in BCNF,

It must be in 3NF
For any functional dependency A -> B, A must be a super key.

Let’s take a relation, R (A,B,C)

AB->C

C->B

Here, we can conclude AB and AC are 2 candidate keys.

Is it in 2NF? Let’s check it.

AB -> C is not a partial dependency because AB is prime and C is also a prime attribute.

Similarly, C -> B is not a partial dependency. Though C (part of prime attribute) is determining an attribute but since B is also prime so partial dependency doesn’t exist here. If you remember, partial dependency exists when there is a dependency from X->Y and X is prime and Y is non-prime. Here dependency is from prime to prime not prime to non-prime.

Is it in 3NF?

According to transitive dependency, a non-prime attribute is dependent on other non-prime but here it’s not the case. So, it’s definitely in 3NF.

But what is the problem? Now, we’re determining even prime attributes. But that shouldn’t happen. Ideally, prime attributes should be the one that determines others. Right? Because that’s why we make keys to uniquely identify the record. That’s what they’re capable of.

So, is the above relation in BCNF?

AB -> C holds true because AB is a candidate key and also super key.

C -> B doesn’t hold because C is a prime attribute which is determining another prime attribute B, and C is not a super key as well. Therefore, the given relation is not in BCNF.

To make it into BCNF

R (C,B) C->B
R (A,C) AC->AC

This decomposition might be confusing at first but it’s logical. Let’s see how.

See, whenever we decompose a table into smaller tables then we’ve to do it in a way that we are able to get the original table once we join these smaller tables again. This is called lossless decomposition.

So, for the above example, if we make tables,

R (C, B) C -> B
R (A, B) AB -> AB

We can’t have A, B, and C in a single table because that’s what we’re solving. A separate table for C -> B (C being a super key) is clear because according to BCNF any functional dependency from X -> Y, X should be a super key.

Now, there is confusion between R (A,C) and R (A,B). Why did we choose R (A,C)? It’s because of lossless decomposition. If we choose R (A,B) then how do we connect R (A,B) and R (C,B)? To combine them, there must be one common prime attribute which is a key because then only we are able to relate the two. Isn’t it? In R (A,B) AB is the key and in R (C,B) C is the key (no common prime attribute found).

But in R (A,C) AC is the key and in R (C,B) C is the key. As you can see, C is the prime attribute which is common and also super key in the other table. So, that’s why we chose R (A,C) over R (A,B).

Note: Lossless decomposition is a separate topic altogether. We just discussed a glimpse of it here. Please have a look at it on a separate blog.

4th Normal Form (4NF)

A table is in 4NF,

It must be in BCNF.
It shouldn’t have a multivalued dependency.

What is Multivalued Dependency?

Let’s assume a table Person where each person has 2 mobile numbers and 2 email_ids. How can we store them?

Name	Mobile_number	Email_Id
A	99XXXXXXXX	aa@aa.com
A	88XXXXXXXX	a@a.com
A	99XXXXXXXX	a@a.com
A	88XXXXXXXX	aa@aa.com

We can’t store multiple values in a single cell (condition of 1NF). So, this is the only possible way. But try to see the problem here. Mobile_number and Email_id don’t have anything related. They’re independent entities, yet relevant enough to describe an entity (person). We can’t omit them but storing them produces tuples that are not required. And that’s why this representation has a multivalued dependency. So, to avoid these issues we use 4NF.

To make it into 4NF,

R (Name, Mobile_number)
R (Name, Email_id)

Name	Mobile_number
A	99XXXXXXXX
A	88XXXXXXXX

Name	Email_id
A	aa@aa.com
A	aa@aa.com

Now, both the relations will have only 2 tuples. And, using this we maintained all the information and prevented redundant rows.

5th Normal Form (5NF)

A table is in 5NF,

It must be in 4NF
It shouldn’t have to join dependency

What is Join Dependency?

The concept is similar to lossless decomposition. If we divide a relation into smaller tables and combine them again then we must get the original table. To ensure it, we have to find a common attribute that is prime as well as a candidate key in the other table.

So, according to 5NF, if we are able to divide a table further into a smaller table and still the information is lossless then it’s not in 5NF.

Basically, 5NF is the last level and any further decomposition will cause data loss.

6th Normal Form (6NF)

6NF is yet to be discussed completely by experienced professionals. It’s proposed by experts and hopefully will be implemented in the near future. Until then 1NF to 5NF is the only major normal form to read and learn.

Prerequisites

What is a Key in SQL?

The key is nothing but a set of the attribute(s) of a table that can uniquely identify each record of the table. We’ve 8 different kinds of keys and each one of them has special use cases.

Super Key
Candidate Key
Primary Key
Alternate Key
Foreign Key
Composite Key
Compound Key
Surrogate Key

Foreign Key

In general, any basic database includes 2 or more tables at least. And, in most of cases, you’ll find some relation between them. So, in order to relate 2 tables with each other, we use the foreign key. We make the primary key of a table a foreign key in another table of interest. Let’s see this in action.

Suppose we have a table user and order, and the order table wants to use the contents of the table user. The functionality we’re looking for is that we are able to get each user and their corresponding order data. There’s one bad way and one good way to design this.

Let’s see a bad way first.

Table user

Sno (primary key)	Name	Age
1	A	18
2	B	19
3	C	16

Table order

S. no	Name	Age	Ono (primary key)	Order_name	Order_price
1	A	18	1	Pizza burst	500
2	B	19	2	Pizza mania	150
3	C	16	3	Cheese Burger	120

Why is this a bad way? First, we’re unnecessarily copying the entire table of the user inside the order table. Secondly, if a user changes their name then to reflect this change we’ve to modify the user table and also the order table. And it’s just 2 tables. Imagine if we have a bunch of tables then it becomes cumbersome to reflect all those changes. Isn’t it?

So, what is a good way? Use foreign key. See the design below.

Sno (foreign key)	Ono (primary key)	Order_name	Order_price
1	1	Pizza burst	500
2	2	Pizza mania	150
3	3	Cheese Burger	120

Do you get the point now? Instead of copying the entire records of the table, we set Sno (primary key in the user table) as a foreign key in the order table. We are leveraging the power of the primary key as it’s capable of uniquely identifying the records. So why copy? Instead, refer to those through its primary key and set it as a foreign key in another table.

Now, if you want to change anything inside the user, then only the user table has to be modified and all the changes will reflect everywhere automatically.

Conclusion

So, in this article, we learned what normalization in SQL is, why it’s introduced, and how it prevents all the anomalies discussed, i.e., insertion, deletion, and modification anomalies. We also discussed all the types of normal forms from 1NF to 6NF and their rules set. We also learned to decompose a relation into a particular form with examples.

Normalization in SQL: 1NF, 2NF, 3NF, and BCNF

Learn via video course

Overview

What is Normalization in SQL?

Normal Form

Types of Normal Forms

1st Normal Form (1NF)

2nd Normal Form (2NF)

3rd Normal Form (3NF)

Boyce CoddNormal Form (BCNF)

4th Normal Form (4NF)

5th Normal Form (5NF)

6th Normal Form (6NF)

Prerequisites

What is a Key in SQL?

Foreign Key

Conclusion