How to do database normalization

How to do database normalization


Database normalization is a process, Of arranging your tables, columns and relationships between tables; In order to avoid data redundancy and to attain data integrity Normalization: to put correct data in correct tables In order to avoid CUD (Create, update and delete) anomalies (issues) So let us discuss in detail now as to why should we do database normalization Let us write a table which is in de-normalized form So this is my FilmAgents table it has data such as agent_id, agent_name the agents are from certain agencies the agencies are located in LA, NY, LA and Mumbai and they have customers agent A1(Ari Gold) has two customers; Bread Pith and Fakira Tori Bronze has Nedd Stark, CNolan and Solman Kan and Gori Diamond (agent) has two customers; Rajini Sir and Saruk Kan This table is in de-normalized form. Now, why do I say that? because it has certain issues the same issues to avoid which we do database normalization There is data redundancy, it is clearly visible you can see that the agency name has been repeated and because the agency name is repeated so is the agency office location and office contact no. as I said before, we need to avoid CUD (create, update, delete) anomalies let us start with CREATE (or INSERT) anomaly that exists here BTW, AgentId column is the primary key here. I will denote it by underlining the column name so AgentId is my primary key in this FilmAgents denormalized table let us suppose that there is a new agency in the market called as Free Folks (watch GoT) this agency is based in Los Angeles (where else) and they have already acquired a client;
Mr. Morgage Freeman the problem is, as this is a new agency, they don’t have an agent yet so we can’t create a database entry in this table (FilmAgents) right now for this new agency because they don’t have an agent So AgentName value will be null, which is still okay but the primary key column will also be null, as there is no AgentId because there is no agent so we can’t insert this data (new agency info) into this table now that’s a problem, the agency is there but we can’t store that data because don’t have an agent so this is an example of Insert (Create) anomaly Now let’s look at an example of Update anomaly let’s say that “New Cine” agency has moved out from LA to San Diego and hence they have to change their phone number so this (111-222-333) has been changed to 222-444-555 and this will be San Diego, this has to be updated but the problem is that because this data is repeated in multiple rows, we are only looking at 4-5 entries right now (thanks to my white-board size) but imagine if this table had 100s of records, there will be at least 20-30 entries of data for New Cine agency so we will have to make sure that, we update these new details (contact number and office location) in all of those 20-30 rows if, it happens that this new information gets updated in few rows and not in remaining rows then in that case, the data will be in inconsistent state So this is called as UPDATE anomaly and it is arising because you have data redundancy (duplication) Now let’s look at an example of DELETE anomaly Now we have this agent, Gori Diamond from the FireFolks Pvt (agency) in Mumbai What if tomorrow, Gori Diamond is no more with us (permanent retirement) due to some unforeseeable circumstances so surely, AgentId being the primary key, we will have to remove this entry of (RIP) Gori Diamond from the table but the problem is, if we remove that entry, the information about FireFolks Pvt Ltd (the agency) will also get removed from the table the agency info will also be gone and if we assume that there was only one agent (RIP) Gori Diamond for this agency, then we won’t have the agency information at all What about the clients? Mr Rajini, Mr Saruk Kan? so we can’t possibly delete the agency just because the agent is no more (RIP Gori Diamond) so this is an example of DELETE anomaly there is one more issue with this denormalized table although we don’t have something called as READ anomaly Let us suppose I want to find who is the agent or which is the agency working for Mr Morgage Freeman So how do I write the query? Select AgencyName from FilmAgents table where customer1=”M Freeman” OR customer2=”M Freeman” OR customer3=”M Freeman”; what if tomorrow we add two more columns to add infor about five customers so we will have to add two more columns in where clause in all our queries feels a bit clumsy right? What if we want to sort the data based on customer name so on which column will we sort? on customer1, customer2 or customer3? so there are problems in the retrieval of data also in this denormalized way of storing data So let us rather normalize this database In database normalization there are few forms, you can achieve database normalization in stages the forms are 1 NF, 2 NF and 3 NF (NF: Normal Form) apart from these three, there are few more normal forms such as: BCNF (Boyd-Codd Normal Form) fondly called as 3.5 NF and there are 4 NF, 5 NF and 6 NF also but it is generally said that; if your database is in 3 NF, it is sufficiently normalized so we will study normalization till 3 NF What are the rules for attaining 1 NF? 1 NF says that; each column should be atomic i.e you should not have column values as Fakira, Sakira single value in every column and, there can’t be a group of columns representing similar information so we can’t have three customer columns the second rule of 1 NF says that: there should be a key which identifies each row of the database table that key could be one column (like AgentId) or a combination of two or more columns So currently this table is not in 1 NF. Why? because it has a group of columns representing similar information (customer1, customer2, customer 3) In order to achieve 1 NF, I will divide this table into two tables so I will separate out the customer information into another table so now I will call this table as AgentInfo and I will create another table called as CustomerInfo so as you can see here, I have customer id (CId) as my primary key agent id value for this row is: A1 so I can identify that Mr Bread Pith is represented by agent A1 which is: Ari Gold Mr Morgage Freeman (C6) is represented by agent A4 that is Gori Diamond (if she is still alive) and it has customer city and customer pin code so now our database is in 1 NF all the columns are atomic because they have single values and there are no more group of columns representing similar information Also, as the rules of 1 NF says for each table there should be a column or a group of columns which should be uniquely able to identify each row In short, there should be a primary key So in CustomerInfo table we have CId (customer_id) and in AgentInfo we have AgentId so we have primary keys and we have atomic columns so we are in now 1st normal form now being in 1 NF, we have resolved certain issues now we can add unlimited amount of customers for each agent remember earlier we had three columns customer1, customer2, customer3…so at max every agent could represent three customers In this table (CustomerInfo) you can keep adding new entries, so one agent can have as many customers he/she wants to represent Sorting, remember we could not sort data based on customer name now we can sort the data based on customer name as it just one column instead of three insert and delete anomalies on the customer table are also gone I could delete a customer, let us say Mr Morgage Freeman goes out of planet in that case, I can delete this entry of Mr Morgage Freeman from this table and still my agent info A4 (i.e of Gori Diamond) will not be harmed in any manner So delete anomaly is avoided but we still have some anomalies left in this design we still haven’t solved this problem there is a new agency in LA but as they don’t have any agents so they can’t represent any customers so although we are in 1 NF we still have some anomalies left so we will now go to 2 NF in order to get rid of those anomalies, have more data integrity and less data redundancy in order to attain 2 NF, you should adhere to all the rules of 1 NF; as I told you earlier, they are progressive plus all non-key columns should be dependent on primary key So first, what is a non-key column? so in this AgentInfo table, AgentId is primary key here (in CustomerInfo) table, CustomerId is primary key so all these other columns (AgentName, AgencyName, location, contact) are non-key columns so primary key could be a single column as in our case or it could be a combination of multiple columns for e.g here the primary key could be (a combination of) Customer_Id and CustName so at times, primary key is a composite key i.e it is a combination of two columns so let us say here if CId and CustName form your composite key; then CustCity and CustPinCode will be your non-key columns (* AgentId is foreign key, so not a non-key column) here, in our case (AgentInfo table), it is more simpler AgentId is the primary key, so all these are non-key columns they (non-key columns should be dependent on primary key) so what do you mean by dependent? In order to answer that question; are they dependent or not? you have to ask one simple question Does that non-key column describe what this primary key identifies? so in this case, consider OfficeLocation column (a non-key column) Los Angeles: Does it describe what is uniquely identified by this primary key A1? This primary key A1 is uniquely identifying what? The Agent: Ari Gold Now Los Angeles office location, does it describe Ari Gold? Not really. It is actually describing his office, his work-place So if your answer is yes, for e.g column AgentName primary key A1 is representing this agent AgentName is describing that agent? Yes so it (the column) should be in this table if the answer is NO, in the case of OfficeLocation column, e.g Los Angeles then that information should not be in this table it should be in some other table they say that; for a table to be in 2 NF it should have a single purpose so our current database design is not in 2 NF why? because in AgentInfo table we have three columns (AgencyName, OfficeLocation and OfficeContact) which are not describing the agent they are describing his/her work-place So this is not in 2 NF So lets make our database in 2 NF if your non-key columns are not dependent on your primary key; meaning if your non-key columns do not describe what is being uniquely identified by your primary key; move those columns into another table so I have divided my database into three tables now AgentInfo (table) has AgentId, AgentName and AgencyId AgentId is primary key and AgencyId is my foreign key we have established a relation between Agents and Agencies via foreign key of AgencyId Thus we know that this agent: Ari Gold belongs to New Cine Agency What about, we removing the foreign key AgentId from the CustomerInfo table so how would we know, a customer (e.g Bread Pith) is being represented by which agent and consequently by which agency? so we will create a join table so this is my join table It is called as join table or intersection table it will have just two columns: AgentId and CustomerId both foreign keys join tables are usually used to represent many-to-many relationships an Agent can have many customers and a Customer can be represented by many agents (that is what our system says, usually it’s one-to-one) so from this table we get the information that; agent A1 (Ari Gold) is representing customer C1 (Bread Pith) so this is our join table 2 NF: All the non-key columns shuld be dependent on primary-key here (in AgentInfo) table, AgencyId is a foreign key so it’s not a non-key column rest of the columns are dependent on the primary key as they describe the agent here (in AgencyInfo table) AgencyName, OfficeLocation and OfficePhone are describing the agency so they are dependent on priamry key here (in CustomerInfo table) CustName, CustCity and CustPinCode are describing the customer which is uniquely identified by this CId primary key hence they are also dependent on primary key so all these three tables are 2 NF compliant so is this table, because there are only keys here both columns are foreign keys so these rules (2 NF) do not apply here at all so all these tables are in 2 NF, hence we are (our database) is in 2 NF now this in itself is a very good database design In fact, 3 out of 4 of our tables are already 3 NF compliant except for this CustomerInfo table so let us see now, what is 3 NF so 3NF will be all the rules of 2NF plus all the non-key columns should be NON-TRANSITIVELY dependent on primary key so what do you mean by NON-TRANSITIVE ? so first let us see what is Transitive A depends on B and then B depends on C so in turn we say: A is dependent on C so this is transitive property 3 NF says: all non-key columns should be non-transitively (i.e directly) dependent on primary key in our case, this CustomerInfo table is violating that rule How? so we have the CId (customer_id) as primary key CustName describes the customer which is uniquely identified by this primary key hence CustName depends on the primary key CustCity? The same. It describes the customer hence it is directly depends on the primary key now, CustPinCode? CustPinCode depends on CustCity and CustCity depends on customer id (primary key) so here CustPinCode is transitively dependent on the primary key CId (customer id) not directly but transitively through CustCity column hence this table (CustomerInfo) is not in 3NF so let us make it in 3NF I will create another table: CustomerPinCodes It has CustPinCode and CustCity as columns now this is the foreign key here (CustPinCode) joining these two tables hence that non-transitive rule does not apply on this column (CustPinCode) anymore as it is not a non-key column anymore. It is a foreign-key column. so this table (CustomerInfo) is in 3NF And here also, City is describing (or directly dependent on primary key) Pin Code hence it is also 3NF compliant So now our database is in 3NF It is a good relational database design when your database is in 3NF

Danny Hutson

100 thoughts on “How to do database normalization

  1. Can you help me understand why it would be a bad thing to put the Agent ID and Agent name in the Customer Info table? I get it doesn't follow 3NF because of the non-transitively rule, but in order to find out something simple like "Who is the customer's Agent?", you're going to have to write a several queries or one long join query to get something that could be a simple select statement. How could the data become redundant or lose integrity by this? Thanks for your help.

  2. I may be wrong but @6:14 sir you said data integrity is data inconsistency. can you please explain this.

  3. Morgan freeman to Mortgage freeman was funny i couldn't stop cracking but Sir you are doin' a great job!

  4. Before this lecture, many teachers guided me how to normalize the data. But this lecture is awesome. My all concepts about normalization are cleared in just half an hour. Excellent And Great Teacher. Thank you so much!

  5. after 2 years at collage not understand normalization, but i was able to put my tables into the 3nf (i know its werd). i was final able to understand normalization perfectly. this video saved my life as its the best one out of all the normalization videos ive watched, thank you so much

  6. I kinda always knew Mr. Mortgage Freeman is not from this planet. That also explains how he knew so much about space and wormholes.

  7. Dude, he's an awesome teacher!

    I have a bias typically attached to heavy accents, but his video is super well done and laid out. You can follow him easily.

    I just feel bad for Gori Diamond and Morgan Freeman……..=/ . Morgage will never be casted again! =)

  8. Yeah I had Brad as a tutor at University when studying database design. He was great but he'd always lean down next to me to watch what I was typing while munching on potato crisps. I don't think he realised it but he'd let the odd moan out while crunching in my ear. Breath stunk of bile and throat stones

  9. This was a very helpful video. I had been reading through a lot of material prior to this and it just was not making sense to me. Your explanation cleared up all my confusion! Thank you!!!

  10. Like this is straight up amazing! I sat through lectures on this topic given by a bad lecturer thinking I was a moron for not getting it. Watched a 20 minute video by this legend of a man, and it's like someone turned a light on in my brain for how we normalize, and why. You just earned yourself a sub!

  11. One of the simplest videos to understand database form normalisation. Thanking you, all the way from Sri Lanka.

  12. Great job on the definitions and explanation. It kept my attention, but it appears he over normalized the Customer Info table. There is no reason to move CustCity and CustPin (zip code) to another table when they both depend or directly related to CID. Just like cust name depends or directly related to CID, so should cust city and cust pin. The customer name city and zip code tells you who the customer is and where customer works/lives. Plus it meets CUD. Now if the the intention is to add more CustPin and CustCities to a customer then it would not be in 3rd normal form. But the way it is designed now, then you have to write to update statements or make sure cascade update is on.

  13. Didn't actually follow the process. Took a big jump into dividing into two tables at about 10 minutes with no real explanation. Bit crap to be frank.

  14. I have looked through lots of material on NORMALIZATION and this has been the *BEST* explanation. Great Job!!!

  15. This video is fantastic. I thought he genuinely didn't know how to pronounce those names but then I saw him doing the same with Indian names so I'm sure he's just messing with us. This video had me in stitches at times!
    Also, more importantly, I've gotten a good overview of what normalisation means. Thanks, man!

  16. Hey, Dean From Ireland. My database lecture is not good a teaching in general. I thought I was screwed for my summer exams until I came across this. Thanks you

  17. Basically, you got your pronunciations messed up since you're spellings are absolutely wrong. Otherwise, your explanation was really good. Simple and easy to follow.

  18. literally after going through 4 other videos from different people explaining Normalization, you got the best explanation thanks man keep it up.

  19. NOTICEEEEEEEEEEEEEEEEEEEE DEAR TEAHCER:-

    PLEASE ANSWER : IN THE LAST STEP OF 3NF… IS IT FINE TO MOVE THE CUSTPINCODE TO THE NEW TABLE AND LEAVE THE CUSTCITY IN THE ORIGINAL TABLE BECAUSE OF THE non-transitively dependent RULE.

  20. You just screwed yourself dude. You left out "Rajni Sir", big mistake…prepare to "be no more with us due to unfortunate incident". Jokes apart, I enjoyed this video.

  21. why in 2NF, you have removed agent id from the customer table? it could be part of it and as in agent table you linked agency id

  22. Hello i have to submit an assignment on attendance sheet on mysql…So i have to design tables with normalization and then use stored procedures and create tables…any help would be appreciated pls help

  23. There is a mistake in that tutorial. There is no anomaly that creates. Instead of creating, the insert must come.
    Then you need to ignore the anamolys that insert, update, delete in normalizations.

  24. well done. But I prefer my databases to be in 1,592th normal form. If you don't have a 300 iq you will never understand all 1600 normal forms

  25. OMG.. you taught me a lot here, but the Mortgage Freeman part killed me man ! I hope Mortgage Freeman watches this video !

Leave a Reply

Your email address will not be published. Required fields are marked *