SQL Index |¦| Indexes in SQL |¦| Database Index

SQL Index |¦| Indexes in SQL |¦| Database Index

Today we are going to run SQL queries against
a table containing ten THOUSAND records. {{ Maniacal laughter }} {{ Phone call }} What is it, I’m in the middle of a video… You don’t say? ALL in RAM?… Well, alrighty then… Today we are going to run SQL queries against
a table containing one .. Hundred .. MILLION records. {{ Maniacal laughter }} But don’t worry. By using indexes, we can rapidly speed up
queries so you do not have to experience the phenomenon known as boredom. We will work with a single table called ‘person’
containing 100 MILLION randomly generated people. The first row is an auto-generated primary
key called “person_id” The other columns are first_name… last_name…
and birthday. To create this table, we randomly generated
names using the 1000 most popular female names, male names, and last names in the United States. We did not weight the names by frequency when
generating our random sample. The datasets and the Python code used to generate
the random names are available for download from Github. Today, we are going to pay close attention
to how long each query takes to execute. So on the right, we will maintain a list of
queries and how many milliseconds each query took to complete. These values will vary depending on the specifications
of your workstation. We will begin by counting the number of rows
in the “person” table… There are indeed 100 MILLION records. Notice how long it took to perform this simple
count. 3 seconds and 562 milliseconds. Or 3,562 milliseconds. Let us put this query and the time required
in our table. As another speed test, count the number of
people with the last name “Smith.” Execute Over 100,000 Smiths… And the query took 4,261 milliseconds. For a second test, count how many people are
named “Emma.” Execute… Nearly 50,000. And note the time: 4,066 milliseconds. What if we count the number of people born
in May, 1980? For this we will use the BETWEEN operator. Execute… 166,000… And the query took 4,480 milliseconds. A pattern is starting to emerge. Let’s quickly gather a few more speed samples. How many Michaels are in our database? 4,127 milliseconds How many people have the last name
‘Hawkins’ or ‘Snow’? 5,315 milliseconds And how many people were born on April 1, 1995? 4,110 milliseconds. We ran 7 queries, and the execution times
were all very similar. This is because in order to answer each query,
the database had to scan all 100 MILLION rows to check each record for a match. Our 7 scans of the entire table took a combined
total of 29,921 milliseconds. This means on average my workstation was able
to scan 23,394,940 rows per second. While this may seem fast, we can do better… MUCH better… And we will do this by building an INDEX. The name ‘index’ was chosen because of
how an index works in a book. For example, if you want to learn about “normal
distributions”, you would not flip through your entire math book page by page, scanning
for the term “normal distribution.” Instead, you would go to the index, find the
term “normal distribution”, and you would see the list of pages. This technique is used by databases. When you create an index, the database will
generate a method to rapidly find data based on one or more columns. As our first example, let us create an index
on the “person” table. To do this, write “CREATE INDEX” and then
give your index a name. There are many different naming conventions
that you can use. The important thing is to be consistent. We will use the table name, followed by the
field name, then add “idx” to indicate this is an index. Next, write “ON” followed by the table
name. Inside parentheses, write the columns to include
in the index. We will start simply and create an index for
the “first_name” column. Execute. That certainly took a while. Remember, the database had to scan 100 MILLION
rows and build a “first name” index from scratch. Let’s test our new index by repeating an
earlier query where we counted the number of people named “Emma.” Execute. This time, it took only 508 milliseconds. If we repeat the count of people with last
name “Smith”, it takes 4,415 milliseconds. What’s going on here? This is about as long as it took last time. It did not speed up because our index was
built for the first name column. The last name was not indexed. And as you might suspect, if we search for
people born on April 1, 1995, the query should not run any faster than last time. Execute. 4,011 milliseconds. No real change. Our index improved the speed of searches by
first name, but had no effect on queries by last name or birthday. However, there are other queries where the
“first name” index will improve performance. For example, count the number of people named
“Julie Andrews.” Execute. This ran in just 514 milliseconds. By contrast, if you count people with last
name ‘Andrews’ and born on June 12, 1992… It takes 4,274 milliseconds. The full name query benefited from the first
name index, because the database was able to use it to quickly find all the “Julies.” From this smaller set of records, it then
completed the job by finding people with last name ‘Andrews’. The second query, however, had no index to
help. It had to perform a full scan of the table
to find all people with last name “Andrews”, and then find those with a matching birthday. Fortunately databases are clever about how
to use indexes. For example, if you count the number of people
born on October 31, 1985 with a first name of “Mia”… It executes quickly: 512 milliseconds… But why? Wouldn’t the database first scan all rows
to find people with a matching birthday, and then filter by first name? No! Just because we wrote it in this order does
not mean the database will do its work in that order. The database will consider all possible ways
to execute your query, then chooses the optimal path. We call each possibility a “query plan”,
and the “query optimizer” picks what it thinks is best. You’ll be happy to know a database can have
more than one index. Better still, each index can be created using
more than one column. Let us create a multi-column index. Write “CREATE INDEX”, then an overly long
but descriptive name. Next, write “ON” then the table name. In parentheses, list the columns you want
to go into this index. Important Note: The order matters. Think of this as sorting the data first by
last_name, then by first_name Execute… It is now time to take this new index for a test run. How many people named “John Williams”
are in the database? Execute… There are 36 people. And it only took 27 milliseconds to run this query. What about Julie Andrews? 26 milliseconds. Well done, index.. Well done… A word of caution. Indexes are not free. Just as indexes in a book require paper, indexes
in a database require storage. Furthermore, when you add data to a database,
it needs to create new records AND update all relevant indexes. So go forth and use indexes, but use them wisely. And before you go, please help us achieve
our goal of being the first channel with
One Hundred BILLION subscribers! {maniacal laughter
hahhahaa please subscribe, really}

Danny Hutson

100 thoughts on “SQL Index |¦| Indexes in SQL |¦| Database Index

  1. If create index than add record to the database are new records will take into consideration? If yes then is it true that we will create index when we have less records on database and the time required for creating index be less?

  2. Your videos help me understand farrr better than my professors. Thank you so much for all you do!

  3. I can see how the sci-fi visuals & sound effects are engaging, but I feel like the robotic vocal affect detracts from the quality of this video. I appreciate the writing + examples though.

  4. So if I add a new row to the table, will it take 4 minutes ( in this case ) to rebuild the index? What data structure is used and how can the index be rebuilt in a smarter manner?

  5. I can't find the python code on Github, but I've seen it in action in another video of yours, I don't remember which one 😀

  6. Wow! Well done! This is by far the best video I've ever seen on SQL indexes.
    The demonstration was phenomenal!

  7. @Socratica Could you please share github link for this SQL file / python script which has 100K Rows. I tried to search on github, but it is not available there.
    You guys are the best. Love your work, Keep it up.

  8. woooow really interesting and the way of her teaching and animation was awesomee….upload more videos on sql ,clear explanation ……..congrats

  9. I just want to say THANK YOU!!!! I cannot believe that this was answered so quickly and easy to understand / digest! Do you guys offer some sort of bootcamp / offline support for those who want to expand their knowledge in this field? I currently work at a place that has never indexed their tables and the waiting process is horrendous! I'm using all sorts of ETL's (like Alteryx) and other AI's to support my end to end reporting process by updating my data in an incremental fashion…..in which I also need to work on partitioning my data as well….but unfortunately I'm not too fluent in this field of indexing.

  10. Hi beautiful devil 😈 while watching your vedios I feel forget all my SQL knowledge.

    SQL is very easy and interesting domain. Don't scared anyone.

  11. At last, I can understand mySql. Explained perfectly in a way my brain can comprehend.

    I will definitely be donating to keep you guys going. Thank you so much.

  12. some things worth pointing out:

    1. It's better to add the idxncixcix(for RDBMS that supports clustered index) as a prefix since index names can become too large,

    and although an index is not an "object" like a stored procedure, table or view it is still better to add prefixes for the identifier and suffixes for other uses.

    2. (for the viewers to better understand what's happening under the hood) – it took the first index 4 minutes to be created not because of the scan but due to the fact that it needs to sort the data on the fly while scanning the rows, which is resource-intensive on the CPU side rather than on the IO (not a linear scale).

    3. no offense but you should look into your workbench, 508ms to count ~50K rows (assuming your index is 3 levels deep, heck even 4) is rather slow, very slow unless PostgreSQL is doing something else behind the scene except for an index seek and scalar computation.

    4. the last index comment of "order matters" is a little misleading, in mathematical theory and the way that RDBMSs access the storage engine – it does not matter what the order of the columns is if you are using equality operators in the predicate, however order of columns can help in eliminating redundant indexes when having 3 queries which query on per say: Q1 (x = @x AND y = @y AND z = @z), Q2 (x = @x AND y = @y), Q3 (x = @x), there should be only one index and the order should be (x,y,z) thus the index can be reused, another point is for using order by operators explicitly or via window functions (pre-baking data order) that is when order of columns matter and also the ascending or descending order of the column.

  13. I love the enthusiasm, robotic fast (not boring) narration, clear and practical examples, and the visualization! Definitely subscribing!

  14. At 8:00 an index is created called "first_name_last_name_idx" on the columns (last_name, first_name) and she says "the order matters". — wouldn't it had made more sense to name the index "last_name_first_name_idx"?

  15. U guys are just awesome wonderful keep the same method of teaching love u guys so much 😘😘😘😘😘😘😘

  16. If instead of the nomenclature of the database, we observe the actual function of management of memory In-form-ation formulae in Mathematical Abstractions, we can recognize the superimposed steps of integration of data sets as time duration categories of timing.
    It's the specific functionality of QM-TIMESPACE "computation" in Principle.

    Whatever system or story coordinates access the meaning you are familiar with, of resonant harmonic integration of Actuality.., that is this Holographic Universe you believe, that is all there is, a format/formula, algorithm and paradigm, according to the approach to integration that has accumulated in personal memory. Something to think about.

  17. Did you just name your index first_name_last_name but create it using the last_name 1st and the first_name 2nd!? Was that a test?

  18. Do you love Socratica Programming videos? We have a Kickstarter to make 20 Advanced Python Lessons. Support what you love! http://bit.ly/PythonKickstarter

  19. I have studied mathematics and computer science for a decade In the university and can say that your channel has the highest utility factor. Meaning that, for some reason your content goes straight into my long-term memory, no confusion. Optimal learning for the shortest time.

  20. It's official! The Socratica Python Kickstarter was a success! Thank you to all of our supporters. Because of you, many more Python videos coming soon!! 💜🦉

  21. Nice explanation. How you setup your database server impacts query performance. If you have a SQL Server system running on a virtual server, you'll usually get slower query performance. On a dedicated server with ample processing power and memory, you'll usually get improved performance. But of course, cost is factor.

    And thank you for pointing out that indexes do not come for free. You can index a table, but storing the data costs more and also impacts performance. On a very high transaction system, you'll want to index to improve performance, and just spend more on memory and storage for performance. But given that the costs for memory and storage have decreased over the years, then go ahead and spend it.

  22. Around 5 years ago I was reading, that it will happen soon and 2 years ago I saw it on a presentation of IBM: SQL experts are not needed anymore, using Mahine Learning it is possible now to generate highly optimised SQL codes from most simple and unefficient codes. This means, you only need to learn the absolute basics of SQL an no more, use the time to learn more important stuff, like improving your math, number theory for efficient coding and stochastics and linear algebra for Machine Learning.

  23. It's not a tutorial or a demonstration but a Great show👏 . Talking about clustered and non clustered index it's also an interesting continuity for this video.

Leave a Reply

Your email address will not be published. Required fields are marked *