All About Indexes: The Very Basics
There are many things that impact database performance. The most obvious is the amount of data: the more you have, the slower your database will be. While there are many ways to fix performance problems (like removing old data or using denormalization), the primary solution is to properly index your database. In this series, we’ll explain several very important indexing concepts, starting with the basics and ending with best practices.
Why Do We Need a Database Index?
To answer this question, we’ll tell a story. Let’s imagine that we manage a library and we have a database to store information about our books. For each book, we store the barcode, title, author, genre, publisher, and publication year. We could save all this in one big file, with each row representing a single book. In that case, the order of books would be chronological, according to when the record was added.
If we want to find a book (maybe The Adventures of Huckleberry Finn, by Mark Twain), we’d need to scan every record in our list. Starting from the beginning of the file, we’d test search conditions on each record. The search would be terminated after we found the first matching value – and this could be the very last record! Technically, this approach works; but as our file becomes larger, the impact on performance will become more obvious. At some point, our system will become unusable.
This scenario can easily apply to databases, with some additional factors. For example, each database record should have at least one unique key. That key might contain some real-life data, but in most cases it’s just an automatically assigned numeric value. To find any given record in our database, we’d have to use:
- The primary key value, if we knew it. Remember, PKs are not real-life data, so there is little chance we’d know this value.
- Real-life values, such as the book’s title or author.
Even with this information, sorting through a large amount of data is time consuming. Just picture reading a list of every book in your city’s main library to find a single title! Fortunately, there’s a more efficient way to get things done.
To speed things up, many designers rely on a database index.
What Is a Database Index?
A database index is a specialized data structure that allows us to locate information quickly. It’s organized like a binary tree structure, with smaller values to the left and larger values to the right. Rather than force a scan of the entire table, an index can compare row values in the tree-like structure to locate desired data faster.
When we create an index on one or more columns, we store their values in a new structure. We also store pointers to rows the values come from. This organizes and sorts the information, but it doesn’t change the information itself. Think of a database index as the digital equivalent to the index in the back of a book. While it stores some actual information, it also includes pointers to the locations where more details can be found.
After the data is sorted by our search criteria, locating the desired records becomes much simpler. Picture an old telephone book that was sorted alphabetically. Knowing someone’s last name, first name, and address meant you could find their telephone number very quickly. But what if you only knew someone’s address and first name? Without the last name, finding the phone number would be quite difficult. You’d do much better with a reverse telephone directory, which lists phone numbers based on addresses.
In databases, changing your search criteria often means creating a new index for a combination of attributes. As mentioned, adding these indexes would require additional disk space. When we add, delete or update values, changes are also made to the indexes.
Why Indexes Are Important – the Math
In most cases, we can find data faster using an index than by searching sequentially through the database. The exception is if we only have a few records in our database. If we expressed this in a formula, with t = time, then tusing_index < tsequential search . We can calculate these values, and the result of that calculation is the formula for algorithm complexity.
Algorithm complexity determines how much time is needed to perform an operation. Since hardware configurations differ, we’ll use the amount of data in our dataset as a parameter, with n = the number of records in our database. So, if we have 1,000,000 books in our library, then n = 1,000,000.
If we wanted to find a record for The Adventures of Huckleberry Finn, we’d have to look at each individual book title until we found the right one. If n=1,000,000, that would mean going through half a million books on average! This sequential process is called a full table scan, and the algorithm complexity (n/2) is directly related to the size of the dataset. We’ll use O(n) to point to that. Note: “O” is used to describe the most important term in the algorithm. For example, if we calculate that an algorithm has 2n3 + 4n + 21 operation, the big O notation for that algorithm is O(n3).
On the other hand, if we use an index on the title attribute, we can find our record much faster. If the books were already organized alphabetically by title, we’d simply look at the book in the middle (the 500,000th record). If our desired value comes before this record, we’ll search for it in the left data subset; otherwise, we’ll look in the right subset. By continuing to split our list in half, we’ll find the record we want. The most we’d need to repeat this process is 20 times (2^20 = 1,048,576). The complexity of such an algorithm is expressed as O(log(n)).
Obviously, the indexed approach gets better results. There is one prerequisite to using it, and that is having an index on the title attribute. This will result in faster search operations, but we’ll lose some disk space and experience slower performance when adding and deleting data.
An Example Database Model
We’ll use this model to explain indexes in this and in upcoming articles. It is designed to store all the relevant data about our library’s books.
I won’t go into model details because they are self-explanatory. The most important thing to know is that some tables:
- Are dictionaries, which we don’t expect to change frequently.
- Contain a lot of data (
book_author). Of course, any big library could have thousands or even millions of books. It’s thought that the Ancient Library of Alexandria had more than 400,000 scrolls, and that was 2,300 years ago!
How to Create an Index
Adding a new index requires more disk space. If we try to index every attribute and all possible combinations, we could end up with performance that is worse than what we initially had. So when considering a new index, remember:
- Add indexes only in tables where we expect a lot of data. In our model, that could be the
- Only add an index to the attribute(s) that we expect to be searched often; for example, it’s logical to assume that people will search for book titles, so the
book_titleattribute will have an index.
- Add an index to a field if UNIQUE can’t help us. In
book_details, for example, it’s entirely possible that several books will have the same name.
- If we use more than one field for an index, we must order them properly.
We could easily create an index on the
book_title attribute using the following SQL statement:
CREATE INDEX book_details_idx_1 ON book_details (book_title);
We could also create an index directly in the model. Vertabelo offers a very easy way to add an index to one or more fields. After you select the appropriate table in the model, select “+ Add index” in the “Indexes” section. The index name will be generic, and you’ll have to select all the columns that form that index. In our example, only the
book_title column is needed, and we’ll leave the default “ASC” option.
The “SQL preview” option displays the following code, which creates our table:
-- Created by Vertabelo (http://vertabelo.com) -- Last modification date: 2016-06-10 15:29:30.603 -- tables -- Table: book_details CREATE TABLE book_details ( id int NOT NULL AUTO_INCREMENT, book_title varchar(255) NOT NULL, edition text NULL, publisher varchar(255) NULL, publish_year int NULL, language_id int NOT NULL, CONSTRAINT book_details_pk PRIMARY KEY (id) ) COMMENT 'list of all book titles'; CREATE INDEX book_details_idx_1 ON book_details (book_title); -- End of file.
Using the same logic, we’ll create two more indexes on the
author table. The first index uses the
first_name column and then the
last_name column. The second index reverses this order, using the
last_name column first and then the
first_name column. They may look the same, but the order of columns in the index is crucial; create your index so the leftmost attribute in the index is the one you’ll use first.
Why are there two indexes on the same attributes? People are likely to search for an author in several different ways. The first index,
author_idx_1, allows us to find records using the
last_name pair or the
first_name attribute alone. But it won’t work for the
first_name pair or even the
last_name attribute alone. The second index,
author_idx_2, will do that. So no matter how an author’s name is entered into the search, a meaningful result will be returned.
CREATE TABLE statement for our table now looks like this:
-- Created by Vertabelo (http://vertabelo.com) -- Last modification date: 2016-06-10 15:36:04.529 -- tables -- Table: author CREATE TABLE author ( id int NOT NULL AUTO_INCREMENT, first_name varchar(255) NOT NULL, last_name varchar(255) NOT NULL, birth_date date NULL COMMENT 'it is here just to distinguish authors with same first and last name (if any)', CONSTRAINT author_pk PRIMARY KEY (id) ) COMMENT 'authors list'; CREATE INDEX author_idx_1 ON author (first_name,last_name); CREATE INDEX author_idx_2 ON author (last_name,first_name); -- End of file.
Things to Remember About Database Indexes
If a table has one or more indexes, it will definitely slow the INSERT operation. This is because when a record is added to the table, it has to go in the right place. Therefore, the index will need to be adjusted, which takes time.
If we expect significant changes in a table that has indexes, we could drop the indexes first, insert new records, and then re-create the indexes. This could be the faster solution, depending on circumstances. In our model, if we’re adding lots of new book titles, it would make sense to drop the indexes. It’s reasonable to assume that we’d prepare a list of new books and their related data and run the script after working hours.
When an Index Is Created Automatically
As mentioned before, an index is automatically created for the primary key. In all of our tables, the primary key is called simply
id. It’s of the int type, and autoincrement is set to
In the MySQL InnoDB engine, an index is also created automatically for foreign keys.
If you define an attribute as UNIQUE, an index will also be created on it automatically, just as it would for a primary key.
In all of these cases, indexes are created on keys, which speeds up search operations (when using the id as a search parameter) quite a bit. This helps us in cases where we know the id and we want to retrieve, update, or delete the value of that record.
Indexes are very powerful. They allow us to perform search and sort operations much faster. But that speed comes with a price: creating an index will require disk space and could slow down other operations.
In upcoming articles, we’ll focus on database performance and other issues relating to indexes.