We’re all familiar with the indexes in books: they help you find specific contents much faster by telling you where they’re located. In a nutshell, database indexes essentially do the same thing—they let you retrieve information from a database much faster by narrowing down the scope of your search.
In fact, creating indexes for database tables is one of the most important concepts of database modeling. It’s also often one of the first things you should consider if your query is running too slowly, in which case it may benefit from indexing. In this article, I'll explain all you need to know to understand mySQL indexes while at the same time demonstrating how indexing can improve the performance of your queries.
Imagine you want to find a particular chapter (say 10) in a book that happens to lack an index. The only certain way to find this chapter would be to scan the entire book until it’s found. If the book did have an index and it told you that chapter 10 starts on page 1023, you’d know that you can immediately skip all other pages and go directly to that chapter’s location.
Now, let's imagine a similar situation with a database table that has thousands, or perhaps millions, of data records. If you have to read each row to check if a particular record is present (called a full table scan), it will take you way more time to retrieve your record than if you were to specify an index pointing to the table section where your data is located.
As you can imagine, using indexes can significantly improve a query’s runtime performance if the number of records in your database grows. With our analogy, the advantage isn’t too obvious if we imagine having to sift through, say, a 100-page book page by page to locate the desired chapter, but suppose we had to scour all of a 5000-page encyclopedia—that would take forever!
I’ve prepared some sample data that you can download. This test data represents weather samples that I will use to guide our understanding of database indexing. Here’s a brief snapshot:
In this diagram, there are three tables. The first table is titled
indexed_samples; this is where we will be adding indexes to compare records against those of the second table,
unindexed_samples, which, as the name suggests, lacks indexes. Both of these tables store information from 1.8 million records of weather data. Finally, the third table
small_samples will be used later on to illustrate special types of indexes.
I highly recommend you download the files and test this out in your local environment, as it will be faster and easier to go over such a large number of records. I’ve opted to use tables that store lots of data in order to illustrate just how much indexes can impact query importance as the number of records increases.
A quick peek at indexing
To see the power of indexing in action, go ahead and run the following query on the indexed_samples table to create an index for the wind speed column (we’ll get into the finer details afterwards):
ALTER TABLE indexed_samples ADD INDEX index_windspeed (wind_mtsperhour ASC);
Now, let’s see how long it takes to get the maximum wind speed on the unindexed table by executing the following query:
SET SESSION query_cache_type = OFF; #prevents caching so we can see the impact SELECT max(wind_mtsperhour) FROM unindexed_samples;
Running the second query above will return the
max(wind_mtsperhour) value of “1446485”. Pay attention to how long it took this query to run. Your results may vary, depending on the specifics of your machine’s hardware, but it should look something like this:
|1 row(s) returned||1.891 sec / 0.000 sec|
Let’s now run the same exact query, this time using our
indexed_samples table to demonstrate the computing power of indexing:
SELECT max(wind_mtsperhour) FROM indexed_samples;
max(wind_mtsperhour) value remains unchanged, as expected, but the time it took the query to execute is now so miniscule that it essentially rounds off to zero, suggesting it took almost no time to complete!
|1 row(s) returned||0.000 sec / 0.000 sec|
Operations that benefit from indexing
As we saw in the previous section, indexing can significantly improve the performance of a query, and it doesn’t just apply to those that perform trivial tasks like searching for minimum/maximum values. In general, mySQL queries benefit from data indexing in the following scenarios:
- finding rows that match a
- removing rows that do not match a
- row retrieval from other related tables when performing joins
- finding the minimum/maximum value for a specific column, as in our example above
- sorting or grouping the records of a table
More importantly, in some situations, a query can return the requested information directly from the index without ever having to consult the data rows themselves, allowing it to perform much faster than normal (a mechanism known as covering indexes).
Indexes may correspond to one (single-column indexes) or many columns (multi-column indexes). They may also be used for the entire column or just part of it (prefix indexes) and can even enforce special constraints, such as via
FULL TEXT. We won’t cover
FULL TEXT in this article, but you can read more about it in this Vertabelo blog if you’re interested.
Indexes can correspond to a single column, like the one we already saw, or even several columns. We can create a general index that corresponds to a single column either when we create our table or at a later time, once the table has already been created.
The syntax for adding an index to, say, the
wind_mtsperhour column when creating our sample table would be the following (note that
index_windspeed is the name of the index):
CREATE TABLE indexed_samples ( measure_timestamp datetime NOT NULL, station_name varchar(1000) DEFAULT NULL, wind_mtsperhour int(11) NOT NULL, windgust_mtsperhour int(11) NOT NULL, windangle int(3) NOT NULL, rain_mm decimal(5,2) DEFAULT NULL, temperature_dht11 int(5) DEFAULT NULL, humidity_dht11 int(5) DEFAULT NULL, barometric_pressure decimal(10,2) NOT NULL, barometric_temperature decimal(10,0) NOT NULL, lux decimal(7,2) DEFAULT NULL, KEY index_windspeed (wind_mtsperhour) );
Alternatively, if the table already exists, we can add the index to it by
ALTERing the table, like we did in the introductory example:
ALTER TABLE indexed_samples ADD INDEX index_windspeed (wind_mtsperhour);
It’s also possible to use the
CREATE INDEX statement to achieve the same result:
CREATE INDEX index_windspeed ON indexed_samples (wind_mtsperhour);
A full list of index options and modifiers can be found here. In the rest of this article, I’ll use the
ALTER TABLE statement, since we will be modifying the base
MySQL :: MySQL 5.7 Reference Manual :: 13.1.14 CREATE INDEX Syntax table.
Before we move on, you should know that you can remove a particular index at any time with the following syntax:
ALTER TABLE table_with_indexes DROP INDEX index_name;
Indexing string columns
In general, mySQL has a 767-byte index length limit for InnoDB (1000 on myIsam) for string columns. This means a regular index cannot include the full column string but only the first 767 bytes of it. The part that can be indexed is called the prefix. The prefix length can be specified after the indexed column name upon index definition, like so:
CREATE INDEX index_station_name ON indexed_samples ( station_name (100) );
Here, we’re creating an index that will consider the first 100 characters of the
station_name column (which is 1000 characters long). For
VARBINARY column indexes, the prefix length is optional. If you don’t set it, the engine will use the maximum length possible when considering the column’s character set.
TEXT columns, you must specify the prefix length explicitly, or you will get
ERROR 1170: BLOB/TEXT … without a key length. It’s important to note that the prefix size you specify will be considered in bytes instead of characters for the
BLOB column types.
Using the special index modifier
FULL TEXT will index an entire column, and the prefix will be ignored if it is specified. However, as I noted before, I won’t cover that type of index in this article.
As noted before, indexes may also correspond to more than one column (up to 16 in mySQL). The only difference between single- and multi-column indexes is that the latter requires specifying several columns instead of just one. So, for starters, let’s create a single index for the wind speed and wind gust columns:
ALTER TABLE indexed_samples ADD INDEX idx_speed_gust (wind_mtsperhour, windgust_mtsperhour);
Just as we did with single-column indexes, we can specify a prefix length for the string columns of a multi-column index if we want to. For instance, we could create another index that includes the first six characters of the
station_name and the
ALTER TABLE indexed_samples ADD INDEX idx_station_angle (station_name (6), windangle);
How multi-column indexes are used
The engine will create multi-column index key values by concatenating the column values in the order they are specified for the index. Therefore, only queries that conform to the following criteria may use the index:
- queries that test all columns used in the index
- queries that test just the first column
- queries that test the first two columns
- queries that test the first three columns
- and so on...
So, the order in which you specify the index definition is critical to how the engine can use it. Let’s take a look at when the engine can use our multi-column index from the prior example (with the
A query including all columns in the index will use it:
SELECT * FROM indexed_samples WHERE station_name = “test 1” and windangle = 180
Or this one, too, which includes the first column:
SELECT * FROM indexed_samples WHERE station_name = “test 1”
But, this one will not use the index:
SELECT * FROM indexed_samples WHERE windangle = 180
The last query above does not use all columns: it only uses the second indexed column. Therefore, indexing cannot be used. This also applies when using
OR in a
WHERE clause, as each part must be evaluated separately. So, in the next example, the engine still cannot use the index:
SELECT * FROM indexed_samples WHERE station_name = "test 1" OR windangle = 180;
If we rewrite the query like so, it can use the index:
SELECT * FROM indexed_samples WHERE station_name = "test 1" OR (station_name = "test 2" AND station_name = "test 3" AND windangle = 180);
Don’t worry if these rules seem complicated. Later in this article, I’ll show how you can detect whether your query is using indexing.
The concept here is fairly straightforward: when creating a
UNIQUE index, all indexed values must be distinct. This applies to both single- and multi-column indexes. If you try to add a new row with a key value that matches an existing one, an error will occur. Also, if you try to create a unique index for a column with repeated values, the index will not be created, and Error 1062 will be generated. In practice, UNIQUE is used to prevent selecting duplicated rows.
Let’s see this in action. I’ll first try to create a multi-column index for the station_name (with a 100-character prefix) and measure_timestamp columns:
ALTER TABLE indexed_samples ADD UNIQUE INDEX idx_station_timestamp (station_name(100), measure_timestamp);
As expected, the above code fails with
Error Code: 1062. Duplicate entry 'test 2-2014-12-02 00:02:35' for key 'idx_station_timestamp'. This is because the table already contains duplicated values for these two columns. Now, let’s try using the
small_samples table, which stores 10 non-repeated samples:
ALTER TABLE small_samples ADD UNIQUE INDEX idx_station_timestamp (station_name(100), measure_timestamp);
The above query will succeed, and we can now try inserting a repeated value:
INSERT INTO small_samples VALUES('11', '2014-11-30 22:01:26', 'test 2', '5980', '9467', '171', '0.00', '16', '37', '93943.00', '17', '0.00');
The above query fails with
Error Code 1062. Duplicate entry 'test 2-2014-11-30 22:01:26' for key 'idx_station_timestamp'. Again, this is expected.
UNIQUE indexes prevent duplicate records from arising at a database level. If your application wanted to insert a record into uniquely indexed columns for some reason, you could rely on mySQL’s
UNIQUE indexing to protect you from accidental duplicates, while at the same time speeding up searches for records in any indexed columns.
PRIMARY KEY indexes
From an indexing perspective, defining a column or columns as a
PRIMARY KEY is very similar to creating a
UNIQUE index. The main difference is that
PRIMARY KEYs cannot hold null values. However, like
PRIMARY KEYs cannot hold duplicates.
PRIMARY KEYs are usually defined on the smallest set of columns that can uniquely identify a row within your table, such as an id column. For this reason, mySQL optimizes data storage with
PRIMARY KEY. Simply put, the recommendation is that you define a
PRIMARY KEY for your table to improve performance. This is one of the core concepts of database normalization.
Let’s see how we can add a
PRIMARY KEY index to our
small_samples table for the
ALTER TABLE small_samples ADD PRIMARY KEY (station_name (100), measure_timestamp);
The query succeeds. Let’s now try to add another PRIMARY KEY to the id column to see what happens:
ALTER TABLE small_samples ADD PRIMARY KEY (id); Error Code: 1068. Multiple primary key defined
This reveals another core deviation from
UNIQUE indexes: you are not allowed to have more than one
PRIMARY KEY per table.
Let’s remove the existing
ALTER TABLE small_samples DROP PRIMARY KEY;
And then try to add the id index again:
ALTER TABLE small_samples ADD PRIMARY KEY (id);
Now, let’s confirm that we cannot insert duplicate entries or null values into the id column, as the index has been set as a
INSERT INTO small_samples (id) VALUES (null); Error Code: 1048. Column 'id' cannot be null
INSERT INTO small_samples (id) VALUES (10); Error Code: 1062. Duplicate entry '10' for key 'PRIMARY'
Both of these tests fail, as expected, preventing us from having duplicates or null values in column(s) that are set as
EXPLAIN and SELECT optimizations
The best way to prove that your query is properly using the expected indexes is to use the
EXPLAIN statement on the
SELECT query you intend to analyze. Although a full optimization tutorial would be outside the scope of this article, I’d like to give you a quick example so you can see its power.
EXPLAIN a query that grabs samples whose wind speed values equal to “7000” to see what happens when we use the unindexed table:
EXPLAIN SELECT wind_mtsperhour = 7000 FROM unindexed_samples;
This query will return the following:
As it stands, the query has no possible indexes to use, so a full table scan must be performed on all rows. However, if we run the same query on the indexed table:
EXPLAIN SELECT wind_mtsperhour = 7000 FROM indexed_samples;
We get the following feedback:
As you can see, the
extras column tells us that this query does indeed rely on indexing. Thus, the query need not read through the entire table to find the matching record, as the
index_windspeed column already contains all the required information. The query can simply grab the records it needs directly from that column in a snap.
EXPLAIN on multicolumn index
Earlier in this article, I listed the criteria for using multicolumn indexes and mentioned that a
SELECT statement that uses an
OR can be modified to use indexes. Let’s
EXPLAIN those queries to further understand the concept:
EXPLAIN SELECT * FROM indexed_samples WHERE station_name = "test 1" OR windangle = 180;
This is what is returned:
However, since the second part of our OR clause uses a condition that does not comply with the multi-column indexing criteria I mentioned earlier, this query will not rely on any indexes and will thus have to scan through all 1.7 million rows.
Now, if we
EXPLAIN the query that includes all the indexed columns as I suggested earlier, we can clearly observe the improvement in performance:
EXPLAIN SELECT * FROM indexed_samples WHERE station_name = "test 1" OR (station_name = "test 2" AND station_name = "test 3" AND windangle = 180);
Here’s the feedback we get:
In this case, the index is used, and 50% less rows need to be scanned.
The key concept to keep in mind is that indexing speeds up your queries and helps you add an extra layer of consistency when used in combination with
PRIMARY KEY. The columns you are most likely going to use for your
SELECTs are ideal candidates for indexing. Also, as I briefly showed,
EXPLAIN can offer useful hints and insights into whether your query is properly using indexing in its searches.
Keep in mind that indexing also comes at hidden a cost: updating indexes as you insert/update/delete data impacts server performance. Choosing the best index, if any, will also consume mySQL processing time. Therefore, you should be extremely picky when choosing indexes and only use them where they are needed.