Database indexes speed up data retrieval operations. But there is a price we pay for these benefits.
In this article, we’ll focus on the structure behind the MySQL index. We’ll also measure database performance by using large datasets to test two versions of the same database – one with indexes and the other without them.
This is the second article in our series about database indexes. If you missed the first one, click here.
What Structures Do Indexes Use?
MySQL supports a few different index types. The most important are BTREE and HASH. These types are also the most common types in other DBMSs.
Before we start describing index types, let’s have a quick review of the most common node types:
- Root node – the topmost node in the tree structure
- Child nodes – nodes that another node (the parent node) points to
- Parent nodes – nodes that point to other nodes (child nodes)
- Leaf nodes – nodes that have no child nodes (located on the bottom of the tree structure)
- Internal nodes – all “non-leaf” nodes, including the root node
- External nodes – another name for leaf nodes
Binary Search Trees
Now let’s consider the binary search tree (BST) structure. These structures are not database indexes, but it’s good to understand how they work. You’ll soon see that the tree-structures (B-Tree and B+Tree) used by databases are similar.
A BST is a binary tree in which the nodes are sorted. In the picture above, we see that the values on the left of node X are smaller than X, and the values on the right of node X are greater. This is how any BST is organized.
When we search for a value in the BST, we start from the root node and compare the searched value with the value of the root node. If the searched value is lower than the root value, we’ll go to the left subtree. If it’s higher, then we go to the right subtree. If we find the value, we’re done. If we reach the leaf node without finding our value, we know that it’s not in our tree.
Adding a new value to the BST starts just like searching for a value. (Once we find the right place for our value, of course, the process changes.) For example, if we want to add the value “30”, we would add it as a left child node of the node “31” (30 is greater than 27, so we go right; 30 is less than 35, so we go left; 30 is less than 31, so we go left again.)
When we want to delete a value from the BST, the situation is more complicated. We may even need to rearrange the tree in some cases. There are three possible situations:
- Deleting a node that has no child nodes – We’ll simply remove the node from the BST. In our example, these are the nodes“10”, “19”, “31” and “42”.
- Deleting a node that has one child node – We’ll replace the deleted node with the child node. If node “31” has “30” as a child node in the left subtree and we delete it, we’ll simply replace “31” with “30”.
- Deleting a node that has two child nodes – This is the most complex case. In this situation, we’ll find the smallest element in the right subtree of the deleted node and put it in place of the deleted node. So if we delete node “27”, we’ll look for the minimum value in the right subtree and that is “31”. We’ll replace the node “27” with the node “31”.
The B-Tree is the basic index structure for most MySQL storage engines. Each node in a B-Tree has between d and 2d values. Values in each node are sorted. Each node has between 0 to 2d+1 child nodes. Each child node is attached before, after, or between values. (In the above graphic, values “9” and “12” come between values “7” and “16”.)
The values in B-Tree are sorted in a similar way to those in a binary search tree. Child nodes to the left of value “X” have values smaller than X; child nodes to the right of the value “X” have values greater than X (see the picture).
In contrast to a BST, a B-Tree is a balanced tree: all branches of the tree have the same length.
Searching for values in a B-Tree also corresponds to searching in a BST. First we check if the value is present in the root node. If it isn’t, we select the appropriate child node, and look for the value in that node.
Adding and removing values from a B-tree usually does not create new nodes: the number of values in each node can vary. Of course, that means we’ll have some empty space, so a B-tree will require more disk space than a denser tree would.
Adding a value has to maintain both the order of value and the balance of the tree. First we’ll find the leaf node where the values should be added. If there is enough space in the leaf node, we’ll simply add the value; the structure and the tree depth won’t change.
In our example, adding the value “22” would be extremely simple. We would simply add it to the third leaf node. If we want to add the value “4”, we would split the first leaf node. This always happens in a case of overflow. Values “1” and “2” would form a new leaf node (the left one); value “4” would be the middle node (and inserted into the parent node), while values “5” and “6” would form another new leaf node (the right one).
We would split the parent node in the same manner if an overflow happens there. In extreme cases, we’d have to split the root node and the tree depth would increase.
When we want to delete a value from the B-Tree, we’ll locate that value and remove it. If that deletion caused underflow (the number of values stored in a node is too low) we’ll have to merge nodes together. This is exactly the opposite from what happens when we’re adding values. We’ll merge that node with the neighboring node that contains the most values. If the newly merged node contains too many values (overflow) we’ll split it to the left, middle, and right and rearrange the tree structure.
When B-Tree is used in database index, each node holds the index value and the pointer to the row it comes from. For example, if the index is on the column id, then the tree will hold both the value "7" and a pointer to the row where id=7. When we search for the row where id=7, we can quickly find the value 7 in the B-Tree and then follow the pointer to get the actual row.
The important thing to remember is that in the B-tree structure each node contains key values, pointers to the values, and pointers to child nodes with smaller and larger values than the value in the parent node. Of course, leaf nodes don’t point to any child nodes.
With a B-Tree index, we can use equality operators (= or <=>), operators that will result in a range (>, <, >=, <=, BETWEEN), and the LIKE operator.
We can create an index on more than one column, too. If we created an index on a (
last_name) pair, we could search using either term, but the index will also work when we’re using just one of these attributes.
FInally, it’s good to note that the B-tree will perform better when the data accessed most frequently lies closer to the root node.
The B+Tree structure is similar to the B-Tree structure. The most important differences are:
- The internal nodes only store values; they don’t store pointers to actual rows. The leaf nodes store the values and row pointers. This reduces the internal nodes’ size, allowing the storage of more nodes on the same memory page. In turn, this increases the branching factor. As the branching factor grows, the height of the tree decreases and that results in fewer disk I/O operations.
- Leaf nodes in the B+Tree are linked, so we can conduct a full scan with only one pass. That is very helpful when we need to find all the data in a given range. Because row pointers are stored in both internal and leaf nodes, this isn’t possible in a B-Tree.
- Performing delete operations in a B+tree structure is much easier than in a B-tree. This is because we don’t need to remove values from the internal nodes. In a B+Tree structure, we’ll have the same value repeated; in a B-Tree structure, each value is unique. In B+Tree, we’ll store one value and one data pointer in the leaf node, but that value could also be stored in internal nodes (where it’s used to point to child nodes).
- The advantage of B-Tree is that we can find values that lie close to the root fairly quickly, while in B+Tree we would need to look all the way down to the leaf nodes for any value.
The InnoDB storage engine uses a B+Tree structure to store indexes.
Hash indexes are directly related to the hashing technique. Look at the picture above. On the left side, we see the set of key values used to find data. In this case, they are numeric values. The hash function is used to calculate the address where the actual data bucket is stored. That will give us the location of records related to each key value.
Values are stored in buckets based on the value of a hash function. When we want to search for a value, we’ll use the hash function to calculate the address where our data could be stored. We’ll look for the data in the bucket. If we find it, we’re done. If we don’t find our value, it means it’s not in the index.
Adding new values works similarly: we’ll use the hash function to calculate the address where we’ll store our data. If that address is already occupied, we’ll add new buckets and re-compute the hash function. Once again, we’ll use the whole key as an input for our function. The result is the actual address (in disk memory) where we can find the desired data.
Updating or deleting values consists of first searching for a value and then applying the desired operation on that memory address.
Hash table indexes are very fast when testing for equality (= or <=>). This is because we’re using the whole key and not just its parts. Individual parts can’t help us when we want to find range, use the < or > operators, or speed up the ORDER BY part of the clause.
Index Types in MySQL, MyISAM, and Other Engines
MySQL uses both BTREE (B-Tree and B+Tree) and HASH indexes. MyISAM use only BTREE indexes while MEMORY/HEAP and NDB can use both HASH and BTREE. MEMORY/HEAP and NDB will use the HASH index structure by default.
If we want to specify an index structure, we need to add USING BTREE or USING HASH in the CREATE INDEX statement. In Vertabelo, we do this by setting the ‘Using’ value in the ‘Indexes’ section.
Performance Testing: The Databases and The Expected Results
We’ll test the performance of two identical databases, one that uses indexes and one that doesn’t. For the sake of simplicity, we’ll refer to these as the indexed and unindexed databases, even though both databases have indexes on their primary keys. The structure of both databases is presented in the above model and detailed in Part 1 of this series.
In the test databases, I’ve added 1,000,000 records to the
In the first article, we created three indexes using following statements:
CREATE INDEX book_details_idx_1 ON book_details (book_title); CREATE INDEX author_idx_1 ON author (first_name,last_name); CREATE INDEX author_idx_2 ON author (last_name,first_name);
To remove these indexes from the second database, we used the following statements:
DROP INDEX book_details_idx_1 ON book_details; DROP INDEX author_idx_1 ON author; DROP INDEX author_idx_2 ON author;
The database using indexes should also use more disk space. Remember that indexes are structural and store copies of the values from the indexed attributes and pointers to the records. It is expected that queries used to find and sort data (SELECT, ORDER BY) will perform faster on the indexed database than the unindexed one. However, the unindexed database should perform better when we’re using INSERT, DELETE and UPDATE queries.
Performance Testing: Database Size
Each MySQL server instance stores all the relevant data about every database inside the information_schema database. That is where we can find all the relevant data about databases on server, tables, columns, views, triggers, etc.
We’ll query the information_schema database to find the sizes of our two example databases:
SELECT ROUND(SUM(information_schema.tables.data_length + information_schema.tables.index_length) / (1024 * 1024), 2) AS "MB" FROM information_schema.tables WHERE information_schema.tables.table_schema = "indexes_performance";
As expected, the size of the indexed database is 315 MB, while the unindexed database is 289 MB. This might not seem like much of a difference, but it’s roughly 10% and we’ve used only three indexes.
We’ll take a more detailed look at both databases with the following query:
SELECT a.table_name, a.MB AS "table size; data + index (MB) / INDEX", a.index_MB AS "index size (MB) / INDEX", b.MB AS "table size; data + index (MB) / NO INDEX", b.index_MB AS "index size (MB) / NO INDEX" FROM ( SELECT information_schema.tables.table_name AS table_name, ROUND(SUM(information_schema.tables.data_length + information_schema.tables.index_length) / (1024 * 1024), 2) AS "MB", ROUND(SUM(information_schema.tables.index_length) / (1024 * 1024), 2) AS "index_MB" FROM information_schema.tables WHERE information_schema.tables.table_schema = "indexes_performance" GROUP BY information_schema.tables.table_name ) AS a LEFT JOIN ( SELECT information_schema.tables.table_name AS table_name, ROUND(SUM(information_schema.tables.data_length + information_schema.tables.index_length) / (1024 * 1024), 2) AS "MB", ROUND(SUM(information_schema.tables.index_length) / (1024 * 1024), 2) AS "index_MB" FROM information_schema.tables WHERE information_schema.tables.table_schema = "indexes_performance_no" GROUP BY information_schema.tables.table_name ) AS b ON a.table_name = b.table_name
The result is:
The first column is the table name, which is the same for both databases. The second and third columns belong to the indexed database, while the fourth and fifth columns belong to the unindexed database. The second and fourth columns are the sum of the data size and the index size, while the third and fifth columns are the index size.
As expected, the
author tables are larger in the indexed database. We can also see that even in a database without custom indexes added in, about 110 MB is reserved for indexes. This is the size of the indexes created over primary key attributes. Almost the entire difference in database sizes comes from the custom index added on the
Performance Testing: Query Speed
We’ll use the same databases, but this time we’ll measure the time needed to execute three SELECT queries, plus one stored procedure that includes many data changes.
The first query we’ll use is pretty simple:
SET @start_time = NOW(); SELECT book_details.book_title, COUNT(*) AS result FROM book_details GROUP BY book_details.book_title; SET @end_time = NOW(); SELECT TIMEDIFF(@end_time, @start_time) AS difference;
To return the time needed to execute our query, @start_time and @end_time are set just before and immediately after the query. Their difference will tell us how long this operation took.
The indexed database returned a result 17 times faster than the unindexed database. The unindexed version is compelled to conduct a sequential search for the information, which accounts for the time difference.
Tip: MySQL supports the following statements:
- DESCRIBE – usually used to describe table structure
- EXPLAIN and EXPLAIN EXTENDED – used to describe a query execution plan
- SHOW WARNINGS –used to display information about conditions (errors, warnings, etc.) after executing an observed statement
DESCRIBE, EXPLAIN and EXPLAIN EXTENDED are placed in front of the query. SHOW WARNINGS should be placed right after it.
The second query we’ll use is more complex and we can expect that both databases’ execution times will increase:
SET @start_time = NOW(); SELECT author.first_name, author.last_name, COUNT(*) FROM book_details, book_author, author WHERE book_details.id = book_author.book_details_id AND book_author.author_id = author.id GROUP BY author.first_name, author.last_name ORDER BY author.first_name, author.last_name; SET @end_time = NOW(); SELECT TIMEDIFF(@end_time, @start_time) AS difference;
In this case, the indexed database gave us the result two times faster than the unindexed database.
We’ll modify the second SELECT query a little, adding a selection condition:
SET @start_time = NOW(); SELECT author.first_name, author.last_name, COUNT(*) FROM book_details, book_author, author WHERE author.first_name = "Ernest" AND book_details.id = book_author.book_details_id AND book_author.author_id = author.id GROUP BY author.first_name, author.last_name ORDER BY author.first_name, author.last_name; SET @end_time = NOW(); SELECT TIMEDIFF(@end_time, @start_time) AS difference;
Once again, the indexed database returned its result two times quicker than the unindexed database. This is due to the fact that I’ve used indexed columns; the GROUP BY and ORDER BY clauses use an index when it’s available. Of course, that index must contain the same columns in the same order they are used in the GROUP BY and ORDER BY clauses.
The result of all three test cases is as expected. Indexes improve performance significantly when we’re using SELECT queries and when we need to ORDER returned values.
- The last performance test executes a stored procedure that deletes 10,000 rows from
book_authortables; generates random values; and then adds 10,000 rows in all four tables. I won’t paste in the procedure code because it’s simply too large (and pointless) to do so.
This procedure required 50% more time on the indexed database. This highlights the fact that an indexed database needs time to modify the index structure when we’re adding, changing or deleting records.
Indexes are very powerful database structures. You probably won’t need them when handling smaller datasets. But as things become complicated, indexes play a crucial role in overall performance. We’ve presented three commonly-used index structures, and we’ve considered their advantages and disadvantages. During testing, we’ve shown that indexes lead to better performance when retrieving and changing data.