Back to articles list

October 7, 2014 - 5 minutes read

Design fundamentals

What Is a Database Index?

Michał Kołodziejski

Senior Software Engineer

Tags:

Sooner or later there comes a moment when database performance is no longer satisfactory. One of the very first things you should turn to when that happens is database indexing. This article will give you a general overview on what indexes are without digging into too much detail. We’ll discuss additional database index topics in future articles.

In general, a database index is a data structure used to improve queries execution time. To explain what an index is, we need to say a few words on how the data stored in tables is organized.

Tables may be organized in two ways:

heap tables
index-organized tables (IOTs), or clustered indexes (different vendors use different terminology, but the concept is common among them)

For now, let’s talk about heap tables only. We’ll get back to index-organized tables after we explain what an index is.

Understanding Heap Tables

In heap tables, data is stored in no particular order. New data is inserted without sorting or reorganizing previously inserted data in any way. This makes an insert operation execute very quickly but is inefficient when retrieving data.

For example, let’s say we have a table defined as above which contains one million rows. The table has no constraints and no indexes. The insert operation will be immediate, taking at most a few milliseconds of time:

test=> insert into people values ('111-22-3334', 'Mike', 'Fake'); 
INSERT 0 1

Time: 2,991 ms

But a query retrieving a person by SSN will take much more time:

test=> select * from people where ssn = '111-22-3334'; 
     ssn     | first_name | last_name 
-------------+------------+----------- 
 111-22-3334 | Mike       | Fake 
(1 row) 

Time: 132,499 ms

This is because the data is not structured in any specific way. In order to fetch the required rows, the DB needs to scan the whole table. The query execution time will increase linearly with the number of rows in the table.

Creating an Index

In this simple scenario, we can speed up the query execution time by creating an index on the column by which we filter the results: ssn. Here’s how to do it:

test=> create index ssn_idx on people (ssn); 
CREATE INDEX

Time: 6960,246 ms

Now, let’s take a look at how fast the same query will execute:

test=> select * from people where ssn = '111-22-3334'; 
     ssn     | first_name | last_name 
-------------+------------+----------- 
 111-22-3334 | Mike       | Fake 
(1 row) 

Time: 0,375 ms

Well, 0.375 ms opposed to 132.499 ms gives us... more than 350 times faster querying! The difference is huge.

This time the DB used the index. The index holds the ssn column values and pointers to the corresponding table rows in a well-organized data structure. When executing the query, the index structure is traversed to collect all pointers to rows that match the filter predicate. After that, those rows are retrieved from the table and returned. This huge performance gain is due to the fact that not entire index structure but only part of it is visited. In this particular case, a very small part.

What’s important is that the previously created index will be used only for queries filtering by ssn column. Take a look:

test=> select * from people where ssn = '111-22-3334' or last_name = 'Johnson';

Time: 238,426 ms

The execution time got significantly worse. This is because the DB needs to scan the whole table to find all people called “Johnson” and thus using the index is insufficient (and even pointless). To make such a query run faster we’d need to create a second index: on column last_name:

test=> create index ln_idx on people (last_name); 
CREATE INDEX

Time: 13234,828 ms

Now, the query executes much faster:

test=> select * from people where ssn = '111-22-3334' or last_name = 'Johnson'; 

Time: 7,207 ms

This time, the DB traversed both indexes, joined pointers into one structure and then retrieved the actual results from the table.

Excercise

If we changed “or” with “and” so that the query looked like this:

test=> select * from people where ssn = '111-22-3334' 
		and last_name = 'Johnson';

we need only the first index (on ssn column) to have this query executed in less than 0.5 ms. Why? Does existence of the second index change anything in a way this query executes?

Index-Organized Tables

Now, let’s take a look at index-organized tables (IOTs). IOTs are simply indexes which contain data from all table columns, not just a few of them. This way, the default heap structure is no longer needed. This has both advantages and disadvantages, but we won’t discuss them in this article.

Using indexes is not limited to heap tables only. You can create indexes both for heap tables and index-organized tables. You can have as many indexes as you need to, but there shouldn’t be too many of them per table. This is because of their overhead during insert, update and delete operations. For each of these operations, every index structure needs to be updated; this costs time.

As you can see, a database index is a very powerful mechanism to optimize query execution time. However, proper database indexing is a much more complex issue than described here. It requires more detailed knowledge about index structure and some rules of thumb you should follow when creating indexes. We’ll describe the finer details about database indexes in future articles so stay tuned!

Tags:

Understanding Heap Tables

Creating an Index

Index-Organized Tables

You may also like

SQL Performance Explained – the must-read book