This article reviews optimal placement of clustered and nonclustered indexes on OLTP databases, and explains how filtered indexes can be used to improve performance.
By default, SQL Server will create the table’s clustered index during the creation of the primary key:
CREATE TABLE PrimaryKeyTest (MyPK INT PRIMARY KEY) GO
SELECT * FROM Sys.Indexes WHERE Object_ID = Object_ID ('PrimaryKeyTest')
This can be overridden by specifying the NONCLUSTERED keyword during creation:
CREATE TABLE PrimaryKeyNonClusteredTest (MyPK INT PRIMARY KEY NONCLUSTERED) GO
SELECT * FROM Sys.Indexes WHERE Object_ID = Object_ID ('PrimaryKeyNonClusteredTest')
The HEAP exists because the table does not have a clustered index defined.
It’s a common misconception that the primary key and clustered index are bound together; the primary key is a special type of unique constraint (it does not allow NULL values, while normal unique constraints do). The Clustered index is a strictly-ordered B-tree intended to maximize scan performance. Only one clustered index per table is permitted.
Although SQL Server defaults the clustered index to the primary key, it is not always the optimal selection. The best choice for the clustered index is selective (many distinct values), narrow (few columns), changes infrequently (if at all), and is often used in queries that scan a range sequentially. This is something that’s rarely done with a primary key; it makes little sense to retrieve a range of values based on a surrogate or business key. The narrowness is important because all non-clustered indexes use the clustered index as the lookup key when accessing data, so bloating the clustered index will do the same to nonclustered indexes. Changes to the clustered index keys cause table re-ordering and fragmentation.
An ideal example of a clustered index use would be on a Sales table where the sale date is recorded in a datetime field. This would not change, would be highly selective, and often used in range queries involving ORDER/GROUP BY.
Markus Winand has a highly detailed explanation of how SQL Server implements clustered indexes.
Unlike clustered indexes, nonclustered indexes contain pointers to the actual data, rather than physically storing the data. Because of this, multiple indexes can be created on the same table (up to 1,000 total). Although there is a performance hit during DML operations to update nonclustered indexes, the benefits greatly outweigh the downsides. They should be used in the following scenarios:
- On any column often accessed by a WHERE clause
- On any column often used in JOIN or GROUP BY clauses
Additionally, Included columns can be added to the nonclustered index for nonkey columns. These are columns referenced in the SELECT portion of the query, rather than the WHERE clause. By adding these as Included columns, the entire query can be satisfied by the index, rather than necessitating additional I/O.
Filtered indexes are a subtype of nonclustered indexes introduced in SQL Server 2008. These indexes have a WHERE clause in the index definition intended to reduce the total number of rows included in the index. Fewer rows means less I/O while accessing the index.
Here’s a simplified example to illustrate potential performance gains.
SalesFigures is created just to hold some monetary data:
CREATE TABLE SalesFigures (SalesID INT IDENTITY, SalesAmount Money)
Now it gets populated with a reasonable amount of data (About 1.4 million rows):
DECLARE @i INT = 1, @Finish INT = 100000 WHILE (@i < @Finish) BEGIN Insert SalesFigures (SalesAmount) VALUES (100), (200), (300), (350), (375), (400), (425), (478), (512), (550), (700), (750), (900), (1000) Set @i+=1 END
Let’s also create a standard nonclustered index for comparison purposes:
CREATE NONCLUSTERED INDEX IX_SalesFigures_SalesAmount ON SalesFigures (SalesAmount) GO
In our simplified example, we’ll assume that the Sales value between 400 and 800 are by far the most interest during reporting:
SELECT SalesAmount FROM SalesFigures WHERE SalesAmount BETWEEN 400 AND 800
Our I/O costs are ~1.4.
Next we’ll drop our original index, and create a filtered one using the values 400 and 800 as ranges for the where clause.
DROP INDEX SalesFigures.IX_SalesFigures_SalesAmount GO CREATE NONCLUSTERED INDEX IX_SalesFigures_SalesAmount ON SalesFigures (SalesAmount) WHERE SalesAmount >= 400 and SalesAmount <= 800 GO
Under the hood, the filter predicate is stored directly in Sys.Indexes:
SELECT filter_definition, * FROM Sys.Indexes WHERE Name = 'IX_SalesFigures_SalesAmount'
Re-running our same query produces the following:
I/O costs have dropped to ~.7 – about 50% of the standard nonclustered index. Even in this highly simplified example, the performance improvements are apparent.
There are some limitations to the search predicates that are available – CASE Statements, NOT IN, the BETWEEN keyword, and date functions cannot be used in the definition of a filtered index.
Sparse Columns and Filtered Indexes
Sparse is a column-level option that can be applied during table creation or alteration. Specifying the Sparse keyword during table creation minimizes the space requirements for NULL values for that column. This option can be used on most SQL Server datatypes (excluded are text, ntext, image, timestamp, geometry, geography, timestamp, and user-defined types). The trade-off is that non-NULL values will require slightly more overhead when read.
Sparse columns and filtered indexes are often used together to maximize storage and performance where significant NULL values are expected for a particular column. The following syntax creates a table with a Sparse column, then creates a filtered index on that column specifying only non-NULL values:
CREATE TABLE MoreSalesFigures (SalesID INT IDENTITY, SalesAmount Money, ThisColumnIsNULLOften Varchar(100) SPARSE) GO
CREATE NONCLUSTERED INDEX IX_MoreSalesFigures_ThisColumnIsNULLOften ON MoreSalesFigures (ThisColumnIsNULLOften) WHERE ThisColumnIsNULLOften IS NOT NULL GO
This minimizes storage requirements for NULL values while maximizing performance when reading non-NULL values. In addition to optimizing NULL values, Sparse columns optimize zeroes in the same way. There is also no realistic limit to how many Sparse columns can be defined on a table (the actual limit is 100,000).