The SQL Language seems easy to learn – the commands follow a simple syntax and don’t describe the specific algorithms used to retrieve data. The simplicity might, however, be deceptive. Not all the database functions operate with the same efficiency. Two very similar queries can vary significantly in terms of the computation time. This article presents some of the best practices that can greatly boost your SQL queries.
1. Learn How to Create Indexes Properly
Learning how to index properly is the best thing you can do to improve the performance of your SQL queries. Indexes allow for quicker access to the database in typical situations. Database novices often find indexes mysterious or difficult. They either index nothing or they try to index everything. Of course, neither of these approaches is right. With no indexes at all, your queries are likely to be slow. If you index everything, your updates and insert triggers will be inefficient.
If you’re not sure that you know enough about indexes, take @MarkusWinand’s three minute-test. Markus’ website “Use the Index, Luke” is the best starting point to learn about indexes. He also published an excellent book on the subject.
2. Only Retrieve the Data You Really Need
A common way of retrieving the desired columns is to use the
* symbol even though not all the columns are really needed:
SELECT * FROM users WHERE age > 20;
If the table is small, retrieving additional columns won’t make much of a difference. For larger data sets, however, specifying the columns might save a lot of computation time:
SELECT id, last_name, sex, age FROM users WHERE age > 20;
Keep in mind, however, that many popular ORM will not let you to create a query which selects only a subset of table columns.
Similarly, if you only need a limited number of rows you should use the
LIMIT clause (or your database’s equivalent). Take a look at the following code:
SELECT name, price FROM products;
For instance, if you only want to display the first 10 records out of 50,000 on your website, it is advisable to inform the database about it. This way, the database will stop the search after finding 10 rows rather than scan the whole table:
SELECT name, price FROM products LIMIT 10;
LIMIT statement is available in MySQL and Postgresql, but other databases have ways to achieve a similar effect.
These above examples illustrate the general idea – you should always think whether you need all the rows returned by an SQL statement. If you don’t, there is always some room for improvement.
3. Avoid Functions on the Left Hand-Side of the Operator
Functions are a handy way to provide complex tasks and they can be used both in the
SELECT clause and in the
WHERE clause. Nevertheless, their application in
WHERE clauses may result in major performance issues. Take a look at the following example:
SELECT nickname FROM users WHERE DATEDIFF(MONTH, appointment_date, '2015-04-28') < 0;
Even if there is an index on the appointment_date column in the table users, the query will still need to perform a full table scan. This is because we use the
DATEDIFF function on the column appointment_date. The output of the function is evaluated at run time, so the server has to visit all the rows in the table to retrieve the necessary data. To enhance performance, the following change can be made:
SELECT nickname FROM users WHERE appointment_date > '2015-04-30';
This time, we aren’t using any functions in the
WHERE clause, so the system can utilize an index to seek the data more efficiently.
4. Consider Getting Rid of Correlated Subqueries
A correlated subquery is a subquery which depends on the outer query. It uses the data obtained from the outer query in its
WHERE clause. Suppose you want to list all users who have made a donation. You could retrieve the data with the following code:
SELECT user_id, last_name FROM users WHERE EXISTS (SELECT * FROM donationuser WHERE donationuser.user_id = users.user_id);
In the above case, the subquery runs once for each row of the main query, thus causing possible inefficiency. Instead, we can apply a join:
SELECT DISTINCT users.user_id FROM users INNER JOIN donationuser ON users.user_id = donationuser.user_id;
If there are millions of users in the database, the statement with the correlated subquery will most likely be less efficient than the
INNER JOIN because it needs to run millions of times. But if you were to look for donations made by a single user, the correlated subquery might not be a bad idea. As a rule of thumb, if you look for many or most of the rows, try to avoid using correlated subqueries. Keep in mind, however, that using correlated subqueries might be inevitable in some cases.
5. Avoid Wildcard Characters at the Beginning of a
Whenever possible, avoid using the
LIKE pattern in the following way:
SELECT * FROM users WHERE name LIKE '%bar%';
The use of the
% wildcard at the beginning of the
LIKE pattern will prevent the database from using a suitable index if such exists. Since the system doesn’t know what the beginning of the name column is, it will have to perform a full table scan anyway. In many cases, this may slow the query execution. If the query can be rewritten in the following way:
SELECT * FROM users WHERE name LIKE 'bar%';
then the performance may be enhanced. You should always consider whether a wildcard character at the beginning is really essential.
Bonus Tip – Read the Execution Plan
The performance of your SQL queries depends on multiple factors, including your database model, the indexes available and the kind of information you wish to retrieve. The best way to keep track of what’s happening with your queries is to analyse the execution plan produced by the optimizer. You can use it to experiment and find the best solution for your statements. If you use PostgreSQL, the article about execution plans on our blog may be of interest to you.