Databases need to run optimally, but that’s not such an easy task. The INFORMATION SCHEMA database can be your secret weapon in the war of database optimization.
We’re used to creating databases using a graphical interface or a series of SQL commands. That’s completely fine, but it’s also good to understand a bit about what is going on in the background. This is important for the creation, maintenance, and optimization of a database, and it’s also a good way to track changes that happen ‘behind the scenes’.
In this article, we’ll look at a handful of SQL queries that can help you peer into the workings of a MySQL database.
The INFORMATION_SCHEMA Database
We already discussed the
INFORMATION_SCHEMA database in this article. If you haven’t already read it, I would definitely suggest you do that before continuing.
If you need a refresher on the
INFORMATION_SCHEMA database – or if you decide not to read the first article – here are some basic facts you need to know:
INFORMATION_SCHEMAdatabase is part of the ANSI standard. We’ll be working with MySQL, but other RDBMSs have their variants. You can find versions for H2 Database, HSQLDB, MariaDB, Microsoft SQL Server, and PostgreSQL.
- This is the database that keeps track of all other databases on the server; we’ll find descriptions of all objects here.
- Like any other database, the
INFORMATION_SCHEMAdatabase contains a number of related tables and information about different objects.
- You can query this database using SQL and use the results to:
- Monitor database status and performance, and
- Automatically generate code based on query results.
Now let’s move on to querying the INFORMATION_SCHEMA database. We’ll start by looking at the data model we’re going to use.
The Data Model
The model we’ll use in this article is shown below.
This is a simplified model that allows us to store information about classes, instructors, students, and other related details. Let’s briefly go over the tables.
We’ll store the list of instructors in the
lecturer table. For each lecturer, we’ll record a
first_name and a
class table lists all the classes we have in our school. For each record in this table, we’ll store the
class_name, the lecturer’s ID, a planned
end_date, and any additional
class_details. For the sake of simplicity, I’ll assume that we have only one lecturer per class.
Classes are usually organized as a series of lectures. They generally require one or more exams. We’ll store lists of related lectures and exams in the
exam tables. Both will have the ID of the related class and the expected
Now we need students for our classes. A list of all students is stored in the
student table. Once again, we’ll store only the
first_name and the
last_name of each student.
The final thing we need to do is to track students’ activities. We’ll store a list of every class a student registered for, the student’s attendance record, and their exam results. Each of the remaining three tables –
on_exam – will have a reference to the student and a reference to the appropriate table. Only the
on_exam table will have an additional value: grade.
Yes, this model is very simple. We could add many other details about students, lecturers, and classes. We could store historical values when records are updated or deleted. Still, this model will be enough for the purposes of this article.
Creating a Database
We’re ready to create a database on our local server and examine what is happening inside it. We’ll export the model (in Vertabelo) using the “
Generate SQL script” button.
Then we’ll create a database on the MySQL Server instance. I called my database “
The next thing we need to do is to run a previously-generated SQL script.
Now we have the database with all its objects (tables, primary and foreign keys, alternate keys).
After the script runs, data about the “
classes and students” database is stored in the
INFORMATION_SCHEMA database. This data is in many different tables. I won’t list them all again here; we did that in the previous article.
Let’s see how we can use standard SQL on this database. I’ll start with one very important query:
SET @table_schema = "classes_and_students"; SELECT ROUND(SUM( INFORMATION_SCHEMA.TABLES.DATA_LENGTH + INFORMATION_SCHEMA.TABLES.INDEX_LENGTH ) / 1024 / 1024, 2) AS "DB Size (in MB)", ROUND(SUM( INFORMATION_SCHEMA.TABLES.DATA_FREE )/ 1024 / 1024, 2) AS "Free Space (in MB)" FROM INFORMATION_SCHEMA.TABLES WHERE INFORMATION_SCHEMA.TABLES.TABLE_SCHEMA = @table_schema;
We’re only querying the
INFORMATION_SCHEMA.TABLES table here. This table should give us more than enough details about all tables on the server. Please note that I’ve filtered only tables from the "
classes_and_students" database using the
SET variable in the first line and later using this value in the query. Most tables contain the columns
TABLE_SCHEMA, which denote the table and schema/database this data belongs to.
This query will return the current size of our database and the free space reserved for our database. Here is the actual result:
As expected, the size of our empty database is less than 1 MB, and the reserved free space is much greater.
Table Sizes and Properties
The next interesting thing to do would be to look at the sizes of the tables in our database. To do so, we’ll use the following query:
SET @table_schema = "classes_and_students"; SELECT INFORMATION_SCHEMA.TABLES.TABLE_NAME, ROUND(SUM( INFORMATION_SCHEMA.TABLES.DATA_LENGTH + INFORMATION_SCHEMA.TABLES.INDEX_LENGTH ) / 1024 / 1024, 2) "Table Size (in MB)", ROUND(SUM( INFORMATION_SCHEMA.TABLES.DATA_FREE )/ 1024 / 1024, 2) AS "Free Space (in MB)", MAX( INFORMATION_SCHEMA.TABLES.TABLE_ROWS) AS table_rows_number, MAX( INFORMATION_SCHEMA.TABLES.AUTO_INCREMENT) AS auto_increment_value FROM INFORMATION_SCHEMA.TABLES WHERE INFORMATION_SCHEMA.TABLES.TABLE_SCHEMA = @table_schema GROUP BY INFORMATION_SCHEMA.TABLES.TABLE_NAME ORDER BY 2 DESC;
The query is almost identical to the previous one, with one exception: the result is grouped on the table level.
Here is a picture of the result returned by this query:
First, we can notice that all eight tables have a minimal “Table Size” reserved for table definition, which includes the columns, primary key, and index. The “Free Space” is equally distributed between all tables.
We can also see the number of rows currently in each table and the current value of the auto_increment property for each table. Since all tables are completely empty, we have no data and auto_increment is set to 1 (a value that will be assigned to the next inserted row).
Each table should have a primary key value defined, so it is wise to check if this is true for our database. One way to do this is by joining a list of all tables with a list of constraints. This should give us the info we need.
SET @table_schema = "classes_and_students"; SELECT tab.TABLE_NAME, COUNT(*) AS PRI_number FROM INFORMATION_SCHEMA.TABLES tab LEFT JOIN ( SELECT INFORMATION_SCHEMA.COLUMNS.TABLE_SCHEMA, INFORMATION_SCHEMA.COLUMNS.TABLE_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE INFORMATION_SCHEMA.COLUMNS.TABLE_SCHEMA = @table_schema AND INFORMATION_SCHEMA.COLUMNS.COLUMN_KEY = 'PRI' ) col ON tab.TABLE_SCHEMA = col.TABLE_SCHEMA AND tab.TABLE_NAME = col.TABLE_NAME WHERE tab.TABLE_SCHEMA = @table_schema GROUP BY tab.TABLE_NAME;
We’ve also used the
INFORMATION_SCHEMA.COLUMNS table in this query. While the first part of the query will simply return all tables in the database, the second part (after
LEFT JOIN) will count the number of PRIs in these tables. We used
LEFT JOIN because we want to see if a table has 0 PRI in the
As expected, each table in our database contains exactly one primary key (PRI) column.
“Islands” are tables that are completely separated from the rest of the model. They happen when a table contains no foreign keys and isn’t referenced in any other table. This really shouldn’t occur unless there is a really good reason, e.g. when tables contain parameters or store results or reports inside the model.
SET @table_schema = "classes_and_students"; SELECT tab.TABLE_NAME, (CASE WHEN f1.number_referenced IS NULL THEN 0 ELSE f1.number_referenced END) AS number_referenced, (CASE WHEN f2.number_referencing IS NULL THEN 0 ELSE f2.number_referencing END) AS number_referencing FROM INFORMATION_SCHEMA.TABLES tab LEFT JOIN -- # table was used as a reference ( SELECT INFORMATION_SCHEMA.KEY_COLUMN_USAGE.REFERENCED_TABLE_SCHEMA, INFORMATION_SCHEMA.KEY_COLUMN_USAGE.REFERENCED_TABLE_NAME, COUNT(*) AS number_referenced FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE WHERE INFORMATION_SCHEMA.KEY_COLUMN_USAGE.REFERENCED_TABLE_SCHEMA = @table_schema GROUP BY INFORMATION_SCHEMA.KEY_COLUMN_USAGE.REFERENCED_TABLE_SCHEMA, INFORMATION_SCHEMA.KEY_COLUMN_USAGE.REFERENCED_TABLE_NAME ) f1 ON tab.TABLE_SCHEMA = f1.REFERENCED_TABLE_SCHEMA AND tab.TABLE_NAME = f1.REFERENCED_TABLE_NAME LEFT JOIN -- # of references in the table ( SELECT INFORMATION_SCHEMA.KEY_COLUMN_USAGE.TABLE_SCHEMA, INFORMATION_SCHEMA.KEY_COLUMN_USAGE.TABLE_NAME, COUNT(*) AS number_referencing FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE WHERE INFORMATION_SCHEMA.KEY_COLUMN_USAGE.REFERENCED_TABLE_SCHEMA = @table_schema AND INFORMATION_SCHEMA.KEY_COLUMN_USAGE.REFERENCED_TABLE_NAME IS NOT NULL GROUP BY INFORMATION_SCHEMA.KEY_COLUMN_USAGE.TABLE_SCHEMA, INFORMATION_SCHEMA.KEY_COLUMN_USAGE.TABLE_NAME ) f2 ON tab.TABLE_SCHEMA = f2.TABLE_SCHEMA AND tab.TABLE_NAME = f2.TABLE_NAME WHERE tab.TABLE_SCHEMA = @table_schema;
What’s the idea behind this query? Well, we’re using the
INFORMATION_SCHEMA.KEY_COLUMN_USAGE table to test if any column in the table is a reference to another table or if any column is used as a reference in any other table. The first part of the query selects all the tables. After the first LEFT JOIN, we count the number of times any column from this table was used as a reference. After the second LEFT JOIN, we count the number of times any column from this table referenced any other table.
The result returned is:
In the row for the
class table, the numbers 3 and 1 indicate that this table was referenced three times (in the
on_class tables) and that it contains one attribute referencing another table (
lecturer_id). The other tables follow a similar pattern, although the actual numbers will of course be different. The rule here is that no row should have a 0 in both columns.
So far, everything has gone as expected. We have successfully imported our data model from Vertabelo to the local MySQL Server. All tables contain keys, just as we want them to, and all tables are related to each other – there are no “islands” in our model.
Now, we’ll insert some rows in our tables and use the previously-demonstrated queries to track the changes in our database.
After adding 1,000 rows in the lecturer table, we’ll again run the query from the “
Table Sizes and Properties” section. It will return the following result:
We can easily notice that the number of rows and auto_increment values have changed as expected, but there was no significant change in table size.
This was just a test example; in real-life situations, we would notice significant changes. The number of rows will change drastically in tables populated by users or automated processes (i.e. tables that are not dictionaries). Checking the size of and values in such tables is a very good way to quickly find and correct unwanted behavior.
Care to Share?
Working with databases is a constant pursuit for optimal performance. To be more successful in that pursuit, you should use any tool available. Today we’ve seen a few queries that are useful in our fight for better performance. Have you found anything else useful? Have you played with the
INFORMATION_SCHEMA database before? Share your experience in the comments below.