Some years ago, when PostgreSQL version 8.3 was released, a new extension called tablefunc was introduced. This extension provides a really interesting set of functions. One of them is the crosstab function, which is used for pivot table creation. That’s what we’ll cover in this article.
The simplest way to explain how this function works is using an example with a pivot table. First, we will explain our initial point from a practical perspective, then we’ll define the pivot table we want.
Our Initial Point: Raw Data
As you read this article, imagine yourself as a teacher at a primary (elementary) school. We will assume that you teach every subject (language, music, etc). The school provides a system for you to record all evaluation or test results. The following SQL statement would show the evaluation results you’ve previously loaded into the system:
Select * from evaluations
Our Target: a Pivot Table
The following grid could easily keep track of your students’ progress. In computer science, we call this kind of grid a pivot table. If you analyze how the pivot table is built, you will find that we use values from raw data as column headers or field names (in this case, geography, history, maths, etc.)
Enabling the Crosstab Function
As we previously mentioned, the crosstab function is part of a PostgreSQL extension called tablefunc. To call the crosstab function, you must first enable the tablefunc extension by executing the following SQL command:
CREATE extension tablefunc;
How the Crosstab Function Works
The crosstab function receives an SQL SELECT command as a parameter, which must be compliant with the following restrictions:
- The SELECT must return 3 columns.
- The first column in the SELECT will be the identifier of every row in the pivot table or final result. In our example, this is the student’s name. Notice how students’ names (John Smith and Peter Gabriel) appear in the first column.
- The second column in the SELECT represents the categories in the pivot table. In our example, these categories are the school subjects. It is important to note that the values of this column will expand into many columns in the pivot table. If the second column returns five different values (geography, history, and so on) the pivot table will have five columns.
- The third column in the SELECT represents the value to be assigned to each cell of the pivot table. These are the evaluation results in our example.
If we think of our pivot table as a two-dimensional array, then the first SELECT column is the first array dimension, the second SELECT column is the second dimension, and the third is the array element value.like grid [first_column_value, second_column_value] = third_column_value.
In our example, the SELECT parameter will be:
SELECT student, subject, evaluation_result FROM evaluations ORDER BY 1,2
The crosstab function is invoked in the SELECT statement’s FROM clause. We must define the names of the columns and data types that will go into the final result. For our purposes, the final result is defined as:
AS final_result(Student TEXT, Geography NUMERIC,History NUMERIC,Language NUMERIC,Maths NUMERIC,Music NUMERIC)
Joining all these pieces, our final query will be:
SELECT * FROM crosstab( 'select student, subject, evaluation_result from evaluations order by 1,2') AS final_result(Student TEXT, Geography NUMERIC,History NUMERIC,Language NUMERIC,Maths NUMERIC,Music NUMERIC);
And we can see the result in here:
One Raw Data Set, Many Pivot Tables
From a single data set, we can produce many different pivot tables. Let’s continue with the teacher-and-class example as we look at a few of our options.
Example 1: Monthly Evaluation Averages
As teachers, we may also need a report for a student’s evaluation results for the year to date. For instance, suppose we want to obtain the average evaluations for John Smith from March to July. In a grid like the following, the table would look like this:
|month text||geography numeric||history numeric||language numeric||maths numeric||music numeric|
The SQL for this pivot table is:
SELECT * FROM crosstab( 'select extract(month from period)::text, subject.name, trunc(avg(evaluation_result),2) from evaluation, subject where evaluation.subject_id = subject.subject_id and student_id = 1 group by 1,2 order by 1,2') AS final_result(Month TEXT, Geography NUMERIC,History NUMERIC,Language NUMERIC,Maths NUMERIC,Music NUMERIC);
Please compare this output with the previous table:
Example 2: Finding Incomplete Student Records
We could also call this section ‘One Limitation of Crosstab and How to Fix It’. Before we get into that, let’s set the scene:
Suppose you want to see if some students don’t have an evaluation score for certain subjects. Perhaps you will try our previous query, adding a WHERE clause for July. The code would look like this:
SELECT * FROM crosstab( 'select student, subject, evaluation_result from evaluations where extract (month from evaluation_day) = 7 order by 1,2') AS final_result(Student TEXT, Geography NUMERIC,History NUMERIC,Language NUMERIC,Maths NUMERIC,Music NUMERIC);
The following pivot table is the result of this query. We can quickly see that we don’t have any grades for language, maths, and music for Peter.
However, if we try a regular query to obtain Peter’s grades in July ...
SELECT * from evaluations where extract ( month from evaluation_day)=7 and student like 'Gabriel, Peter'
… we get different results. Here we have grades for geography and language:
Of course, the second query is the correct one because it is showing raw data. The problem is in the pivot table building process – some categories are missing information. To fix this, we can use the crosstab function with a second parameter, which represents the complete list of categories. If there are missing values, the pivot table will still be built correctly. (I’ve highlighted the second parameter query in red.)
SELECT * FROM crosstab( 'select student, subject, evaluation_result from evaluations where extract (month from evaluation_day) = 7 order by 1,2', 'select name from subject order by 1') AS final_result(Student TEXT, Geography NUMERIC,History NUMERIC,Language NUMERIC,Maths NUMERIC,Music NUMERIC);
Now we have a correct pivot table with empty values in the proper places.
Pivot tables give us a different way to see our data. Moreover, we can create different pivot tables based on the same raw data by using the crosstab function. Try building a pivot table that shows the max temperature for each city and month based on the raw data in the table below.
CREATE TABLE weather (city text, when timestamp, temperature float);
The pivot table should have one row for each city and one column for each month. If you like, you can think of other pivot tables that could be made using the same data.