When we start a data warehousing project, the first thing we do is define the dimensional tables. Dimensional tables are the interesting bits, the framework around which we build our measurements. They come in many shapes and sizes. In this article, we are going to take a closer look at each type of dimensional table.
Dimensional tables provide context to the business processes we wish to measure. They tell us all we need to know about an event. Since they give substance to the measurements (i.e. fact tables) of the data warehouse (DWH) system, we spend more time on their definition and identification than on any other aspect of the project. Fact tables define the measurements; dimensional tables give context. (To read more about fact tables, check out these posts on data warehousing, the star schema, the snowflake schema, and facts about fact tables).
The main characteristic of dimension tables is their multitude of attributes. Attributes are the columns that we summarize, filter, or aggregate. They have low cardinality and are usually textual and temporal. Dimensional tables have one primary key based on the underlying business key or a surrogate key. This primary key is the basis for joining the dimension table to one or more fact tables.
Compared to fact tables, dimension tables are small in size, easy to store, and have little impact on performance.
Let’s now have a look at some of the dimension tables you’ll come across in a data warehouse environment.
A Common Dimensional Table: The Conformed Dimension
We’ll start with a basic type: the conformed dimension. These hold multiple attributes, which can be addressed in several source tables but which refer to the same domain (customer, contract, deal, etc.) Conformed dimensions are used with many facts and should be unique for grain/domain value in the data warehouse.
Let’s look at a typical dimensional table,
id– The dimension table’s primary key.
cust_natural_key– The natural key for the customer.
first_name– The customer’s first name.
last_name– The customer’s last name.
address_residence– The customer’s residential address.
date_of_birth– The customer’s date of birth.
marital_status– Is the customer married? Defined as Y (yes) or N (no).
gender– The gender of the customer, M (male) or F (female).
Dimension table attributes depend on the business need. We can expand the this type of table to hold industry-specific information (date of default, activity, etc.)
Slowly Changing Dimensional Tables
As time passes, dimensions can change their values. In the following paragraphs, we’ll examine dimensions classified by how they store (or don’t store) historical data.
Let’s say you have a customer dimension. Some attributes will likely change in the customer’s lifetime – e.g. phone number, address, marital status, etc. This type of table is what Ralph Kimball calls a slowly changing dimension, or SCD.
The SCD comes in many types; eight of them are fairly common. Of these, you’ll see types 0 through 4 the most ; types 5, 6, and 7 are hybrids of the first five. (Note: The numbering scheme of these SCDs starts with a 0 instead of a 1.)
Hybrid SCDs provide more flexibility and better performance, but at the cost of simplicity. We use these table types when we need to do analytical analysis of current data with some historical considerations.
SCD Type 0: Filling Once
This is the most basic type of the dimension table: you fill it once and never fill it again. Consider this as referential data. A typical example of this is the date dimension. We do not need to fill this dimension with every DWH load. The dimensional table does not change with time. You can not get more dates or alter dates.
The fact table connects to the original attributes of the dimension.
Let’s look at the time dimension:
The structure is pretty straightforward:
id– Surrogate key
time_date– Actual date
time_day– Day of the month
time_week– Week in the year
time_month– Month in a year
time_year– Numeral representation of a year
SCD Type 1: Rewriting Data
As the name suggests, we rewrite this type of dimensional table with every DWH load. We do not need to keep a history of them, but we expect there will be some changes.
The difference between a Type 0 SCD and a Type 1 is not in the structure of the table. It’s to do with the refreshing of the data. You never refresh the data in a Type 0, but you do sometimes in a Type 1.
A rewritable table is the simplest way to handle changes (delete/insert), but it adds little business value. Once you define a dimension table like this, it is hard to install historical tracing.
The fact table connects to the current attributes of the dimension.
Let’s look at the account dimension:
Its structure is as follows:
id– The surrogate key of the table
account_name– The name of the account
account_type– The category of the account
account_activity– Flags different types of activity
If we look at the data before the change, we’d see this:
If the account type has changed, the data would simply be overwritten:
SCD Type 2: Tracking Historical Attributes by Row
This is the most common form of historical tracking in a DWH system. SCD Type 2 tables add new rows for every historical change of dimensional attributes between DWH loads. In this type, we define the primary key as a surrogate key because the business key will have multiple representations across time. When rows that hold the change of data alter, we define a new value for the surrogate key that corresponds to the value in the fact table. We need to add at least two columns,
valid_to, to store history in this manner.
The fact table connects to the historic attributes of the dimension via the surrogate key. The aggregation is done on the natural key.
Let’s look at the previous customer dimension table, now expanded with two date columns:
Let’s look at the data:
Points to Consider:
- How can we flag the current row,
valid_to? (Usually with 31.12.9999, or NULL.)
- How can we flag the first row,
valid_from? (Usually with 01.01.1900, or the date of the first insertion).
- Do you define an inclusive or exclusive row? (Above, we use an inclusive
valid_fromand an exclusive
SCD Type 3: Tracking Historical Attributes by Column
As with the Type 2 SCD, this type adds something to represent historical values. In this case, though, we’re adding new columns. These represent the value of a dimensional row attribute before it changes. Usually, we only keep the previous version of the attribute.
Note: This SCD is rarely used.
The fact table connects to the current and prior attributes of the dimension.
Let’s look at the customer dimension, this time with a previous residential address:
In this example, we added a new column,
previous_address_residence, to represent the customer’s old address. If we look at our initial example, the data in this table would look like this:
All of the historical information, except the customer’s previous address, is lost.
SCD Type 4: Adding a Mini-Dimension
This type of dimension is not based on structural (Type 3) or on value (Type 2) changes. Rather, it is based on design changes to the model. If our dimensional table contains highly volatile data – i.e. data that changes frequently – the size of the dimensional table would grow significantly.
In order to mitigate this, we extract the volatile attributes into a mini-dimension. These mini-dimensions could then be aggregated to the business-relevant level. However, this aggregation should not be confused with fact aggregation. The example will clear this up.
The fact table connects to the historic attributes of the dimension.
Let’s look at an example from a simple financial data mart. Suppose you need to track how late certain customers are with their payments. Let’s call this attribute days past due, or DPD. If we were to track DPD every day in a Type 2 dimension, the table’s size would soon explode beyond manageable limits. So we extract the attribute and find business-relevant periods of DPD – say by 30-day increments (0-30 DPD, 30-60 DPD, 60-90 DPD, etc.)
We can take other high-volatility attributes, such as age, and construct business-relevant periods for them as well (e.g. 20-30 years old, 30-40 years old, etc.)
If we look at the data in the customer mini-dimension, we would have something like this:
The attributes are:
id– Surrogate primary key
DPD_period– Days past due period
DEM_period– Demographics period
The simple star schema would look like this:
Notice that to do any analytics on attributes from both tables, we would have to bridge them through the fact table.
SCD Type 5: Creating a Mini-Dimension with a Rewritable Extension
This is the first of our hybrid dimensional table constructs. In a Type 5 SCD, we add the current version of mini-dimensional data to the dimensional table. We must bear in mind that we will only add the current representation of the mini-dimension to the main dimension.
We refill this mini-dimension extension with every load (the Type 1 “rewritable” SCD).
Although the historization of this extension could very well lead to size problems, we’ve already mitigated them with the mini-dimension table.
We use this technique when we want to do analytics directly on the dimension tables.
Expanding the previous example with the current mini-dimension, we get:
dim_mini_customer_current table contains the most recent attribute values that correspond to the
dim_customer table. Now we can do customer-specific analytics without bridging through the fact table (which is very slow).
The fact table connects to the historic attributes of the dimension.
Type 6: Type 2 (Historical Row) Dimension with a Rewritable Attribute
This is a very common dimensional construct. We add an attribute which stores one thing, usually the last known value, that we rewrite with every load. This enables us to group all facts by their current value, while the historical attribute displays data as it was when events happened.
The fact table connects to the dimensional values at the event-time moment with extra current dimensional values.
Let’s expand the previous customer table with a
Now we have an attribute that we will update to the current value, using the natural key (
Type 7: Type 2 (Historical Row) Dimensions with a Current Mirror
We can use this type only if there is a natural key in the table dimension. The key must not change during the entity’s lifetime.
The idea is simple: we add a current representation of the dimensional table to the snowflake schema. Then we insert the natural key of the new dimension to the fact table. (The surrogate key of the historical dimension is still present.)
The fact table connects to the dimensional values at the event-time moment and to the current dimensional values.
Let’s look at our customer-account star schema. We add the new dimension,
dim_current_customer, to the fact table. This table is connected to the fact table via a natural key,
This construction enables us to do analytical queries on the star schema with both the current and historic values of customer attributes.
A domain dimension is a simple form of dimensional table. It holds information about the domain of the underlying measurement of a dimensional attribute. These tables store various codes and explanatory values.
A simple example would be a currencies table.
In this table, we store descriptive information about various monetary units.
In my personal opinion, the best use of the domain dimension is in the documentation of the data values we find in the DWH system.
Transactional source systems generate lots of indicators and flags. These attributes can be considered categorical data, but they are not business-relevant or self-explanatory. We can file all these indicators and flags in one dimensional table called a junk dimension. The junk dimension is the alternative to using degenerate dimensions. If we do not want to burden the fact table with many degenerate dimensions, we create one junk dimension.
We should note that we do not fill all possible combinations of indicators and flags. We only fill the ones that exist in the source system.
Dimension tables are the skeletons of the data warehousing world: everything is built around them. They are not as big as fact tables, but their structure can be more complex.
You can put together many types of dimension tables, even beyond the ones we have just discussed. What about your business and industry? If you’ve combined dimensional table types into something new, tell us about it!