Back to articles list

April 21, 2016 - 7 minutes read

Design patterns

The Snowflake Schema

Emil Drkušić

Database designer and developer, financial analyst.

Tags:

In a previous article we discussed the star schema model. The snowflake schema is next to the star schema in terms of its importance in data warehouse modeling. It was developed out of the star schema, and it offers some advantages over its predecessor. But these advantages come at a cost. In this article, we’ll discuss when and how to use the snowflake schema.

The Snowflake Schema

The snowflake schema’s name comes from the fact that dimension tables branch out and look something like a snowflake. When we look at the model above, we’ll notice it’s a fact table surrounded by a few dimension tables, some of which do the aforementioned branching. Unlike the star schema, dimension tables in the snowflake schema can have their own categories.

The ruling idea behind the snowflake schema is that dimension tables are completely normalized. Each dimension table can be described by one or more lookup tables. Each lookup table can be described by one or more additional lookup tables. This is repeated until the model is fully normalized. The process of normalizing star schema dimension tables is called snowflaking.

You’ll be hearing a lot about normalization in this article. What is normalization? Basically, it’s organizing a database in a way that minimizes redundancies and protects data integrity. Check out this post to learn more about normalization and denormalization.

Snowflake Schema Example: Sales Model

Previously, we used a star schema to model a fictional sales department – this would be akin to a data mart used to track sales activities and results. The model has five dimensions: product, time, store, sales type and employee. In the fact_sales table, price and quantity are stored and grouped based on values in dimension tables. For a refresher, have a look at the star schema sales model below:

Here is the same model organized as a snowflake schema:

The dim_employee and dim_sales_type dimension tables are exactly the same as in the star schema model because they are already normalized.

On the other hand, we applied normalization rules to the rest of the dimension tables.

dim_product dimension table split into two tables in snowflake model

The dim_product dimension table from the star schema is split into two tables in the snowflake model. The dim_product_type table was added to reference the matching type in the dim_product table. Using this, we avoided some data integrity problems.

It’s logical to assume that we’ll already have all of the product names and their related types inserted as part of the ETL process, but suppose we need to add more product names and types. In a star schema we could mistakenly enter the wrong product type into the table. In the snowflake schema:

If we encounter new product type name, we can add a new product type and then relate that type to a newly added record. However, this could result in the user entering wrong information, just as in star schema.
We could check to see if the product name that we want to add already exists. If so, we can get its ID; if not, a warning will come up asking us if we want to add a new product and a related type.

dim_store dimention-table represented by 5 tables in snowflake model

The dim_store dimension table from the star schema is represented by 5 tables in the snowflake schema. These divide the city, region, state and country attributes that were stored in the dim_store table. Normalizing this table not only avoided data integrity risk, it also saved some disk space.

dim_time table represented by 5 tables in snowflake model

The dim_time dimension is represented with five tables. We can think of dim_week, dim_month, dim_year and the dim_weekday tables as dictionaries that describe the dim_time table.

The dim_week, dim_month, dim_year and dim_weekday tables are four different hierarchies used to describe our time dimension. We could add more dimensions like quarters or other related tables if we needed them. In this example, dim_month is a dictionary containing 12 months; from this dimension alone, we have no way of knowing which year that month belongs to; that’s the function of the dim_year table.

Snowflake Schema Example: Supply Orders Model

The other data mart we discussed was for supply orders. The idea is to store and aggregate all supply order data for the following four dimensions: product, time, supplier and employee. Once again, we’ll take a look at the relevant star schema:

Converting this to the snowflake schema, we get the following model:

The same normalization rules as those described for the sales model were used on the dim_product, dim_time and dim_supplier dimension tables.

Advantages and Disadvantages of the Snowflake Schema

There are two main advantages to the snowflake schema:

Better data quality (data is more structured, so data integrity problems are reduced)
Less disk space is used then in a denormalized model

The most notable disadvantage for the snowflake model is that it requires more complex queries. These queries, with their increased number of joins, could decrease performance significantly.

We’ll rewrite the same query used in the star schema article for the snowflake schema sales model. Here’s the query needed to return the quantity of all phone-type products type sold in Berlin stores in 2016:

SELECT 
  dim_store.store_address,
  SUM(fact_sales.quantity) AS quantity_sold

FROM 
  fact_sales
  INNER JOIN dim_product ON fact_sales.product_id = dim_product.product_id
  INNER JOIN dim_product_type ON dim_product.product_type_id = dim_product_type.product_type_id
  INNER JOIN dim_time ON fact_sales.time_id = dim_time.time_id
  INNER JOIN dim_year ON dim_time.year_id = dim_year.year_id
  INNER JOIN dim_store ON fact_sales.store_id = dim_store.store_id
  INNER JOIN dim_city ON dim_store.city_id = dim_city.city_id

WHERE 
  dim_year.action_year = 2016
  AND dim_city.city = 'Berlin'
  AND dim_product_type.product_type_name = 'phone'

GROUP BY 
  dim_store.store_id,
  dim_store.store_address

The Starflake Schema

A starflake schema is a combination of the snowflake and the star schemas. We can view it as a snowflake schema that has some dimension tables denormalized. When used correctly, the starflake schema can give a best-of-both-worlds approach. Obviously, the snowflake part of the model should save disk space, while the star part should improve performance.

The model above is basically a snowflake model with a denormalized dim_time table. Since this schema reduces the number of query joins needed, it could improve performance. On the other hand, we won’t lose a notable amount of the disk space, as most of the table attributes and foreign key attributes share the int type.

The Galaxy Schema

In data warehousing, a galaxy schema is when two or more fact tables share one or more dimension tables. One reason to use this schema is to save disk space. We’ve created a sample galaxy schema below:

Here, we have two fact tables, fact_sales and fact_supply_order, that directly share three dimension tables: dim_product, dim_employee and dim_time. Notice that even dim_store and dim_supplier share the same lookup table, dim_city.

We’ll save space this way, but we must have a few things in mind before we join two data marts (in this case, sales and supply orders) into one galaxy schema:

Is there any logic behind joining them? E.g. Would both data marts be used by the same department?
Are we sure that we need precisely the same dimension and granulation for both data marts?

The snowflake schema is often used in data modeling. It may be the right choice in situations where disk space is more important than performance. If we want a balance between space-saving and performance, we can use the starflake schema. Still, the right fit for any specific problem depends on many parameters. This is one of the areas in IT where we can ‘play’ with factors to come up with the best solution.

Tags:

The Snowflake Schema

Snowflake Schema Example: Sales Model

Snowflake Schema Example: Supply Orders Model

Advantages and Disadvantages of the Snowflake Schema

The Starflake Schema

The Galaxy Schema

You may also like

Spider Schema – a Bridge Between OLTP and OLAP?

How (And How Not) to Decompose Relations

7 Common Database Design Errors

13 Blog Articles on Database Design Best Practices and Tips

19 Online Resources for Learning About Database Design Errors

Naming Conventions in Database Modeling