A data warehouse is a system that uses different technologies – including relational databases – to enable analytical reporting, which aids in tactical and strategic decision-making. Learn about all the key concepts of data warehousing in this article.
Understanding how a data warehouse (DWH) works means more than just understanding the database designs for that DWH. You also have to understand online analytical processing (OLAP) and online transaction processing (OLTP) systems and extract-transform-load (ETL) procedures. That’s exactly what we’ll cover in this article.
Let’s start with the differences between OLAP and OLTP. They’re quite different and are used for different things, as you’ll see in the following example.
An Example of OLTP and OLAP Queries
Imagine you’ve logged into an e-commerce website to check your order status. Suppose you ordered a laptop, a book, and some groceries. These orders were placed and paid at the same time. However, the items will be shipped from different locations and are likely to have different delivery dates.
In a typical database design, there will be one table to store the order header details (customer information, shipping address, etc.) and an order item table with one record for each of the items ordered – the delivery dates and item prices are the key attributes. When you click on Your Orders, a database query similar to the following will be executed:
SELECT * FROM order o JOIN order_items oi on o.id = oi.order_id WHERE o.customer_id = 100;
The filter clause
WHERE o.customer_id = 100 will eliminate millions or even billions of records from the tables and will end up fetching just a few records. This is a typical online transaction processing query running on an OLTP system.
Now, imagine that you’ve the head of inventory management for that e-commerce company. You need a report that compares a certain product’s inventory status across all warehouses for the current holiday season and the previous holiday season. This is an analytical report that will require an OLAP system. Such a query will scan many tables and aggregate data across millions of records and across time periods (the current year vs. last year). The performance expectation, the load on the system, and the objective of this query are very different from that of a customer order query.
Now that you understand the broad differences between OLTP and OLAP queries, let’s look at the differences between OLTP and OLAP systems and expectations.
OLTP and OLAP Systems and Expectations
The following areas show the main differences between online transaction processing and online analytical processing:
Frequency of Execution
A typical OLTP query may be executed hundreds or thousands of times an hour, triggered by the actions of customers who are checking the status of their orders.
An OLAP query may be triggered by the actions of a relatively small group of users and may be run infrequently.
Volume of Data Fetched
An OLTP query will retrieve a few records from the database.
An OLAP query will usually fetch and process many thousands or millions of records.
An OLTP query will return the data to the user in a few seconds.
An OLAP query can take many seconds before data is shown to the user.
An OLTP system user is interested in data at a very atomic level – a few orders, a few transactions, etc.
An OLAP user is more interested in aggregate data, e.g. in trends and patterns.
Time Period/Data Retention
OLTP systems are not generally expected to provide historical data.(Historical data, e.g. for orders that were placed three or more years ago, is usually of no interest to customers.) An archival or purging mechanism may be used to keep the OLTP database as small as possible so that query response times can be kept low.
OLAP systems treat historical data as critical. Comparisons across many years are standard in this type of system. While an archival or purging mechanism may be used, granular data will usually be retained for many years; aggregate data may be retained for even longer periods.
Number of Source Systems
All the data necessary for an OLTP request will be available in the same system. This is changing with a move towards microservice architecture patterns.
Data required for OLAP requests may need to be pulled from various systems.
From the above, we can conclude that:
- OLTP and OLAP systems need to be designed differently.
- Processes must be in place to move data from OLTP systems to OLAP systems.
- OLAP users need specialized reporting tools that focus on data analysis.
Thus, changes have occurred in database management systems. While conventional databases like Oracle and PostgreSQL kept adding features – e.g. materialized views, partitioning, etc. – to facilitate analytics, new databases that focused on analytical workloads also entered the market. Greenplum is a classic example. Cloud-native solutions such as Amazon Redshift and Snowflake and products for specific types of analysis (like TimescaleDB for time series analysis) are also available.
Now that we understand the difference between OLTP and OLAP, let’s move on to the next topic: the ETL process.
What Is ETL?
Data must be moved from one or more OLTP systems to the OLAP system. Very often, external data is also sourced and added to make analysis more meaningful. For example, banks may use data procured from credit rating agencies as well as their own internal data for better risk analysis. All this internal and external data needs to be processed and loaded into the OLAP, or data warehousing system. This involves data extraction, transformation, and loading, or ETL. (Note: sometimes data transformation is performed after loading; then the process is referred to as ELT.)
There are various tools that help automate these processes. Informatica, Ab Initio, Talend, SQL Server Integration Services, and other tools have been around for many years. AWS Glue and Fivetran are Cloud-native ETL tools you might want to consider if your infrastructure is mostly in the Cloud.
Database design for Data Warehouses
Now that we understand the general processes and use cases, let’s see how databases are designed for data warehousing.
Data warehouses usually have a star schema design. For a high-level walk through, please refer to this article.
In a DWH, tables are classified as either fact or dimension tables. One fact table will be surrounded by many dimension tables. This looks a bit like a star constellation, so it’s called a star schema. Table names starting with dim_ signify dimension tables, while names starting with fact_ signify fact tables. Here’s an example of a star schema:
Fact tables are used to store numbers or metrics related to transactions. Things like the number of units sold, price, time to ship, and time to deliver are all considered good candidates to be stored in fact tables. For a deeper understanding, see this article on fact tables.
Dimension tables contain master data. They are similar to master tables in an OLTP database, with a few crucial differences:
- Most of the attributes related to the entity will be stored in the dimension table. For example, a customer’s name and address fields such as city, country etc will be stored in the same dimension. In an OLTP system , this data may be distributed/stored in multiple tables (e.g. a master table for countries, another one for cities, etc.).
- History Retention. One typical question is “Did customer behaviour change as a result of an attribute change?” For example, is a customer’s spending going up as a result of their change in marital status? A history of attribute changes will be retained for such cases. For more details, refer to this article on dimension tables.
- Deep Hierarchy. Storing deep hierarchies in the dimension tables helps rollup and drill-down analyses. A standard date dimension may store the fiscal week, fiscal quarter, day of the week and many similar attributes as well as the date. These make rolling up from daily to quarterly or yearly sales easier.
A variation of the star schema called the snowflake schema, where dimensions are normalized to some extent, is also commonly used in data warehouses. You can read the comparison in our article Star Schema vs. Snowflake Schema.
The database is designed (using a star or snowflake or any other approach), and the data is populated using ETL or ELT. Now we need to make sure that end users can analyze the data in any way they want. This is where OLAP tools come into play.
Many OLAP tools feature standard reports that can be generated and published by a number of users. Similarly, some users (i.e. power users) can also generate ad hoc reports using a canvas where objects can be dragged and dropped, i.e. filters can be added and a display format can be selected – tabular, pie chart, histogram, etc.
More sophisticated tools also provide what-if analyses: if the price was X instead of Y, what would have been the profit margin? The ability to extrapolate data and see how numbers could change in the future is another useful analytical feature.
Learn More About Data Warehouses
In this article, we’ve covered the differences between OLTP and OLAP systems. We also explained the different pieces that go into designing and developing a data warehouse. If you want to continue learning about data warehousing, do go through the links in this article. Or you can browse the data warehousing articles on the Vertabelo blog. Happy learning!