I am currently working on a project where we need to create a database that will be primarily used to store data for reporting and forecasting. In the past, I have mostly worked with databases used for typical CRUD (create, retrieve, update, and delete) operations of data with some limited reporting features. When performing CRUD operations, normalization is important; while in analytics, a de-normalized structure is generally preferred.
Here is a simple example of a normalized order table:
As an example, I will create two simple products which can be included in an order: a Mountain Bike and a City Bike. Then I will create two orders; one of which contains 2 Mountain Bikes + 1 City Bike, and the other one which contains 1 Mountain Bike + 3 City Bikes. The tables required for these transactions would look like the following:
Here is a simple of example of how we could store the same information in a de-normalized database:
Again, showing the rows for these two orders:
As you can see information is duplicated across rows (Order_Created and Product_Name). The duplication is not a problem. A typical question asked in an analytics application is: were there more Mountain Bikes or City Bikes sold during each month of 2014? Answering this type of question is very fast since you only need to retrieve a few rows to get the complete order. No information is required from other tables so there will be no joins in the query. Once the data is inserted into this table, it is not modified so you don't have to worry about update anomalies.
Transactional Versus Analytical
CRUD applications are called on-line transactional processing (OLTP), while analytics applications are usually called on-line analytical processing (OLAP). OLAP is a somewhat new area for me. Of course, most of my OLTP database work also included analytics such as reports and simple analysis of data. However, in OLAP, the main focus of the database itself is simply to store data for analysis. There will be no real maintenance of data: no screens and transactions for creating, updating and deleting data. With this in mind, I thought it would be interesting to consider the difference in requirements between the OLTP world and the OLAP world.
The main difference between an analytical and transactional database is the nature of the application which is using the data. A transactional database is designed for an application where the user is more interested in CRUD, i.e., creating, reading, updating, and deleting records. With an analytical database, the user is more interested in analysis, reporting, and forecasting. Analytical databases have fewer inserts and very few updates; the main goal is to retrieve and analyze data as quickly as possible. Therefore, it is important to consider the nature of the usage of the data that you are designing your database for.
OLAP Versus OLTP
For a transactional database in OLTP, typically, the data structures are created in a highly normalized manner. However, you can't use a highly normalized database in OLAP. The normalization can quickly lead to issues with performance.
In other words, if you think inserts, updates, and deletes are more frequent and require the best performance then go for a normalized table design; if analysis and reports are more important, consider using a de-normalized table design with a flat table structure.
Of course, data must still be inserted into a database used for analytics, but the performance of insert operations can be lower than for a database used for transactional processing. Analytical databases might have their data inserted/updated once per month or once per year, in a batch process.
To oversimplify a bit, the main challenge when designing an analytical database is to “unlearn” the rules and practices that we typically follow when designing a transactional database – or, at least, put them on hold while working on the analytical database design. De-normalization will improve performance as there will be fewer joins of tables than in a highly normalized structure.
At the same time, recognize that you are making a trade-off by having a de-normalized data structure.
Typically, when working with data that will be used in analysis, reporting and forecasting, we consider a dimension and fact design by creating a fact table connected with dimension tables using foreign key relationships.
Fact tables group metrics that can be aggregated (“sum of …”, “count of …”) into business facts (referred to as measures) with foreign keys to dimension tables. Dimension tables contain attributes that are used to generate groups and filters for the data.
As mentioned, the fact tables should link to dimension data to allow filtering, e.g. “the sum of costs between 2010 and 2014.”
For example, we might have a fact table that records specific sales events. We might also have different dimension tables with many columns to facilitate analysis of time, geography, product, or some other dimensions, depending on our query requirements. For example, we might need a query that analyses the store locations that sold the most Mountain Bikes each month, or which cities or states had the most sales of City Bikes, or which day of the week has the highest bikes sales.
Another benefit of a dimensional model is that it will more easily accommodate unexpected data in the future. Existing tables can be modified by adding new data rows into the table or altering the table with little or no impact. Queries and applications that use the data do not need to be modified to accommodate changes; existing queries and applications work as previously and obtain the same results as before the change.
OLAP has a lot more flavors: there is multi-dimensional OLAP, relational OLAP, hybrid OLAP and other evolving acronyms in the OLAP space. I will not talk about them here. You can find more information here on Wikipedia and other resources (search for “Online analytical processing”, “OLAP Cube”, “MOLAP”, “ROLAP”, “HOLAP”).
Working on a project to create an application that is focused on analytics, reporting, and especially forecasting, has forced me to recognize the different requirements between transactional and analytical database design.
I would be interested to hear your experiences about working in an analytics world versus working in a transactional world.