The world is changing.
No – the world as we knew it in IT has changed.
Big Data & Agile are hot topics.
But companies still need to collect, report, and analyze their data. Usually this requires some form of data warehousing or business intelligence system. So how do we do that in the modern IT landscape in a way that allows us to be agile and either deal directly or indirectly with unstructured and semi structured data?
First off, we need to change our evil ways – we can no longer afford to take years to deliver data to the business. We cannot spend months doing detailed analysis to develop use cases and detailed specification documents. Then spend months building enterprise-scale data models only to deploy them and find out the source systems changed and the models have no place to hold the now-relevant data critical to business success.
We have to be more agile than that.
We need a way to do Agile Data Engineering if we ever expect to achieve the goal of Agile Data Warehousing.
The Data Vault System of Business Intelligence provides (among other things) a method and approach to modeling your enterprise data warehouse (EDW) that is agile, flexible, and scalable. This is the first in a series of posts introducing you to the Data Vault, what it is, and how to build one.
Why Data Vault?
Current approaches to modeling a data warehouse include developing 3rd Normal Form (3NF) type models or dimensional star schema models. Now, while there are indeed architects and modelers out there who have been wildly successful using these approaches, there are many more that have failed, and failed big.
Why? Mostly lack of experience, but more so that these approaches have some basic issues that, while resolvable, do require a certain level of engineering expertise that is not readily available in the industry today (and is declining daily).
What are these issues?
In order to collect history in the usual 3NF approach you need to add a timestamp or snapshot date to every table in your model. Not only that but it needs to be in the primary key of the table so that you do not load duplicate rows on any given day. This of course complicates all the cascading foreign key (FK) relationships. In Figure 1 you can see a simple example of a standard Product and Product Category table with Snapshot Dates added to the keys.
Figure 1. Basic model with Snapshot Dates added
Even with this simple model you can start to see the issues. Adding a snapshot date to every PK makes all the keys more complicated. Imagine a master-detail-detail relationship with three levels of snapshot dates. This certainly will complicate not only your change data capture for loading the data warehouse but also your SQL to get the data out. How do you manage all those dates to get a current view of the data (let alone a specific point in time)? While this was a great approach to ensure you had a full history of data, the resulting SQL becomes non-trivial to develop and maintain.
What if you missed a relationship in your initial modeling? Adding a new FK to this model would require reloading the table and potentially losing any change history you had already captured.
Similarly, in a dimensional model, it is not uncommon to get new requirements late in the game that cause you to rethink your design. For example, you have a sales fact table with the dimensions Customer, Product, and Supplier. You build the fact and the dims. Load the data from your source, aggregating by Customer, Product, and Supplier. Done.
Then you build a few reports or even prototype with a pivot table and the business realizes they forgot to say they need to do the analysis by Region and Sales Person as well so they can calculate commissions. You now have two choices – build a new fact table or alter the existing fact table.
If you build a new fact table, you have to load the data from your source system, because the current fact table only has aggregated data. So now you have two load routines hitting the source for the same transactions. If you alter the existing fact table to add the FKs for the new dimensions, you have to reload that table from scratch (possibly losing history) because the current data is at the wrong grain.
Yes, there are solutions for these issues. People do this all the time. But it is not agile. Nor very forgiving of missteps.
Regardless of your skill, these types of issues become bigger and harder to handle at scale. Imagine your warehouse is hundreds of tables with millions or billions of rows. Refactoring and reloading tables at that scale will at least take a lot of time.
The Data Vault Modeling Method gives us an Agile Data Engineering approach to avoid these issues.
Formal Definition of Data Vault
This is the formal definition as written by the inventor Dan Linstedt:
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.
It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.
The main point here is that Data Vault (DV) was developed specifically to address these and other issues in the data warehousing space. It was built to be a non-volatile, auditable, historical repository of enterprise data. (It does not, however, replace star schemas needed at the reporting layer – more on that later).
Foundational Keys to Data Vault
There are three key features found in the Data Vault approach that allow us to use it as an agile modeling technique. They are:
As we go through this series of posts, I will show you how to model using Data Vault and how that approach allows us to achieve these three foundation goals.
Where Data Vault fits in the architecture
To be clear, there is nothing really new here. The Data Vault fits neatly where you see an EDW in most architecture diagrams. In Bill Inmon’s Corporate Information Factory (CIF) approach, the DV fits right between the Stage layer and the Data Mart Layer. In a Kimball Bus Architecture, the DV becomes the Persistent Staging Area that feeds the Dimensional Data Warehouse.
So at a conceptual level, there is nothing new. It is rather the modeling technique (and the related loading paradigm) that are new (and much more suitable for agile).
Evolution of Data Vault Modeling
So how did we get here? Where did Data Vault come from?
Permit me for a moment to stroll down memory lane with regard to data modeling. Here is a brief timeline:
- E.F. Codd begins research on relational modeling (you know – 3NF and all that)
- Dimension & Fact Modeling is developed by General Mills & Dartmouth University
- E.F. Codd publishes “A Relational Model of Data for Large Shared Data Banks” – 1970
- Bill Inmon introduces the idea of a Data Warehouse
- AC Nielson popularizes the concepts of Facts and Dimensions
- Dr. Peter Chen invents Entity Relationship Diagraming (ERDs) – 1976
- Bill Inmon popularizes (and publishes) Data Warehousing
- Dr. Ralph Kimball popularizes Star Schema design
- Dr. Kimball and Barry Devlin release “Business Data Warehouse”
- Dan Linstedt begins R&D on Data Vault Modeling
- Dan Linstedt released the first 5 articles on Data Vault
- Dan starts giving public seminars on Data Vault
- Yours truly signs up!
- Bill Inmon is quoted as supporting Data Vault as the method for his new DW 2.0 approach.
- Dan formally introduces Data Vault 2.0 at the 1st Annual World Wide Data Vault Consortium (WWDVC) in St Albans, Vermont to an international audience
- 2nd Annual WWDVC is held in Stowe, Vermont
- 3rd Annual WWDVC will be held in Stowe, Vermont. (Sign up here)
Who is using Data Vault?
Data Vault has been taught and in use around the world since 2002. Organizations you might know and some you don’t have been using it for that long. Somewhere in the secret bowels of the US Department of Defense (USDOD) it is used (but Dan can’t say where). I set it up myself at Denver Public Schools, and used a hybrid of Data Vault 2.0 at McKesson Specialty Health. A huge bank in Australia is using DV 2.0 and adding automation!
Here is a short list of a few more notable implementations:
- WebMD Health Services
- Anthem Blue-Cross Blue Shield
- MD Anderson Cancer Center
- Independent Purchasing Cooperative (IPC, Miami) – Owner of Subway
You can see a long list on Dan’s site here.
Introducing Data Vault 2.0
Before I conclude this article, I wanted to give you a glimpse at what is new in Data Vault 2.0.
One of the notable features is that the modeling approach has been augmented to allow you to deploy parts of the Data Vault on a Hadoop or NoSQL engine (very cool). I will show you the specific changes for that in my next post.
But importantly, DV 2.0 is not just about the modeling. It also includes best practices for architecture, loading and methodology. The approach has been expanded to specifically include an agile method based on Scott Ambler’s Disciplined Agile Development.
So Why Try Data Vault?
There is an old Chinese proverb:“Unless you change direction, you're apt to end up where you're headed.”
So do you like where you are in your data warehouse practice today? Do you like where you are heading? Are you agile or agile-enough?
Will you be successful?
Stay tuned for the rest of this series where I will show you the basics of modeling a data vault, then look at the concepts of a Business Vault, and finally how to build your Information Marts from your Data Vault.