In my last post, we looked at the basics of modeling your data warehouse using the Data Vault 2.0 technique. This time, we get into the details of how we prepare the DV tables for business user access.
What is a Business Data Vault?
If you have done any investigation into Data Vault on various blogs or the LinkedIn discussion group, you have seen a few terms used that often cause confusion. These terms include:
- Business Vault
- Information Vault
- Business Data Vault
- Raw Vault
- Raw Data Vault
So first off, we need to say that from a terminology perspective:
Raw Vault = Raw Data Vault = Data Vault = The Vault
When the majority of people say “Data Vault,” they are referring to what we now may call a Raw Vault. Prior to the addition of the Business Vault, the Raw Vault was universally called the Data Vault, the term “raw” was added so we could be clear on which part of the architecture we were discussing.
What is a Raw Vault? It is what I described in my last post – it is the raw, unfiltered data from the source, loaded into Hubs, Links, and Satellites based on Business Keys.
Now for the definition of a Business Vault (BV): It is an extension of a Raw Vault that applies selected business rules, denormalizations, calculations, and other query assistance functions in order to facilitate user access and reporting. Business Vault tables should be refreshed once their dependent Raw Vault tables are refreshed.
Some more details about what a Business Vault is…
It is first off modeled in DV Style tables (that is Hubs, Links, and Sats), but it is not a complete copy of all the objects in the Raw Vault. It is selective in that we create only structures that hold some significant business value. In this case I mean we will transform the data in some way to apply rules or functions that most of the business users will find useful (as opposed to doing these repeatedly into multiple marts). This includes things like data cleansing, data quality, accounting rules, or well-defined repeatable calculations (e.g., net profit, total compensation, etc.).
Adding Business Vault tables is an option to the Raw Vault. It is not required. You can have a perfectly compliant Data Vault without one. As usually it depends on the business and reporting requirements you have to fulfill.
The primary group to use or access the BV is your power users. These are people who understand SQL and relational models well and are not afraid of having many table joins. They may need to do some exploration or data mining but either do not want to wait for the star schemas to be built or are not doing dimensional analysis (remember the BV is Data Vault style table and joins!). The majority of your users should not be given access to this layer of the architecture, rather they should be querying the data via the Information Mart layer (more on that in the next article), or via a Business Intelligence tool.As of Data Vault 2.0, the Business Vault also includes some specialty tables that help us in building more efficient queries against the Raw Vault. These are Point-in-Time (PIT) tables and Bridge Tables.
A Point-in-Time (PIT) table is a modified Satellite that helps when we need to query data from a Hub that has multiple Satellites.
Take for example, this Employee Hub which has four Satellites:
Figure 1 – Hub with Multiple Sats
A Hub can have multiple Sats for several reasons:
- There are multiple source systems with different attributes coming from each source
- There are specific data classifications that you want to split out
- The rate of change of certain attributes if faster/slower than others
- Combinations of the above
In the example in Figure 1, you see four Sats with different types of data:
Name (and other basic demographics),
Contacts. Now while they may each be coming from different source systems, we can’t really tell with the information we have. However we can guess that an employee’s address probably changes more often than their name, and hopefully their compensation more often than that! The same may be true for an employee’s contact information.
Why split on rate of change? Simple math. Every time a value changes on one attribute on a row, we have to load the entire row into the Sat. If one attributes changes 100 times, we load 100 rows. So even if the rest of the attributes did not change, we still have to load that data. This can cause an explosion in size for a large Sat with a few rapidly changing attributes.
So by separating these attributes out by rate of change, we help reduce the overall disk storage required for the Hub. And query performance for those Sats will be better if there are fewer rows (at least on most databases).
With multiple Sats being loaded independently (based on when the data changes), we now need to figure out how we get the data out of the Raw Vault and get a result that is consistent across time. This will usually require some form of nested sub query. Even to get the most current set of values will require something like this:
SELECT H.HUB_EMPLOYEE_KEY, H.EMPLOYEE_ID, S.FIRST_NAME, S.LAST_NAME, S.HIRE_DATE FROM HUB_EMPLOYEE H JOIN SAT_EMPLOYEE_NAME S ON H. HUB_EMPLOYEE_KEY = S. HUB_EMPLOYEE_KEY WHERE S.SAT_LOAD_DTS = (SELECT (MAX(S2.SAT_LOAD_DTS) FROM SAT_EMPLOYEE_NAME S2 WHERE H. HUB_EMPLOYEE_KEY = S2. HUB_EMPLOYEE_KEY)
If we want to include data from one of the other Sats, the query gets a bit more complex:
SELECT H.HUB_EMPLOYEE_KEY, H.EMPLOYEE_ID, S.FIRST_NAME, S.LAST_NAME, S.HIRE_DATE, C.SALARY FROM HUB_EMPLOYEE H JOIN SAT_EMPLOYEE_NAME S ON H. HUB_EMPLOYEE_KEY = S. HUB_EMPLOYEE_KEY WHERE S.SAT_LOAD_DTS = (SELECT (MAX(S2.SAT_LOAD_DTS) FROM SAT_EMPLOYEE_NAME S2 WHERE H. HUB_EMPLOYEE_KEY = S2. HUB_EMPLOYEE_KEY) JOIN SAT_EMPLOYEE_COMPENSATION C ON H. HUB_EMPLOYEE_KEY = C. HUB_EMPLOYEE_KEY WHERE C.SAT_LOAD_DTS = (SELECT (MAX(S3.SAT_LOAD_DTS) FROM SAT_EMPLOYEE_NAME S3 WHERE H. HUB_EMPLOYEE_KEY = S3. HUB_EMPLOYEE_KEY)
And so on… (you need additional clauses for every additional Sat table accessed).
Now imagine you need not the current view of the data, but a picture of the historical data from a point-in-time. After all this is a data warehouse, right?
You now need to align the various load dates so you get the row from each Sat that was valid during a specific time period or on a specific date. The SQL now gets longer and more complicated; beyond the abilities of many SQL people. They query now looks like this:
SELECT H.HUB_EMPLOYEE_KEY, H.EMPLOYEE_ID, S.FIRST_NAME, S.LAST_NAME, S.HIRE_DATE, C.SALARY FROM HUB_EMPLOYEE H JOIN SAT_EMPLOYEE_NAME S ON H. HUB_EMPLOYEE_KEY = S. HUB_EMPLOYEE_KEY WHERE S.SAT_LOAD_DTS = (SELECT (MAX(S2.SAT_LOAD_DTS) FROM SAT_EMPLOYEE_NAME S2 WHERE H. HUB_EMPLOYEE_KEY = S2. HUB_EMPLOYEE_KEY AND S2.SAT_LOAD_DTS < ’25-Nov-2014’) JOIN SAT_EMPLOYEE_COMPENSATION C ON H. HUB_EMPLOYEE_KEY = C. HUB_EMPLOYEE_KEY WHERE C.SAT_LOAD_DTS = (SELECT (MAX(S3.SAT_LOAD_DTS) FROM SAT_EMPLOYEE_NAME S3 WHERE H. HUB_EMPLOYEE_KEY = S3. HUB_EMPLOYEE_KEY AND S2.SAT_LOAD_DTS < ’25-Nov-2014’) …
(Note: this is why it is said that the Raw Vault is NOT for end user query – it is for efficiently storing the historical data.)
Hence the need for the Business Vault feature called the PIT table.
The PIT table will help us simplify the query by doing the load date alignment for us. The PIT table is loaded after or along with the Raw Vault (once it’s dependent RV tables are loaded of course). In Figure 2, there is now an
Figure 2 – Employee Hub with PIT added
TIP: Technically speaking, the PIT table is part of the BV while the rest of the tables are in the Raw Vault. From an implementation perspective, they can (and probably should) live in the same schema.
Here now is an example of what the data in this PIT table might look like:
So now if we need to get a picture of the data as it was on February 15, 2015, we simply join to the PIT table to find which load dates to use.
SELECT H.HUB_EMPLOYEE_KEY, H.EMPLOYEE_ID, S.FIRST_NAME, S.LAST_NAME, S.HIRE_DATE, C.SALARY, … FROM HUB_EMPLOYEE H JOIN SAT_EMPLOYEE_NAME NM ON H. HUB_EMPLOYEE_KEY = NM. HUB_EMPLOYEE_KEY JOIN SAT_EMPLOYEE_COMPENSATION COMP ON H. HUB_EMPLOYEE_KEY = COMP. HUB_EMPLOYEE_KEY JOIN SAT_EMPLOYEE_ADDRESS ADDR ON H. HUB_EMPLOYEE_KEY = ADDR. HUB_EMPLOYEE_KEY JOIN SAT_EMPLOYEE_CONTACTS CNTC ON H. HUB_EMPLOYEE_KEY = CNTC. HUB_EMPLOYEE_KEY JOIN EMPLOYEE_PIT P ON H. HUB_EMPLOYEE_KEY = P. HUB_EMPLOYEE_KEY AND P.PIT_LOAD_DTS = ’15-FEB-2015’ AND P.NAME_LOAD_DTS = NM.SAT_LOAD_DTS AND P.ADDRESS_LOAD_DTS = ADDR.SAT_LOAD_DTS AND P.CONTACT_LOAD_DTS = CNTC.SAT_LOAD_DTS AND P.COMENSATION_LOAD_DTS = COMP.SAT_LOAD_DTS)
So, if you are loading the PIT table every day, with no gaps, no more subqueries and no more MAX(LOAD_DTS) clauses. This is much easier to read, understand, and tune! (Yes – you should add outer joins if some of the dates in the PIT will be NULL.)
If there are gaps in the load dates of the PIT tables, then the query does need a modification to add one subquery:
SELECT H.HUB_EMPLOYEE_KEY, H.EMPLOYEE_ID, S.FIRST_NAME, S.LAST_NAME, S.HIRE_DATE, C.SALARY, … FROM HUB_EMPLOYEE H JOIN SAT_EMPLOYEE_NAME NM ON H. HUB_EMPLOYEE_KEY = NM. HUB_EMPLOYEE_KEY JOIN SAT_EMPLOYEE_COMPENSATION COMP ON H. HUB_EMPLOYEE_KEY = COMP. HUB_EMPLOYEE_KEY JOIN SAT_EMPLOYEE_ADDRESS ADDR ON H. HUB_EMPLOYEE_KEY = ADDR. HUB_EMPLOYEE_KEY JOIN SAT_EMPLOYEE_CONTACTS CNTC ON H. HUB_EMPLOYEE_KEY = CNTC. HUB_EMPLOYEE_KEY JOIN EMPLOYEE_PIT P ON H. HUB_EMPLOYEE_KEY = P. HUB_EMPLOYEE_KEY AND P.PIT_LOAD_DTS = (SELECT MAX( P2.PIT_LOAD_DTS) FROM EMPLOYEE_PIT P2 WHERE H. HUB_EMPLOYEE_KEY = P2. HUB_EMPLOYEE_KEY AND P2.PIT_LOAD_DTS <= ’15-FEB-2015’) AND P.NAME_LOAD_DTS = NM.SAT_LOAD_DTS AND P.ADDRESS_LOAD_DTS = ADDR.SAT_LOAD_DTS AND P.CONTACT_LOAD_DTS = CNTC.SAT_LOAD_DTS AND P.COMENSATION_LOAD_DTS = COMP.SAT_LOAD_DTS)
Even though we need a subquery, one subquery will process faster than the four subqueries needed if there is no PIT table.
Like a Point-in-Time table, a Bridge table also makes it easier to query the Data Vault and helps performance at the same time. In the case of a Bridge table, it is used to help with joins that involve multiple Hubs and Links. It is really a derived Link table (linking multiple Hubs and Links together).
As you will see, some of the query performance will come by reducing the number of tables required to get the answer we are looking for. As an example using our HR Data Vault model, suppose you wanted to get the full address of a
LOCATION with its
Figure 3 – Location Geography
The query requires an eight (8) table join:
SELECT H1.LOCATION_NAME, S1.STREET_ADDRESS, S1.CITY, S1.STATE_PROVINCE, S1.POSTAL_CODE, S2.COUNTRY_NAME, H3.REGION_NAME, S3.REGION_ID FROM HUB_LOCATION H1 JOIN SAT_LOCATIONS S1 ON H1.HUB_LOCATION_KEY = S1.HUB_LOCATION_KEY JOIN LNK_LOCATION_COUNTRY L1 ON H1.HUB_LOCATION_KEY = L1.HUB_LOCATION_KEY JOIN HUB_COUNTRY H2 ON L1.HUB_COUNTRY_KEY = H2.HUB_COUNTRY_KEY JOIN SAT_COUNTRIES S2 ON H2.HUB_COUNTRY_KEY = S2.HUB_COUNTRY_KEY JOIN LNK_REGION_COUNTRY L2 ON H2.HUB_COUNTRY_KEY = L2.HUB_COUNTRY_KEY JOIN HUB_REGION H3 ON L2.HUB_REGION_KEY = H3.HUB_REGION_KEY JOIN SAT_REGIONS S3 ON H3.HUB_REGION_KEY = S3.HUB_REGION_KEY
So to improve on this query (which is one the business may need on a regular basis), we can build a Bridge table in the Business Vault. The Bridge table will hold all the keys needed to get all the data from the right Satellites plus the text Business Keys for the Hubs (as they are often needed in the queries too).
(Note in Figure 4, I have added 3 invalid FKs in order to illustrate the new join path.)
Figure 4 – Geographic Bridge Table
By adding a derived Bridge table, we can do the same query with only four (4) tables.
SELECT BRDG.LOCATION_NAME, S1.STREET_ADDRESS, S1.CITY, S1.STATE_PROVINCE, S1.POSTAL_CODE, S2.COUNTRY_NAME, BRDG.REGION_NAME, S3.REGION_ID FROM BRIDGE_GEOGRAPHY BRDG JOIN SAT_LOCATIONS S1 ON BRDG.HUB_LOCATION_KEY = S1.HUB_LOCATION_KEY JOIN SAT_COUNTRIES S2 ON BRDG.HUB_COUNTRY_KEY = S2.HUB_COUNTRY_KEY JOIN SAT_REGIONS S3 ON BRDG.HUB_REGION_KEY = S3.HUB_REGION_KEY
To truly complete this picture, you may want to add a PIT table on the
BRIDGE_GEOGRAPHY to record the LOAD_DTS changes in the three associated Sats.
Bridge Tables in DV 2.0
With the addition of the denormalized text Business Keys in the Link tables (mentioned in the previous article in this series), Link tables now look much like the Bridge table. Notice in
COUNTRY_ABBR are in the table along with the Hub Keys. That means in simple cases, you can simply use the Link as a Bridge table.
If you are joining across multiple Links and even to Satellites on Links, you may still want to build a Bridge table to help with the query. As with many things in data warehousing, it depends on your reporting requirements and the complexity of the resulting query.
Add Bridge tables to your Business Vault only after you have determined you have a query performance issue.
There is one other main category of objects that you may need to add to your Business Vault – tables containing predefined calculations or derivations. These tables should be structured to look like Links or Sats attached to existing Hubs or Links in the Raw Vault.
One of the goals for flexibility in the Data Vault architecture is to be able to build and re-build Information Marts quickly and also to insure that the data in multiple marts is consistently calculated.
So in the case of derivations or calculations that will be used by many people in many contexts, it is often best to define that data once in the Business Vault then source the Information Marts from there.
As an example, imagine the HR department wants to be able to quickly report the minimum, maximum, and average annual salary, and total compensation to date of an employee over their entire history with the company. To make sure that these are calculated the same in every mart, we might add a salary summary Satellite to
Figure 5 – Hub Employee with Derived Satellite
In Figure 5, you can see I added a new Sat called
SAT_EMPLOYEE_SAL_SUM. It contains calculated columns
SAT_LOAD_DTS column will tell us when the calculation was last run. No
HASH_DIFF column or
REC_SRC column are needed because this table contains calculated data and is populated on demand.
It is up to you and the business requirements as to how often you rerun the aggregation. It may be wise to trigger the re-aggregation anytime
SAT_EMPLOYEE_COMPENSATION is updated for a particular employee, or maybe just run it once a quarter?
Again, since this is a Business Vault feature, it should be driven by the business requirements.
Notice also that I added
SUM_SAL_LOAD_DTS to the
EMPLOYEE_PIT table. This will make it easier to query the employee’s current NAME, SALARY, and other information alongside the summary information.
TIP: When you add new Satellites to your Raw Vault, be sure to check for PIT tables that need updating as well.
Final Data model
Next time I will show you how we use the Raw Vault and the Business Vault to build Information Marts.
Kent Graziano, CDVM2
The Data Warrior