In Data Vault Modeling, we use hubs to integrate data. Everybody knows that. This is one of the main reasons we choose Data Vault modeling: to integrate data from various data sources. But what does “integration” mean? So far, I haven’t seen a discussion on how we actually integrate data.
In this article, I will simplify my examples, knowing that reality might be more complex than presented here. However, for establishing the main concept, it should help to keep it simple.
In discussions with Carsten Schweiger, we have identified two different patterns: Integration and Unification.
This distinction is important as it changes how we can optimally process the data. Let’s characterize the two options:
Also called passive integration: we have two different source systems with the same business key. An example I often see is the Product with the Product Number, which is shared more or less successfully between different systems.
If we want to compare the list price in Switzerland and Japan, we can load the Swiss ERP data and the Japanese ERP data into the Product Hub. The Product Number is loaded without any prefix into the hub, and the attributes are loaded into a Japanese and a Swiss Satellite.
If we now create an output interface on top of the Data Vault (in the Datavault Builder, we call this a “Business Object”), we will select all the keys from the hub. We can then put the Swiss list price into one column and the Japanese list price into another column.
If the Swiss list price is the master but there are some Japan-exclusive products, we could create a third column with some prioritization rules like: if the Swiss price is filled, take that one; if the Swiss list price is empty, select the Japanese one.
This is the use case we, at least I, usually have in mind when we start a Data Vault project.
So why do we need to differentiate Unification? It differs in that the data sets coming from two different data sources are non-overlapping / exclusive.
A common example is the Order Hub. The orders from Switzerland and Japan are unified in this hub, but Order Number 101 can exist in both Switzerland and Japan, though it is not the same order by business definition.
This means that
The good thing is that usually the big data sets are Unifying and not Integrating data sets.
If we accept a general category of Unifying Hubs as its own pattern:
This allows faster preparation of the full data set but allows also filtering on separate systems and push down as well filters on specific satellite columns.
By differentiating between Integration and Unification, we can optimize the patterns for entities with large amounts of data. Additionally, expressing this information in our data model clarifies the purpose of the data integration pipeline and helps determine which objects can be joined for meaningful output.
You can see in the following video how these two use cases are configured fully automatically in the Datavault Builder:
2150 Datavault Builder AG
Thurgauerstrasse 60
CH-8050 Zurich
Switzerland