Unterfeldstrasse 18, 8050 Zurich
+41 32 511 27 89

On Multi-Active Satellites in Data Vault

In this article I describe why you could use multi active satellites in your Data Vault, why you probably shouldn't in most cases and how you can do it if you have to: use a document satellite design to prevent granularity issues down the road.

Petr Beles, 2150 GmbH, https://datavault-builder.com
reviewed by Torsten Glunde http://alligator-company.com/

2018-11-07

Implementing Multi-Active Satellites as Document Satellites

In the Data Vault standard there is a pattern called Multi-Active Satellite. This is also partly used to store Bi-Temporality in the Data Vault.

A business key in a hub with a certain granularity is assigned more than one valid entry in the satellite for a point in time (according to the DWH knowledge or assertion time).

Let me try to explain why in certain cases Multi-Active Satellites are being used, when one should better do without them and how one could implement them if one decided to use them anyway.

Reasons to use Multi-Active Satellites

  1. There is really no key to distinguish the different lines for the same business key

In reality, in my experience this is rarely the case and the data is often misinterpreted: even with a customer with different unclassified telephone numbers, each number can be uniquely identified: via the number.

Also: If there is no identifier, technical duplicates cannot be recognized in the source interface. Therefore, from my point of view, the source system should be urged if possible, to adapt the data delivery.

  1. For performance reasons

If a unique subkey is given, you can alternatively define a further hub on the finer granularity and connect it cleanly via a link as an alternative to a multi-active satellite. This allows you to use standard patterns for filling and querying, but you have to load and query two additional objects (although the additional hub in some databases should not affect the queries by using table elimination optimization).

Original Multi-Active Satellite Design

Reasons why you don’t want to use Multi-Active Satellites

A first challenge is that a multi-active satellite can only be filled with delta loads if all lines for a hub key are present in the delta extract. Alternatively, full loads are always possible.

A further challenge shows when querying the multi-active satellite together with the hub because the different granularity causes duplication of the hub records (also known as fanning-out) – even when queried for a given assertion time. This leads to increased complexity in the downstream layers.

Last but not least think about what happens if you want to create a PIT for two or more multi-active satellites on the same hub. I assume you will end in cross-product hell for ever 😊 .

But the most important argument in my opinion is, if you want to link to the finer granularity, you should create your own hub from it. For example consider insurance contracts: the instance of an insurance contract is definitely an independent object to which claims can be linked. Here I would personally always model a contract hub and a contract instance hub instead of a multi-active satellite.

Solution proposal for the implementation: Document Satellite

Let’s assume one has a valid case to implement multi-active satellite at hand – what would a good load pattern look like?

At the meeting of the German Data Vault User Group (DDVUG) in Hamburg in October 2018, Torsten Glunde, Andreas Heitmann, Matthias Müller, somembody I missed the name and me discussed a solution that would solve some of the implementation problems:

Assuming all active rows for a business key must be present at once in staging anyhow (either by using a full load or by using a delta designed accordingly), then you can create a set from these rows.

There are databases that allow native arrays as field types which can be used to store sets. More feasible nowadays might be a JSON document, which is already supported on most platforms. Besides hierarchies a JSON document can also represent sets.

Proposed document satellite design.

So, if you store all lines in a JSON document in one satellite record, the granularity of hub and satellite at a given assertion time will match again. Thus, one can use the same, standard load patterns for all satellites.

Additionally for downstream processing, there is no complexity increase neither in PIT creation nor other reading layers. Unless and only if you have the business requirement to resolve the set into single lines (meaning parsing the JSON document in our case) you can choose to do so. However, this happens consciously at the place where this is desired. This can be done in the presentation layer or the report where fanning-out is desired.

JSON documents in databases

With growing popularity of JSON documents there is also more and more support for handling in several databases like the binary compare in PostgreSQL. Using this feature the irrelevant differences in the document are ignored and a new satellite record only gets inserted if something has effectively changed in the contents. In all cases you need to sort your tuples when creating an array as json arrays are ordered.

On the other hand, if your database only supports text based JSON comparison, you would have to take active measures to ignore irrelevant differences by using some kind of prettify.

Epilog

Using JSON in satellites is not new, of course, and can also be used in other cases: Dan Linstedt himself has described the use of JSON documents for the processing of dynamic source data here: https://danlinstedt.com/allposts/datavaultcat/datavault-2-0-supports-dynamic-data-warehousing/

Wednesday November 7th, 2018