Bi-temporal data in the Data Vault: Inscription Time

As an example: how to solve the posting-date issue in banks

A challenge we come across often when working with Banking and Insurance clients is the data, which has already one or multiple timelines attached to it.

I like to differentiate here three main timelines for me. I know, in the insurance business there might be 6 or more timelines, but let’s start with the “basics”.

As we are working primarily with IT systems as the source of our data: A timeline that always exists is the inscription timeline. For me, that is the time when information was captured in the source system.

And for some data, that is it. It will never be transferred to a DWH and it doesn’t make any claims about the past or the future. Super cool: you’re done.

But in our case, this data is transferred to your DWH. Lets a assume you are using Data Vault as your design paradigm but it would apply to all other time-aware modeling techniques also (no Data Mesh is not a modeling but an organizational paradigm). Loading the data in your hubs, links and satellites add a so-called “Load Date” (which is funny enough, hopefully, a timestamp). Pay attention that I don’t use “valid from” or “valid to” in this context as it makes people mix up this timeline with business validity.

If you load this data with a weekly, daily, or even hourly cadence and take only the latest information from your source system, the whole thing stays simple as you are adding only one more timeline. Your Data Vault automation keeps track of everything, hopefully, presenting the information as-of-now and as-of-then on request, and you are happy.

Honestly speaking: we do have a lot of clients that have no more than this level of complexity, and they should stop reading now and be happy. Still, I recommend: if not done yet, please read my series about temporality in the Data Vault for such use cases as even the as-of-then timeline might be already too much for you.

As well if you have a business validity captured within your source system, reading my previous blog posts is sensible to learn how to store this information by extending the hub key.

Assuming the inscription timeline is very relevant for you please ignore the advice in my previous blog entries that inscription time can be ignored if the load time is just near enough.

Examples in which cases inscription time is relevant:

· You get changes in a very high cadence, and your DWH loads happen less often, and all these changes are relevant for you – either having business value or are required to track for auditability – most probably, your source of information will be a Change Data Capture (CDC) stream

· Your source is not the source system itself but some form of export, and it happens that this kind of export contains errors, incomplete data, or even both, and data is delivered, at least in certain circumstances.

· The inscription time is the more accurate timeline for your reporting than your load time – like posting time (also referred to by Tages-End-Vearbeitung (TEV), COB, EOB) based data which represents the “true” history

Let’s frame it differently: if inscription time is relevant for you – you usually will know it.

Solution-Approach: make the inscription time irrelevant

If you reduce the time between incsription in the source system and the load into the Data Vault you can ignore Inscription Time it almost any case and rely on Load Time IF your data arrives in order. In the caes of a Kafka stream as example it is not. If data arrives out-of-order this approach would generate completely wrong output.

Solution-Approach 1: Create a Persistent Staging Layer

Using your default Data Vault patterns, you can take your change key, which might be the technical key plus the inscription time or a change key delivered by the CDC stream, and first store all changes in your persistent staging area (PSA). Then create the latest view on top of it and load that into the Raw Vault. This will make you fully auditable as you keep all the information in the PSA.

The limitation is that the Raw Vault picture is only updated at your load intervals if you don’t create a custom load pattern. This means if changes happen in the seconds or microseconds area, they are stored in the PSA but not in the Raw Vault if they don’t create loops to load every and each record to the vault. This might be sufficient for most applications if your source is a CDC stream. It would be equivalent to just archiving your CDC stream together with your vault.

The real issue is that even creating loops to load the vault: if you get corrected or out-of-order values for the past you can’t sort them in into your Raw Vault satellite in the right location. So, if there is only a slight chance for such a scenario and it needs to be captured that is a deal breaker. And for a long time, I was convinced that at least for proper CDC streams like Attunity/Qlik Replicate or Golden Gate, that is a non-issue, but along came Kafka…

Solution-Approach 2: Inscription time as Load Time

I’m always for reducing complexity, so why not think about loading your inscription time just in your load time columns? Sounds like a good idea. Simple to implement. You don’t need to change the structures of your Data Vault tables and all the tools on top continue working.

In theory, this could be a good solution if you assume, that the data you receive is always correct and will never be corrected afterward. This is a really bad assumption:

In the case of exports delivered to you, there will always be corrections.

Even CDC streams: if they are interrupted, you might sync the status of the base table with your currently loaded data.

· Posting date information is as well delivered out of sync sometimes

And as soon you receive different information for the same inscription time, you would need to delete the already loaded information. Or move it to some archive table. Or do some other ugly stuff.

In short: even if it could be a solution in a very limited scenario, the chance is high that sooner or later, you run into cases where the information you store is, in the best case, incomplete or wrong in the worst one.

Still: if you understand all these problems, it could solve your problem. But as we need to provide general solutions in the Datavault Builder, we have removed this option after a few weeks of beta testing as it failed in all real-world tests. And if I say all: I mean even the ones where we tough, that will work for sure. Usually within a few days. Sometimes within hours.

Solution-Approach 3: Bi-Temporal Satellites

The proper approach is to use a multi-active satellite by extending the key in the satellite to include hash-key (HK), load-time, and inscription time.

When loading data, we now compare the HK and inscription time of the staged data against the data in the vault, and if the attributes are different or the entry is missing, we just insert it with the current timestamp as load time.

The advantage is: we can still use the insert-only pattern to load the satellite.

Disadvantage: we do have now a bi-temporal history in it. If something arrives out-of-order that is fine. We will compose the current knowledge about history only at reading time.

But that is a problem while querying that data. But this can be solved using analytical functions to partition either by “hash & inscription time” or only the “hash.”

The downside is that the performance can become bad if you join different queries, each using an analytical function to retrieve the correct entry. But as Markus Winand says: “Use the index Luke!”. That is what a PIT table does.

But even creating a PIT, there is the downside that the link is now defined by hash, inscription time, and load time.

Happily, Patrick Cuba wrote an interesting article on how to use sequences to create more compact PIT tables and for faster joins to the objects by using a sequence column in the satellites.

Does the solution fit into the DV2.0 framework? I would claim: yes. The bi-temporal satellite is a special form of the multi-active satellite. But compared to a regular multi-active satellite, we do know the “sub-partition key” out-of-the-box, and that allows us as well to use delta loading.
The part about using a sequence in the satellite for indexing is, for me, a technology-dependent implementation decision and doesn’t change anything about the logic of the approach. (just replacing three columns with one short column for join speed).

Does it solve all the challenges that might make inscription time loading necessary: as much as we thought: yes. But again reality has taught us that there is more to it.

Compression

You could assume that a CDC stream only delivers you relevant changes. In reality that is not always the case. But even if that is the case: as soon as you start spliting the data into different satellites there will be CDC rows which don’t cause a change in any given satellite. So we need to appy compression. At a first glance it looks like we can apply the same pattern as for normal satellites:

Out-of-Order arrival combined with compression

But as soon you get data redelivered for an older inscription time. This can cause severe problems if you ignore data rows for certain satellites as they didn’t contained any new information. But if later an information is delivered for an older inscription time it could cause wrong data. That is the reason why we need to keep track of all information that we compressed in the tracking satellite.

Now if a information is delivered out of order we can reconstruct from the satellite and the tracking satellite the full history and insert the right information.

Here an example were at t0,t1 and t2 the value “a” was delivered and compressed. Now at a later load time for inscription time t1 the value “c” is delivered. This value will be inserted at inscription t1 into the satellite AND also at inscription time t2 the value “a” which was compressed before needs to be inserted now as it is different from value “c” at t1.

For our German spearkers we do have a video available: https://www.youtube.com/watch?v=1KXL_XWMhbk

What now? Do I need to take care of inscription time?

But Petr, you were always preaching to ignore inscription times; why are you now writing about how to store and evaluate them?

First: I was always specifying under which conditions you can ignore the inscription time.

Second: Still, most of you should ignore, in most cases, the inscription time.

Third: I started my data career in banking, so I do understand the necessity for such patterns. Still, I also understood how complex handling multiple timelines can be, so I tried always to omit it if there is no business value and will continue to do so.

But now, with a growing customer base for our Datavault Builder software, we got more and more requests for a general solution and after long design and prototype sessions in the past months, we concluded that only the bi-temporal satellite is solving the issue properly and that is what we have done and have released in the 6.3 version of the Datavault Builder.

Kudos to Patrick Cuba (Snowpit) and Thomas Herzog (general idea of bi-temporal satellites as special type of multi-active satellites) and Dirk Lerner (all his documents on Multi-Temporality).