As an example: how to solve the posting-date issue in banks
A challenge we often come across when working with banking and insurance clients is data that already has one or multiple timelines attached to it.
I like to differentiate between three main timelines. I know, in the insurance business there might be six or more timelines, but let’s start with the “basics”.
As we work primarily with IT systems as the source of our data, one timeline always exists: the inscription timeline. For me, that is the time when the information was captured in the source system.
And for some data, that is it. It will never be transferred to a DWH and it doesn’t make any claims about the past or the future. Super cool: you’re done.
But in our case, this data is transferred to your DWH. Let’s assume you are using Data Vault as your design paradigm, though the same applies to all other time-aware modeling techniques (and no, Data Mesh is not a modeling paradigm but an organizational one). Loading the data into your hubs, links, and satellites adds a so-called “Load Date” (which is, funnily enough, hopefully a timestamp). Note that I don’t use “valid from” or “valid to” in this context, as that makes people mix up this timeline with business validity.
If you load this data with a weekly, daily, or even hourly cadence and take only the latest information from your source system, the whole thing stays simple as you are adding only one more timeline. Your Data Vault automation keeps track of everything, hopefully, presenting the information as-of-now and as-of-then on request, and you are happy.
Honestly speaking: we have a lot of clients with no more than this level of complexity, and they should stop reading now and be happy. Still, I recommend: if you haven’t done so yet, please read my series about temporality in Data Vault for such use cases, as even the as-of-then timeline might already be too much for you.
Likewise, if you have business validity captured within your source system, reading my previous blog posts is sensible to learn how to store this information by extending the hub key.
If the inscription timeline is highly relevant for you, please ignore the advice in my previous blog entries that the inscription time can be ignored as long as the load time is close enough.
Examples of cases in which the inscription time is relevant:
· You get changes at a very high cadence, your DWH loads happen less often, and all these changes are relevant for you – either because they have business value or because you must track them for auditability. Most probably, your source of information will be a Change Data Capture (CDC) stream.
· Your source is not the source system itself but some form of export, and it happens that this export contains errors, incomplete data, or both – so that corrected data is delivered later, at least in certain circumstances.
· The inscription time is a more accurate timeline for your reporting than your load time – like posting-time-based data (also referred to as Tages-End-Verarbeitung (TEV), COB, or EOB), which represents the “true” history.
Let’s frame it differently: if the inscription time is relevant for you, you will usually know it.
Using your default Data Vault patterns, you can take your change key, which might be the technical key plus the inscription time or a change key delivered by the CDC stream, and first store all changes in your persistent staging area (PSA). Then create the latest view on top of it and load that into the Raw Vault. This will make you fully auditable as you keep all the information in the PSA.
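The “latest view on top of the PSA” can be sketched as follows – a minimal Python sketch, where the row layout, key names, and sample values are purely hypothetical and the change key is the technical key plus the inscription time:

```python
from datetime import datetime

# Hypothetical PSA rows: (technical_key, inscription_time, payload).
# All changes are kept here for auditability.
psa_rows = [
    ("4711", datetime(2024, 1, 1, 9, 0), {"balance": 100}),
    ("4711", datetime(2024, 1, 2, 9, 0), {"balance": 120}),
    ("4712", datetime(2024, 1, 1, 9, 0), {"balance": 50}),
]

def latest_view(rows):
    """Reduce the PSA to the most recent inscription per technical key --
    this is what gets loaded into the Raw Vault at each interval."""
    latest = {}
    for key, ins_ts, payload in rows:
        if key not in latest or ins_ts > latest[key][0]:
            latest[key] = (ins_ts, payload)
    return latest

current = latest_view(psa_rows)
# key "4711" now carries only its latest state (inscription Jan 2nd)
```

The intermediate states of “4711” stay in the PSA; only the newest state per key reaches the vault at load time.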
The limitation is that, unless you create a custom load pattern, the Raw Vault picture is only updated at your load intervals. This means that changes happening within seconds or microseconds of each other are stored in the PSA but not in the Raw Vault, unless you build loops that load each and every record into the vault. This might be sufficient for most applications if your source is a CDC stream; it would be equivalent to just archiving your CDC stream alongside your vault.
The real issue remains even when looping to load the vault: if you receive corrected or out-of-order values for the past, you can’t sort them into your Raw Vault satellite at the right location. So if there is even a slight chance of such a scenario and it needs to be captured, that is a deal breaker. For a long time, I was convinced that at least for proper CDC streams like Attunity/Qlik Replicate or GoldenGate this was a non-issue, but along came Kafka…
I’m always in favor of reducing complexity, so why not just load your inscription time into your load time columns? Sounds like a good idea: simple to implement, no changes to the structures of your Data Vault tables, and all the tools on top continue working.
In theory, this could be a good solution if you assume that the data you receive is always correct and will never be corrected afterward. That is a really bad assumption:
· In the case of exports delivered to you, there will always be corrections.
· Even with CDC streams: if they are interrupted, you might have to sync the state of the base table with your currently loaded data.
· Posting date information is sometimes delivered out of sync as well.
And as soon as you receive different information for the same inscription time, you would need to delete the already loaded information. Or move it to some archive table. Or do some other ugly stuff.
In short: even if this could work in a very limited scenario, chances are high that sooner or later you run into cases where the information you store is incomplete in the best case and wrong in the worst.
Still: if you understand all these problems, it could solve your problem. But as we need to provide general solutions in the Datavault Builder, we removed this option after a few weeks of beta testing, as it failed in all real-world tests. And when I say all, I mean even the ones where we thought it would work for sure. Usually it failed within a few days. Sometimes within hours.
The proper approach is to use a multi-active satellite, extending the key in the satellite to include the hash key (HK), the load time, and the inscription time.
When loading data, we now compare the HK and inscription time of the staged data against the data in the vault; if the attributes differ or the entry is missing, we simply insert it with the current timestamp as the load time.
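This load step can be sketched as follows – a minimal Python model, where the in-memory tuples stand in for satellite rows and all names and sample values are hypothetical:

```python
from datetime import datetime, timezone

# Hypothetical satellite rows: (hash_key, load_time, inscription_time, attrs).
# The satellite key is HK + load time + inscription time.
satellite = [
    ("hk1", datetime(2024, 1, 1, tzinfo=timezone.utc),
     datetime(2023, 12, 31, tzinfo=timezone.utc), {"balance": 100}),
]

def load_satellite(satellite, staged, load_time):
    """Insert-only load: append a staged row if its (HK, inscription time)
    is unknown, or if the newest known attributes for it differ."""
    for hk, ins_ts, attrs in staged:
        versions = [r for r in satellite if r[0] == hk and r[2] == ins_ts]
        newest = max(versions, key=lambda r: r[1]) if versions else None
        if newest is None or newest[3] != attrs:
            satellite.append((hk, load_time, ins_ts, attrs))

staged = [
    # A correction for an already known inscription time ...
    ("hk1", datetime(2023, 12, 31, tzinfo=timezone.utc), {"balance": 110}),
    # ... and a brand-new inscription time.
    ("hk1", datetime(2024, 1, 1, tzinfo=timezone.utc), {"balance": 120}),
]
load_satellite(satellite, staged,
               load_time=datetime(2024, 1, 2, tzinfo=timezone.utc))
# the correction is appended as a new row, never updated in place
```

Note that the correction does not overwrite anything: it becomes a new row with a newer load time, which is exactly what keeps the pattern insert-only and auditable.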
The advantage: we can still use the insert-only pattern to load the satellite.
The disadvantage: we now have a bi-temporal history in it. If something arrives out of order, that is fine – we compose the current knowledge about history only at reading time.
That, however, makes querying the data harder. It can be solved using analytical functions that partition either by “hash & inscription time” or by the “hash” alone.
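The read-time resolution can be illustrated in plain Python – a sketch of the analytical-function idea, with hypothetical row layout and values, equivalent to keeping only the rows where `ROW_NUMBER() OVER (PARTITION BY hk, inscription_time ORDER BY load_time DESC) = 1`:

```python
from itertools import groupby

# Bi-temporal satellite rows: (hash_key, load_time, inscription_time, attrs)
rows = [
    ("hk1", "2024-01-01", "2023-12-31", {"balance": 100}),
    ("hk1", "2024-01-02", "2023-12-31", {"balance": 110}),  # late correction
    ("hk1", "2024-01-02", "2024-01-01", {"balance": 120}),
]

def current_history(rows):
    """Per (hash key, inscription time) partition, keep the latest-loaded
    row -- the current knowledge about the inscription-time history."""
    keyfn = lambda r: (r[0], r[2])            # partition by HK + inscription time
    result = []
    for _, group in groupby(sorted(rows, key=keyfn), key=keyfn):
        result.append(max(group, key=lambda r: r[1]))  # newest load time wins
    return sorted(result, key=lambda r: r[2])  # order by inscription time

history = current_history(rows)
# the correction (balance 110) supersedes the original row for 2023-12-31
```

Partitioning by the hash alone instead would collapse the history further, down to the single latest-known state per key.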
The downside is that performance can degrade if you join several queries that each use an analytical function to retrieve the correct entry. But as Markus Winand says: “Use the index, Luke!” That is what a PIT table does.
But even when creating a PIT table, there is the downside that the link is now defined by hash, inscription time, and load time.