Temporality in the Data Warehouse Part 1 of 4: Definition of the challenge
In the past years, I was confronted with the demand to create a reporting with SCD type 2 dimensions. But since there are usually 3 or more time lines in the data warehouse, I had to clarify again and again what the business requirement actually is. Here I try to sort my thoughts and give you a guideline for such discussions.
The article is divided into three parts:
– Definition of the different timelines
– Creation of easy to use reports by reducing the timelines included in the DWH output
– Which use cases makes it necessary to output the different timelines
– Technical solutions for the requirement to output entire time surfaces. What is a time slice. What is SCD Type 2. What is the relationship to as-is, as-was, as-of
In a simple world there is only this here and now: The present. If I want to record the development over time I need a timeline.
Which is a time dimension (time line). What is a time slice? What is SCD Type 2? What is the relationship to as-is, as-was, as-of
In a simple world there is only this here and now. So the present. If I want to record the development over time I need a time line.
I can transform a timeline into a point in time if I cut it at any place.
But if I now go there and correct my timeline, I can enter the times of change again on a separate timeline.
If I now record these two time bars at a 90-degree angle to each other. I get a time surface.
I can reduce this time surface at any time to a time bar again if I cut my time surface at any point either parallel to one or the other bar.
So I can reduce a time surface to a time line and this again to a point in time.
Such cuts are so-called as-of views. There is a special cut, namely the one at the currently valid time (which is not always the end of a time line): as-is. And another special cut which represents the combination of information at the time of an event: as-was. That means no matter if I make an as-is, as-was or as-of cut, the resulting simplification of my surface to the line or from the line to the point is the same.
In this simplified example, I can choose the business validity as a timeline and when this information was recorded in an IT system as a second time line. Thus we have a time surface.
In data warehousing, however, a third time bar is added in each case: when did the data warehouse learn of new or changed information. If you now draw this third timeline perpendicular to the existing time area, you get a cube, to put it simply. This is the reason why I like to talk about time dimensions. Because it is difficult to imagine more than 3 dimensions, I will not go into any other time lines beyond the three already mentioned ones.
In order to avoid the different problems with the different terminologies, I like the approach (because the different time lines, if they can be understood as dimensions arranged in a cube) are addressed with dimension numbers.
- 1d: when was an entry from a business perspective valid. According to Snodgrass quoted by Kaul “capturing the history of a changing reality”: Valid Time.
- 2d: Capturing time / System Time / Logged Time / Inscription Time / (((Transaction Time))) / Approximation for Assertion Time in the source IT system. Assertion Time according to Johnston quoted by Kaul: “the time during which a row of data is asserted to make a true statement”.
- 3d: Acquisition Time / System Time / Logged Time/ Load Time / Inscription Time / (((Transaction Time))) / Approximation Assertion Time in the DWH system. Assertion Time according to Johnston quoted by Kaul: “the time during which a row of data is asserted to make a true statement”.
A time slice is a section on a time line in which the attributes for a specific and identifiable object do not change. A time slice can be explicitly described with a start and an end date, or implicitly a time slice can be considered as finished by definition as soon as a new one begins. In this case, all unfinished time slices are valid for all eternity (or from technical reality up to the highest supported date of a database).
It is usual to define a start and an end time for the effective 1d validity, since there can also be times in which a business object is not valid (for example, a terminated contract). The contract can also be reactivated later with the same attributes, even though it has been dormant in the meantime.
With the 2d and 3d timeline, on the other hand, it is assumed that the time of a new entry automatically means the end of the previous one and, conversely, only a new entry can end the validity of the old one. This means that you do not have to save the end time.
Even though Christian Kaul has compiled very well how the different time lines can be called (https://www.linkedin.com/pulse/all-time-world-christian-kaul/) there seems to be no general consensus. The lowest common denominator could still be that the first timeline is the Valid Time (even though some people confusingly refer to the DWH’s Load Time as VALID_FROM and VALID_TO)
Here are my personal favorites (as of today 20 February 2020):
1d: Valid Time
2d: Inscription Time
3d: Load Time
The point is, we don’t have to agree on the names at all, as soon as everyone knows which timeline means what and we have a nameless shortcut through the 1d,2d and 3d naming, simply in the order of the times as they are usually created / recorded.
Now what does SCD Type 2 mean?
SCD type 2 is a procedure to document a new time line from information received at different times.
Usually the term is used as a synonym for the creation of the 3d time line in the data warehouse. The loading time into the DWH is usually used to start a new time slice. Implicitly the load time of the next data set for the same key is used to end the assertion period of the previous entry.
This method has the big advantage that all changes in the source system can be captured and output for each assertion time, what was considered true at that time.
The problem is that there are data warehouse implementations and also modelers who model DWHs and who don’t even ask themselves: which timeline do we actually represent here? Is this timeline even suitable to fulfill the requirements of our reporting? Do we even need this timeline for our evaluations? Are the recorded changes only corrections or do we really record changes in the real world?
From my personal point of view, this lack of analysis of the business requirements leads to ill-considered implementations in which all dimensions are output in SCD Type 2 form in all reports. This lack of understanding is also the reason for the bad habit of using the terms VALID_FROM and VALID_TO for this 3d timeline (fortunately Dan Linstedt established the term Load Time for the Data Vault world).
In the further course of this series, I will try to explain that in certain cases this is not only superfluous but can also be wrong in some cases. At the same time, I would like to offer alternative solutions on how to deal with time in the Data Vault (and Data Warehouse) in general.