Kimball Style dimensions – SCD Type 2 Output
If you haven’t read them I recommend reading the first 3 parts first.
After writing a nine-page long story about the topic including history, modeling, and performance consideration I came to the point where I understood that it is too much to talk about. So, I tried to reduce this to the minimum with a few bullet points.
In the case, you want to omit all this complexity described in this article you might consider using the Datavault Builder DWH automation tool which does all of this for you.
If I write PIT tables below: you could use views instead – the same result just with worse query performance but without the necessity to load the tables.
If I write dimension below the same idea can be applied to facts also.
Creating PIT and PIT+
If you still want to implement this yourself our findings in short:
- Check the 3 first parts of this blog series to be sure that you need SCD type 2 output. It makes reports more complicated as they need to be if there is no real business value of doing so.
- PIT tables are not an end in itself. The SCD type 2 dimension is. This means you shouldn’t ask if somebody or a tool is able to generate PIT tables, but if he/it can create the SCD type 2 dimension as you need it.
- PIT tables can be a means to make SCD type 2 output more performant
- If you create PIT tables do it only for the grains (hubs) that you need for SCD type 2 output
- If you create PIT tables just include time slices for the elements (i.e. satellites and links) involved in your SCD type 2 dimension output. This means you need to collect the business requirements first.
- Some people advocate creating PIT tables as snapshots. This can be a useful option if you only want to access month-end data or other defined time slices.
Since we want to be more flexible, we create continuous SCD type 2 PIT tables, since the snapshots can be extracted from the continuous ones very easily, but not the other way around.
Visual selection of attributes in the Datavault Builder
- If you don’t know which columns you shall output in the SCD type 2 dimension and therefore can’t derive which satellites to include in the PIT table go back to the point where you want to omit the complexity
- If you create PIT tables include also target hashes and history of many-to-one and of one-to-one links. You will need them to join PIT tables together. I call this PIT+
Include satellites and linked hash to the PIT for many-to-one and one-to-one relations
- If you don’t know anything about the expected cardinality of your links in your model you are in trouble. Go back to the section where you want to omit this kind of complexity.
- If based on the DWH timeline (3d: Load Time) you can load PIT tables delta by adding just new rows as the Load Time timeline is always going forward
- Adding Ghost records to satellites to improve PIT query performance wasn’t showing any benefit compared to just left join the satellites on the database platforms we tested (your mileage may vary – we tested Snowflake, Exasol, Oracle, MSSQL)*
- If an SCD type 2 dimension needs columns from more than one grain (satellites connected to different hubs) use a temporal join to cross out always a pair of hubs resulting in one combined timeline
If I didn’t do any mistake the operations are commutative: (((A B) C) D) = (((A D) B) C) ).
- If you want to know how to join PIT tables together just drop me an email or LinkedIn message
- At the end join all necessary satellites to this combined history table to get the necessary attributes
- If you want, you can remove unnecessary time slices. This makes the number of rows to be transferred to the reporting tool smaller but as well without removing the unnecessary time slices the output is perfectly correct.
And that is it. If you managed to do it: congratulations. Just as a short recap:
The next question is how to combine SCD type 1 or 2 facts with SCD type 2 dimensions. But I think I leave this topic for another blog article after you digested the content of this one.
* if somebody has executed any performance tests which indicate different results please feel free to contact me so I can learn. Also, I’m open to sharing our test scripts if anybody is interested.