Saturday, December 10, 2011

Slowly Changing Facts

I had a design challenge on a project a while back where I needed to be able to capture versions of the fact table data and allow the users to analyze the facts as they looked at a chosen point in time.  Sales analyses needed to reflect original amounts (as the fact originally looked), but all other analyses needed to reflect actual fact amounts (which commonly got overwritten with updates from the collection process).  This was further complicated by the fact that the source system was a hosted solution and the vendor was incapable of providing incremental extracts (legacy system of course).  So we simply got a full dump of history every night.  This was a tough one.



So as I started to design a solution, I looked for examples and came across an excellent resource from the masters at the Kimball Group.  In their book "The Kimball Group Reader", article 10.22 (Design Tip in the link below) they describe a similar design challenge that ultimately helped me deliver a solution.  Their design tip focuses on making a DW ready for a Compliance audit. Here's a link to the tip and a picture of their design.  I highly recommend the book too, as it covers a ton of interesting dimensional design challenges and how they've solved them over the years.  It's a collection of their design tips, but also includes updates and additional content, so definitely worth a read.


So the way the solution works is that the fact table changes are simply treated like SCD type 2 changes that are commonly used with dimension processing. Add a Start and End Effective Date to each row to track the versions and allow point-in-time analysis. Also, in addition to what the Kimball model shows, I ended up using a currentFlag to make it easier for my client to filter out history (similar to how one would do on a dimension with history).

With historical fact changes handled, let's focus on dimension changes.  In this case, the complication is that typically an SCD2 change in a dimension spawns a new surrogate key for the dimension row.  And that key would need to be embedded in the fact table as a foreign key.  But what if the fact row didn't have a change?  You would need to create a new fact row with the new dimension surrogate key to keep the relationship intact.  Obviously, that would cause some serious growth in your fact table.  Row splits without much benefit. 
Fortunately, that approach isn't necessary.  The Kimball article lays out a dimension design where you keep the same surrogate key on each Type 2 dimension row, but manage the End Dates appropriately. Equivalent of assigning a DW surrogate key for the master entity (not each SCD2 row) to replace the business key. Then use that master key, in combination with the End Effective Date, to define a unique row (or PK on the table). 
The reality with this approach is that you have a many-to-many between the fact table and the dimension on the dimension master key.  So when querying the dimension and fact table, you must force your query/anlaysis tools to choose a "version" or as-of date that falls between the dimension effective dates.  This allows you to avoid bringing back all combinations and resulting in overstated measurements. Makes the ETL and the presentation layer a bit more challenging, but avoids the fact table headaches that would have resulted from the row splitting approach.
Hope that is helpful.  I'll write more about the challenges of dealing with a hosted legacy system that can't provide incrementals in another post.  That was a fun one.

No comments:

Post a Comment