How do you type 2?!

Hi all!

I sometimes get asked from all different people how I deal with my slowly changing dimension tables and I thought it would be interesting to do a post on how I generally approach it and then see how everyone else does it too. That way we can all learn from the best bits of everyone’s ideas! Smile

Over the years the way I do this has evolved and changed from the experiences I have had (for better and for worse), and of course no two implementations are the same.

Now being an MS bod I generally use a combination of SSIS & SQL code for my ETL processing. Just before we crack on, I’d just like to make one assumption clear, in my opinion the SCD wizard transform / component of SSIS is pants!

It’s slow, memory hungry and clunky. I know some people do use it and don’t get me wrong it does work but, again in my opinion, it’s not great.

So, that assumption holding true we need to deal with SCD’s by hand so lets look at my simplistic template to how I approach the most common SCD, type 2.

Firstly what I’ll do is, for each dimension that needs this SCD behaviour,  I will set up what I call a MDCR table (Master Data Control Record) in the staging area (database) for my warehouse solution. Don’t ask me where this acronym came from because I honestly can’t remember!  Smile

So, if I have a dimension called dimProduct that requires type 2 behaviour I’d create a table something like this

CREATE TABLE MDCR_dimProduct( ProductKey INT NOT NULL, — The surrogate key for the dimension ProductNaturalKey VARCHAR(50) NOT NULL, — The Natural or Business key

RowHash INT NOT NULL, — A hash across all type 2 fields in the dimension EffectiveDate DATE NOT NULL, ExpirationDate DATE NULL, MostRecent BIT NOT NULL)

Ok, so what we have is the surrogate key for the dimension, the key for the customer from the source system (business key / BK), a hash value generated with the HASHBYTES() function with the MD5 hash and the dates the record is effective from and to.

Now every record in the dimension, in this case dimProduct, has a  corresponding record in here. In doing it this way I can avoid having any of the type 2 information actually in the dimension.

From here the process is quite straight forward, you bring in all the source data, ensuring that you add an MD5 hashto each row on the corresponding columns to those that you used in the target dimension.

This source data is then checked it against our MDCR table, joining on the business key and MostRecent =1. If the business key doesn’t exist then its a new record so I add a new column called IsNew or similar and tag the record as a new entity.

If the business key does exist then it progresses onto a second phase. In this second phase we compare the hashvalues for the source row to the hashstored in the MDCR table. If the hash values match then no change has been made and this record’s journey stops there. If they do not match then again I add a field called IsNew and the record gets tagged as an update.

Now I take both the new records and update records and combine them together before placing them in a pre created temporary processing table.

Data Flow

Once all relevant records have made it into this temporary processing table I then run a stored procedure.

This stored procedure takes every row in the temporary processing table and adds a new record to our dimension table, using an OUTPUT on the insert to ensure we get the newly added rows surrogate key values and business key.

This may seem strange but if you think about it, in a type two environment, new dimension members are new records but so are updates to existing members.

Once this is done we then take the surrogate key, business key and hash values for all of our new records and add them to the MDCR table.

The final step is to then simply set the MostRecent flag to 0 in the MDCR table for all previous versions of a dimension member (record) that has received an update

The code for the update SP looks something like this

CREATE PROCEDURE [dbo].[UpdateDimProduct]
AS
    SET NOCOUNT ON

    DECLARE @NewRecs TABLE
    (
        ProductKey INT,
        ProductNaturalKey VARCHAR(50),
        RowHASH    VARBINARY(16)
    )  

    -- add the new records and output the new product keys 
    INSERT INTO dbo.dimProduct
            ( ProductName ,
              ProductNaturalKey ,
              ProductType
            )
    OUTPUT 
        INSERTED.ProductKey
        ,INSERTED.ProductNaturalKey, NULL 
    INTO @NewRecs
    SELECT  TDP.ProductName ,
            TDP.ProductNaturalKey ,
            TDP.ProductType
    FROM dbo.temp_dimProduct AS TDP

    -- Get the record hash
    UPDATE NR
    SET NR.RowHASH = TDP.RowHASH
    FROM @NewRecs AS NR
    JOIN dbo.temp_dimProduct AS TDP ON
        NR.ProductNaturalKey = TDP.ProductNaturalKey  

    -- update the relevant maintence records
    UPDATE MDP
    SET MDP.ExpirationDate = dateadd(dd, -1, GETDATE()),
        MDP.MostRecent = 0
    FROM dbo.MDCR_dimProduct AS MDP
    JOIN dbo.temp_dimProduct AS TDP ON
        MDP.ProductNaturalKey = TDP.ProductNaturalKey
        AND MDP.MostRecent = 1
        AND TDP.IsNew = 0  

    --add the new maintenance record
    INSERT INTO dbo.MDCR_dimProduct
            ( ProductKey ,
              ProductNaturalKey ,
              EffectiveDate ,
              ExpirationDate ,
              MostRecent ,
              RowHASH
            )
    SELECT  NR.ProductKey ,
            NR.ProductNaturalKey ,
            GETDATE() AS EffectiveDate,
            NULL AS ExpirationDate,
            1 AS MostRecent,
            NR.RowHASH 
    FROM @NewRecs AS NR

    SET NOCOUNT OFF

And that’s about it.

As I said this is a very broad description of how I generally approach type 2 dimensions. So, hope this has been helpful and I love to hear how you type 2 as well!

Enjoy

Dave

Advertisements