How do you type 2?!

Hi all!

I sometimes get asked by all sorts of people how I deal with my slowly changing dimension tables, and I thought it would be interesting to do a post on how I generally approach it and then see how everyone else does it too. That way we can all learn from the best bits of everyone’s ideas! 🙂

Over the years the way I do this has evolved and changed from the experiences I have had (for better and for worse), and of course no two implementations are the same.

Now, being an MS bod, I generally use a combination of SSIS and SQL code for my ETL processing. Just before we crack on, I’d like to make one assumption clear: in my opinion, the SCD wizard transform/component of SSIS is pants!

It’s slow, memory hungry and clunky. I know some people do use it and, don’t get me wrong, it does work, but again, in my opinion, it’s not great.

So, with that assumption holding true, we need to deal with SCDs by hand, so let’s look at my simplistic template for how I approach the most common SCD: type 2.

Firstly, for each dimension that needs this SCD behaviour, I will set up what I call an MDCR table (Master Data Control Record) in the staging area (database) of my warehouse solution. Don’t ask me where this acronym came from because I honestly can’t remember! 🙂

So, if I have a dimension called dimProduct that requires type 2 behaviour, I’d create a table something like this:

CREATE TABLE MDCR_dimProduct
(
    ProductKey        INT NOT NULL,           -- The surrogate key for the dimension
    ProductNaturalKey VARCHAR(50) NOT NULL,   -- The natural or business key
    RowHash           VARBINARY(16) NOT NULL, -- A hash across all type 2 fields (HASHBYTES returns VARBINARY; an MD5 hash is 16 bytes)
    EffectiveDate     DATE NOT NULL,
    ExpirationDate    DATE NULL,
    MostRecent        BIT NOT NULL
)

OK, so what we have is the surrogate key for the dimension, the key for the product from the source system (the business key / BK), a hash value generated with the HASHBYTES() function using the MD5 algorithm, and the dates the record is effective from and to.

Now every record in the dimension, in this case dimProduct, has a corresponding record in here. By doing it this way I can avoid having any of the type 2 information actually in the dimension.

From here the process is quite straightforward: you bring in all the source data, ensuring that you add an MD5 hash to each row, calculated over the same columns that you hashed for the target dimension.
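As a rough sketch of how that hash might be generated (the staging table name stg.Product and the exact type 2 columns are illustrative assumptions; CONCAT needs SQL Server 2012 or later):

SELECT  S.ProductNaturalKey,
        S.ProductName,
        S.ProductType,
        -- Hash the type 2 columns; the '|' separator guards against
        -- value-boundary collisions like ('ab','c') vs ('a','bc')
        HASHBYTES('MD5', CONCAT(S.ProductName, '|', S.ProductType)) AS RowHASH
FROM stg.Product AS S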

This source data is then checked against our MDCR table, joining on the business key and MostRecent = 1. If the business key doesn’t exist then it’s a new record, so I add a new column called IsNew or similar and tag the record as a new entity.

If the business key does exist then it progresses to a second phase, in which we compare the hash value for the source row to the hash stored in the MDCR table. If the hash values match then no change has been made and this record’s journey stops there. If they do not match then I again use the IsNew field, this time tagging the record as an update.
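Expressed as plain SQL rather than SSIS components, the two phases together might look something like this sketch, again assuming the hashed source rows are sitting in stg.Product:

SELECT  S.ProductNaturalKey,
        S.ProductName,
        S.ProductType,
        S.RowHASH,
        -- 1 = brand new member, 0 = changed version of an existing member
        CASE WHEN M.ProductNaturalKey IS NULL THEN 1 ELSE 0 END AS IsNew
FROM stg.Product AS S
LEFT JOIN dbo.MDCR_dimProduct AS M ON
        M.ProductNaturalKey = S.ProductNaturalKey
        AND M.MostRecent = 1
WHERE   M.ProductNaturalKey IS NULL  -- new members
    OR  M.RowHash <> S.RowHASH       -- changed members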

Now I take both the new records and the update records and combine them before placing them in a pre-created temporary processing table.
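Going by the stored procedure further down, that processing table would be shaped something like this (the column lengths here are assumptions):

CREATE TABLE dbo.temp_dimProduct
(
    ProductNaturalKey VARCHAR(50) NOT NULL,
    ProductName       VARCHAR(100) NOT NULL, -- length assumed
    ProductType       VARCHAR(50) NOT NULL,  -- length assumed
    RowHASH           VARBINARY(16) NOT NULL,
    IsNew             BIT NOT NULL           -- 1 = new member, 0 = update
)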

[Image: Data Flow]

Once all relevant records have made it into this temporary processing table, I then run a stored procedure.

This stored procedure takes every row in the temporary processing table and adds a new record to our dimension table, using an OUTPUT clause on the insert to capture the newly added rows’ surrogate key values and business keys.

This may seem strange, but if you think about it, in a type 2 environment new dimension members are new records, but so are updates to existing members.

Once this is done we then take the surrogate key, business key and hash values for all of our new records and add them to the MDCR table.

The final step is then simply to set the MostRecent flag to 0 in the MDCR table for all previous versions of any dimension member (record) that has received an update.

The code for the update SP looks something like this:

CREATE PROCEDURE [dbo].[UpdateDimProduct]
AS
    SET NOCOUNT ON

    DECLARE @NewRecs TABLE
    (
        ProductKey INT,
        ProductNaturalKey VARCHAR(50),
        RowHASH    VARBINARY(16)
    )  

    -- add the new records and output the new product keys 
    INSERT INTO dbo.dimProduct
            ( ProductName ,
              ProductNaturalKey ,
              ProductType
            )
    OUTPUT 
        INSERTED.ProductKey
        ,INSERTED.ProductNaturalKey, NULL 
    INTO @NewRecs
    SELECT  TDP.ProductName ,
            TDP.ProductNaturalKey ,
            TDP.ProductType
    FROM dbo.temp_dimProduct AS TDP

    -- Get the record hash
    UPDATE NR
    SET NR.RowHASH = TDP.RowHASH
    FROM @NewRecs AS NR
    JOIN dbo.temp_dimProduct AS TDP ON
        NR.ProductNaturalKey = TDP.ProductNaturalKey  

    -- update the relevant maintenance records
    UPDATE MDP
    SET MDP.ExpirationDate = dateadd(dd, -1, GETDATE()),
        MDP.MostRecent = 0
    FROM dbo.MDCR_dimProduct AS MDP
    JOIN dbo.temp_dimProduct AS TDP ON
        MDP.ProductNaturalKey = TDP.ProductNaturalKey
        AND MDP.MostRecent = 1
        AND TDP.IsNew = 0  

    --add the new maintenance record
    INSERT INTO dbo.MDCR_dimProduct
            ( ProductKey ,
              ProductNaturalKey ,
              EffectiveDate ,
              ExpirationDate ,
              MostRecent ,
              RowHASH
            )
    SELECT  NR.ProductKey ,
            NR.ProductNaturalKey ,
            GETDATE() AS EffectiveDate,
            NULL AS ExpirationDate,
            1 AS MostRecent,
            NR.RowHASH 
    FROM @NewRecs AS NR

    SET NOCOUNT OFF

And that’s about it.

As I said, this is a very broad description of how I generally approach type 2 dimensions. I hope this has been helpful, and I’d love to hear how you type 2 as well!

Enjoy

Dave


SQL Saturday 194 in Exeter – Part 2

Hi all, as promised here is my follow-up post with the slides and demos from both the pre con and my session on the Saturday. The files can be found below; the pre con files are protected with the password Dave and I gave out on the day (my Saturday session files are not password protected).

https://dl.dropbox.com/u/81661862/ProceduralVsSetBased.zip

https://dl.dropbox.com/u/81661862/SQL%20Sat%20Exeter.zip

https://dl.dropbox.com/u/81661862/SQL%20Server%202012.zip

I also just wanted to take the time to share something really nice that we received from the event: the session feedback. I can’t emphasise enough how much we as speakers appreciate it when people take the time to give feedback, especially comments on top of the scores.

It’s also really nice when all the scores are high numbers 🙂 If you have ever been, or are ever, lucky enough to present a conference session or a pre con, you’ll know that a lot of work goes into it, and getting feedback makes it all worthwhile. Good feedback is awesome, but any constructive criticism is always welcome too. I personally don’t mind if you give me a low score as long as you tell me why; no one is perfect, and this kind of feedback is the best way to improve.

That being said, the two sets of feedback I got from SQL Saturday in Exeter were amazing. It’s genuinely humbling, and it means a lot when people give nice feedback and say that it has helped them.

So, firstly I want to share the scores from the pre con session Dave Ballantyne and I gave. By the way, all scores are out of 5.

[Image: pre con feedback scores]

As you can see, we got pretty much full marks in all sections; again, this is great and we really appreciate it. The comments really made us go “wow”. Again, amazing.

Then there’s the feedback for my session on the Saturday; again, scores are out of 5.

[Image: Saturday session feedback scores]

Again I was really amazed at the scores and the comments. Overall I’m totally overwhelmed by how positive everyone was and all the nice things people said: in the feedback, on the day and also via Twitter.

[Images: a selection of tweets from after the event]

Above is a selection of a couple of the really nice things that people said after the event.

Now you might be asking, “Dave, why are you posting all this ego stroking material?” Well, for a few reasons: to say thank you to everyone that took the time to leave the great feedback, and to point out that leaving feedback really helps everyone, even the organisers, as it lets them gauge how the event went down.

And finally, to share it so that anyone thinking about speaking for the first time can see what to expect, and to show that generally everyone is really nice and that all feedback is good, even constructive criticism.

Anyway that’s it for now

Thanks

Dave!

SQL Saturday 194 in Exeter – Looking Back

So this weekend finally saw SQL Saturday Exeter arrive; for me it’s felt like ages in the build-up, so it was great when the time finally came!
I travelled up on the Thursday night ready for the pre con Dave Ballantyne and I were doing on the Friday.

We started the Friday morning the breakfast of champions way!

[Image: the breakfast of champions]

As you can see, it was a good way to start the day!

We then went on to our room and waited for our awesome delegates to arrive. We actually “won” the day in terms of numbers, having about 16 attendees, something we didn’t hesitate to rib the other pre con trainers about 😉

Having never done a pre con before I wasn’t sure what to expect, but the whole day was great: everyone that attended was interested and engaged and asking loads of questions, which is fantastic.

It all went mostly without a hitch, other than the “demo gods” messing with us once or twice. The best of these was me “losing” a demo I wanted to run, spending a couple of minutes trying to find it to no avail, and then stumbling across it about 40 minutes later, to the sound of much laughter from the attendees I might add! 😉


We broke the day up into four chunks, with Dave and I alternating between sections and chipping in on each other’s segments too, which seemed to work really well.


A great day of training followed by the speaker meal and a few beers. Miss ninja also came down and joined us for the evening with a view to coming along on the Saturday.

Saturday came and the main event was upon us; it was great to see all the speakers from the SQL family there, and loads of engaged attendees too.

I did my session at ten forty-five and had a full room of engaged people, again asking lots of interesting questions. I was even graced by Jamie Thompson, who asked a few questions and kept me on my toes! 🙂

We did unfortunately slip off a little early, as miss ninja and I started to feel a bit under the weather, but all in all an awesome weekend was had!

I’m on a train as I write this, so I will update the post later today with a password-protected zip file containing all the slides and demos from both the pre con and the Saturday session.

As a final thing, I want to say a massive thank you to all the organisers and helpers: you guys made this an awesome event and I hope everyone recognises your hard work.

Anyway that’s it for now

Dave

Running year number across date ranges

Hi guys,

I recently had an interesting SQL problem passed to me, and I thought I’d share the solution with you all as I couldn’t find anything nice online about it.

Background

So the issue was that we had a set of data that recorded people’s membership dates (let’s say they were gym memberships). The table looked like this (roughly 🙂):

  • Person_ID
  • DateFrom
  • DateTo

The requirement was to run a query that would return a year number for each row that indicated how many contiguous years the person had held a membership at that point in time. The additional complexity with this was that memberships that had a gap of less than or equal to 365 days should also be considered contiguous.

I thought about all kinds of approaches to this (and I’m sure there are some I didn’t think of), but eventually I decided that I needed to do the following:

  • Assign every record a unique ID
  • Assign every record a parent ID, which will be the unique ID (created above) of the record that comes before it, where applicable.
  • Use a recursive CTE to work out the year number

The Code

So let’s set up the test data:

CREATE TABLE Memberships
(
    MembershipID INT NOT NULL IDENTITY,
    ParentID INT NULL,
    PersonID INT NOT NULL,
    StartDate DATE NOT NULL,
    EndDate DATE NOT NULL
)

GO

Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '1999-10-02', '2000-10-01')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '2000-10-02', '2001-12-28')
Insert into Memberships(PersonID,  StartDate, EndDate)
Values(10154997, '2002-01-10', '2003-02-10')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '2003-04-08', '2003-09-15')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '2003-11-26', '2004-03-15')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '2004-10-05', '2004-10-06')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '2005-02-02', '2005-02-09')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '2005-02-24', '2005-03-01')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '2006-09-28', '2007-01-07')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(10154997, '2008-11-11', '2009-01-12')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(613274, '1982-08-06', '1983-08-05')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(613274, '1983-08-06', '1984-08-05')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(613274, '1984-08-06', '1985-08-05')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(613274, '1985-08-06', '1986-08-05')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(613274, '1986-08-06', '1987-08-05')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(613274, '1987-08-06', '1988-08-05')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(613274, '1988-08-06', '1989-08-05')
Insert into Memberships(PersonID,  StartDate, EndDate) 
Values(613274, '1989-08-06', '1990-08-05')

GO

So firstly, for each record, we need to get the ID of any record whose date range falls no more than 365 days behind it and set the parent ID with it:

As you can see below, we join the table to itself twice: firstly to get a list of all memberships that ended no more than 365 days before this one started, and secondly to ensure that if more than one record meets that requirement, only the closest one is returned as the “parent”.

UPDATE M
SET M.ParentID = M2.MembershipID 
FROM dbo.Memberships AS M
JOIN dbo.Memberships AS M2 ON 
        M.PersonID = M2.PersonID

        -- Where the enddate is less than 365 days away from my start date
        AND DATEDIFF(dd,M2.EndDate, M.StartDate) <= 365

        -- Stops dates before the start date coming through
        AND DATEDIFF(dd,M2.EndDate, M.StartDate) > 0 

        -- This little subquery makes sure that if there are multiple
        -- memberships ending within 365 days before my start date, only
        -- the closest one is returned.
        AND M2.StartDate =
        (
            SELECT Max(X.StartDate) 
            FROM dbo.Memberships AS X 
            WHERE X.PersonID = M.PersonID
            AND DATEDIFF(dd,X.EndDate, M.StartDate) <= 365
            AND DATEDIFF(dd,X.EndDate, M.StartDate) > 0
         );
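As an optional sanity check (not part of the original solution), a quick select at this point should show each row pointing at its predecessor, with a NULL ParentID marking the start of each contiguous chain:

SELECT M.PersonID, M.MembershipID, M.ParentID, M.StartDate, M.EndDate
FROM dbo.Memberships AS M
ORDER BY M.PersonID, M.StartDate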

Once this bit is done, it’s then simply a matter of a reasonably simple recursive CTE:

WITH DateGroups (ParentID, MembershipID, PersonID, 
    StartDate, EndDate, YearNum)
AS
(
    SELECT  
        M.ParentID,
        M.MembershipID,
        M.PersonID ,
        M.StartDate ,
        M.EndDate ,
        1 AS YearNum
    FROM dbo.Memberships AS M 
    WHERE M.ParentID IS NULL

    UNION ALL 

    SELECT  
        M2.ParentID,
        M2.MembershipID,
        M2.PersonID ,
        M2.StartDate ,
        M2.EndDate ,
        DG.YearNum + 1
    FROM dbo.Memberships AS M2 
        JOIN DateGroups AS DG ON 
        M2.ParentID = DG.MembershipID
)

SELECT  DG.ParentID ,
        DG.MembershipID ,
        DG.PersonID ,
        DG.StartDate ,
        DG.EndDate ,
        DG.YearNum
FROM DateGroups AS DG
ORDER BY DG.PersonID, DG.StartDate, 
    DG.YearNum

and this then gives us the result set:

[Image: the result set, with a YearNum that increments along each contiguous membership chain and resets to 1 after a gap of more than 365 days]

So that’s it really. Dates can be a real pain to work with, but I found this to be a nice, elegant solution, and it’s nice and quick too.

A quick mention and thank you to my colleague Richard Thomas (@TR5AD) for passing this original problem my way.

Enjoy!

Dave


Shredding CSV strings in SQL Server

Hi all!

So today I thought I’d go over how to break out a comma separated string stored in SQL Server into individual rows.

This can be a real PITA; if you’ve ever had to do this you’ll know why. Before I start I just want to say that this is a simplistic example with one main assumption: each of the values within the string you want to split out is of a fixed width. The number of values can be totally variable, but the values within the string must be of a static length.

If you have a string with variable-width values this approach isn’t what you need, however it might give you a starting point 🙂

So let’s assume we are working with a single field called “Version”, for example, which contains strings such as:

[Image: sample Version strings, each a comma separated list of fixed-width version numbers]

What I want to do is get a row per version. Now, I’ve seen people use while loops, cursors and all sorts of craziness to solve this issue, when in fact it can be done in a single statement using our old friend the numbers table!

So the first challenge is to determine how many values we are working with. I find the simplest way to do this is to count the number of delimiters and add one, something like the following:

(LEN([Version]) - LEN(REPLACE([Version], ',', ''))) + 1

What I’m doing here is taking the original length of the column we want to split and subtracting the length of the same column with the delimiters removed, giving me the number of delimiters in the string. I then simply add one to that number to get the count of values I have to extract. Simples! 😀
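As a quick sanity check with a made-up value (each version here being four characters wide):

SELECT (LEN('10.2,30.4,50.6')
      - LEN(REPLACE('10.2,30.4,50.6', ',', ''))) + 1 AS ValueCount
-- 14 characters, 12 without the commas = 2 delimiters, + 1 = 3 values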

OK, so now we have the number of values, how do I go about breaking them out? Well firstly, as mentioned earlier, we are going to need a numbers table. If you have one of these already in your database, great; if not, there is actually a built-in one you can use! You can see this table for yourself by running the following:

SELECT N.number
FROM master..spt_values AS N
WHERE N.type = 'P'

Cool, no? So we have a numbers table and the number of items we need to work with, so let’s start building the query:

SELECT S.[Version], N.number
FROM dbo.Software AS S,
master..spt_values AS N
WHERE N.type = 'P'
AND N.number <= (LEN(S.[Version]) - LEN(REPLACE(S.[Version], ',', '')))

Firstly, yes, I KNOW comma “joins” are the devil’s plaything, but in this case it’s fine, honest! 😀

This will give us a row per item in the list. You’ll notice that I’ve left off the +1 on the end of the LEN expression at the bottom and used <= instead of <. This is because the numbers table is base 0, which is very important to how this process works: for k delimiters we need the numbers 0 through k, giving us our k + 1 values.

So now we add the following to the select clause:

SUBSTRING(
    REPLACE(REPLACE(S.[Version], ' ', ''), ',', ''),
( 4 * N.number ) + 1, 4) AS VersionNum

This might look a bit crazy, but stick with it. The middle line of the three removes all spaces and commas (you need to remove anything that is not part of the values you wish to split out).

Then we use the result of that as the source for the substring. The “4 * N.number” part is important because 4 is the length I know each of my values to be, so modify this number to suit your data. The “+ 1” part is due to the fact that the numbers table is base 0 while SUBSTRING is 1-based.

The final “,4” is how many characters you wish to extract from that value. As I wanted the whole value I have used 4; if, for example, I only wanted the part in front of the decimal place (referring back to my original example at the beginning of the post) then I’d change just this value to “,2”.
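Putting all the pieces together, the whole thing ends up as one statement, something like:

SELECT  S.[Version],
        SUBSTRING(
            REPLACE(REPLACE(S.[Version], ' ', ''), ',', ''),
            (4 * N.number) + 1, 4) AS VersionNum
FROM dbo.Software AS S,
     master..spt_values AS N
WHERE N.type = 'P'
AND N.number <= (LEN(S.[Version]) - LEN(REPLACE(S.[Version], ',', '')))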

The end result looks like this:

[Image: the result set, with one row per individual version number]

So we’ve taken our original “Version” string and created a row per individual version number, all in a set based manner without cursors or while loops, which will almost always be more performant.

Hope this helps you out of string shredding hell 🙂

Enjoy!

Dave

Destructive Selects – Fun With Deletes

Hi peeps! Today we are going to look at a concept I call destructive selects.

Imagine this scenario: you have a table that contains units of work, maybe a source for a data driven subscription or some sort of ETL processing. Obviously you only want to action each unit of work once (i.e. send the report once or process the row once), so you need to remove the record from the unit of work table or it’ll keep getting actioned.

Now you could select from the table and then go back and delete the relevant row(s). However this is two separate operations, which could add issues such as locking and blocking, especially if you’re doing this to support multithreading in an ETL process as per my example below. Not to mention you’ve got two operations to generate a plan for, assign resources to and execute.

The destructive selects method allows you to do this all in one operation. So let’s see my example below:

[Image: SSIS data flow with four data sources reading in parallel from a single unit of work table]

Now, I’ve seen things similar to the above done to take advantage of multithreading / multiple cores / concurrency in SSIS. The idea is that you have all four sources feeding from a single table but doing the work in parallel. In this example you only want each row once or you’ll end up with duplicates in the final destination, and having to do a separate select and then delete would be a nightmare. So what we do is put the below SQL in each data source:

DELETE TOP (100)
FROM dbo.UnitOfWorkTable
OUTPUT DELETED.Col1, DELETED.Col2, DELETED.Col3

This is a destructive select: you have removed the data from the original “UnitOfWork” table and “selected” it out to the data source.

That’s all there is to it really. As I said at the beginning of the post, it’s a neat little tip for things like the above example, but also for stuff like ad hoc deliveries using data driven subscriptions, ensuring that each row is only ever processed once.

The only caveat I’d add here is that you need to make sure you have a system in place to account for errors so that no rows “go missing”. For my example, I personally would pipe any rows that generate errors into a second output file or table and continue processing the rest.
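When running the destructive select from plain T-SQL rather than an SSIS data source, one possible way to guard against losing rows (a sketch, with assumed column types) is to wrap the whole thing in a transaction so that a processing failure puts the deleted rows back:

BEGIN TRY
    BEGIN TRANSACTION

    DECLARE @Batch TABLE (Col1 INT, Col2 INT, Col3 INT) -- types assumed

    DELETE TOP (100)
    FROM dbo.UnitOfWorkTable
    OUTPUT DELETED.Col1, DELETED.Col2, DELETED.Col3
    INTO @Batch

    -- ... process the rows in @Batch here ...

    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    -- Rolling back undoes the delete, so the unprocessed rows
    -- stay in the unit of work table rather than "going missing"
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION
END CATCH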

Well, thanks for reading. Please post in the comments section below if you have any questions, or drop me a message on Twitter at @TSQLNinja.

Enjoy!

Join me – Inverting Joins to Maintain SARGability

In this post we’re going to look at a technique I first heard about from MVP and all-round good guy Rob Farley (@Rob_Farley | www.lobsterpot.com.au). He called it “Inverse Predicates”, and some of you may have seen me speak about it a few times in the past at user groups and SQLBits conferences.

So this concept is a workaround for SARGability issues (SARGability essentially being the ability of a predicate to use indexes). You see, SQL Server can only use an index on a field if it can guarantee that the order of that data has not been altered; otherwise the index would be meaningless.

This means that most functions in SQL Server aren’t SARGable; that is, wrapping a field in them removes SQL Server’s ability to use any indexes on that field.
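As a quick illustration of the difference (a generic example, not from the shop scenario below):

-- Non-SARGable: the function is applied to the column, forcing a scan
SELECT * FROM Sales.SaleInfo WHERE DATEADD(dd, 1, SaleDate) = '2001-07-19'

-- SARGable: move the arithmetic to the literal so the column is left alone
SELECT * FROM Sales.SaleInfo WHERE SaleDate = DATEADD(dd, -1, '2001-07-19')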

So take the example of an online shop: they have a database with two tables, one called SaleInfo, which stores information about the sale, and one called ShippingInfo, which stores information about when sales were shipped to the customer.

Now the MD of our imaginary sales company wants to know which sales shipped on a certain date were ordered the day before, so that he can see the lead time between sale and shipping, for example.

So you write the query below:

SELECT * 
FROM Sales.SaleInfo AS SA
LEFT OUTER JOIN Sales.ShippingInfo AS SH
    ON SH.OrderKey = SA.OrderKey
    AND SH.ShipDate = DATEADD(dd,-1, SA.SaleDate)
WHERE SH.ShipDate = '2001-07-19'

This will give us the sales that were sold on 2001-07-18 and then shipped the next day, 2001-07-19.

The query plan looks like this:

[Image: query plan with an index seek on ShippingInfo but an index scan on SaleInfo]

As you can see, we are getting an index seek on the ShippingInfo table but we are having to settle for an index scan on the SaleInfo table.

The reason for this is that the DateAdd function breaks the SARGability on the SaleDate column as it may very well change the order of the data as it is passed through.
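For reference, the plans here assume that indexes leading on the two date columns exist, something like the below (the index names are made up):

CREATE INDEX IX_SaleInfo_SaleDate ON Sales.SaleInfo (SaleDate)
CREATE INDEX IX_ShippingInfo_ShipDate ON Sales.ShippingInfo (ShipDate)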

So what can we do about this? Well, we can invert the join! What do I mean by that? Well, the issue is that DateAdd has broken the SARGability, so to get round it we do the below:

SELECT * 
FROM Sales.SaleInfo AS SA
LEFT OUTER JOIN Sales.ShippingInfo AS SH
    ON SH.OrderKey = SA.OrderKey 
    AND SH.ShipDate = DATEADD(dd,-1, SA.SaleDate)
    AND SA.SaleDate = DATEADD(dd,1, SH.ShipDate)
WHERE SH.ShipDate = '2001-07-19'

So the query is identical to the first iteration apart from the additional join clause. What you’ll see is that the second and third join clauses are the inverse of each other. So what we’re saying is: give me orders where the ShipDate is equal to the SaleDate minus one day, and the SaleDate is equal to the ShipDate plus one day, which is the same thing, just the other way round.

By doing this we now get the below plan:

[Image: query plan with index seeks on both tables]

As you can see, we are now getting seeks on both sides! Yay! 😀

This is because, having inverted the join, we’ve given the query optimiser a version of each predicate where the indexed field isn’t wrapped in a DateAdd function.

Well, I hope you enjoyed this post. Any questions or comments, please post them below or drop me a message on Twitter (@TSQLNinja).

Enjoy!