Join me – Inverting Joins to Maintain SARGability

In this post, we’re going to look at a technique I first heard about from MVP and all round good guy called Rob Farley (@Rob_Farley | www.lobsterpot.com.au). He called it “Inverse Predicates” and some of you may have seen me speak about it a few times in the past at user groups and also SQLBits conferences.

So this concept is a work around for SARGability issues, that is the ability to use indexes. You see SQL Server can only use an index on a field if it can guarantee that the order of that data has not been altered, otherwise the index would be meaningless.

This means that most functions in SQL Server aren’t SARGable, that is they remove SQL Servers ability to use any indexes on a field.

So take the example of a online shop, they have a database with two tables, one called SalesInfo which stores information about the sale and one called ShippingInfo which stores information about when sales where shipped to the customer.

Now the MD of our imaginary sales company wants to know which sales that where shipped on a certain date where ordered the day before. That way he can see the lead times between sales and shipping for example.

So you write the query below:

SELECT * 
FROM Sales.SaleInfo AS SA
LEFT OUTER JOIN Sales.ShippingInfo AS SH
    ON SH.OrderKey = SA.OrderKey
    AND SH.ShipDate = DATEADD(dd,-1, SA.SaleDate)
WHERE SH.ShipDate = '2001-07-19'

This will give us the sales where they were sold on the 2001-07-18 and then shipped the next day on the 2001-07-19.

The query plan looks like this:

image

As you can see, we are getting an index seek on the ShippingInfo table but we are having to settle for a index scan on the SaleInfo table.

The reason for this is that the DateAdd function breaks the SARGability on the SaleDate column as it may very well change the order of the data as it is passed through.

So what can we do about this? Well we can invert the join! What do I mean by that? Well the issue is that DateAdd has broken the SARGability, so to get round that what we do is the below:

SELECT * 
FROM Sales.SaleInfo AS SA
LEFT OUTER JOIN Sales.ShippingInfo AS SH
    ON SH.OrderKey = SA.OrderKey 
    AND SH.ShipDate = DATEADD(dd,-1, SA.SaleDate)
    AND SA.SaleDate = DATEADD(dd,1, SH.ShipDate)
WHERE SH.ShipDate = '2001-07-19'

So the query is identical apart from now we have an additional join clause from the first iteration. What you’ll see is that the second and third join clauses are the inverse of each other. So what we’re saying is, give me orders where the ShipDate is equal to the SaleDate minus one day and that SaleDate is equal to ShipDate plus one day, which the same thing, just the other way round.

By doing this we now get the below plan:

image

As you can see we are now getting seeks on both sides! yay! Open-mouthed smile 

This is because we are giving the query optimiser the option to use the version of both fields that aren’t in the DateAdd functions as we inverted the join.

Well hope you enjoyed this post, any questions or comments please post them below or drop me a message on twitter (@TSQLNinja)

Enjoy!

Advertisements

Join Me – Semi Joins

Today we are going to be looking at “semi joins”. Semi joins are a logical operator (like join & left outer join for example) but aren’t a join we can explicitly tell SQL Server to use.

A bit of background before we begin, In my SQLBits X session, I spoke about the difference between logical and physical operators. In essence logical operators are things like JOIN, LEFT OUTER JOIN, SEMI JOIN etc and physical operators are how SQL Server actually physically executes that operation. So for example a LEFT OUTER JOIN may be physically executed by a NESTED LOOP JOIN, MERGE JOIN or HASH JOIN etc depending on the data volumes and cardinality / selectivity etc.

So back to the main story, semi joins are one of the ways that SQL Server can choose to implement certain operations. So lets take a look an example:

--lets borrow some data from adventureworks
SELECT 
    C.FirstName 
    ,C.LastName
    ,C.EmailAddress
INTO Contact
FROM AdventureWorks.Person.Contact AS C

GO

--get a subset of the email addresses
SELECT TOP 10000
    C.EmailAddress
INTO Emails
FROM AdventureWorks.Person.Contact AS C
GROUP BY C.EmailAddress
ORDER BY NEWID()

GO

ALTER TABLE dbo.Emails ALTER COLUMN EmailAddress NVARCHAR(50) NOT NULL
ALTER TABLE dbo.Contact ALTER COLUMN EmailAddress NVARCHAR(50) NOT NULL

So what we’re doing here is just building a quick table of all contact information from adventure works and ten thousand random(ish) Smile email addresses from that same table.

Then finally I’m just adding some constraints to keep the query plan tidy (try running the below query without adding these and see if you can figure out what’s going on, 10 TSQLNinja points for anyone that gets it correct Winking smile)

 

Now lets run some queries:

SELECT  C.FirstName ,
       
C.LastName ,
       
C.EmailAddress
FROMdbo.Contact ASC
WHEREC.EmailAddress NOT IN (SELECT E.EmailAddress
                         FROMdbo.Emails ASE)

 

This is asking for all the contact rows where their email address didn’t make it into the email list.  What you can see is we are using a NOT IN which is using a non-correlated sub query.

What we are effectively doing here is a join, we’re saying join these two data sets together, on the EmailAddress field from both tables, and give me the rows from contact where there’s no match.

Now obviously NOT IN isn’t a proper logical operator, so what SQL Server has to do is determine the best type of logical join to use and in some cases a semi join is the best choice.

Lets look at the query plan for the above query:

IMAGE1

 

As you can see, SQL Server has decided to use a “right anti semi join” (bit of a mouthful I know!). So lets break that term down logically:

  • Right – meaning return data from the bottom (right) operator, in this case the contact table
  • Anti – meaning not
  • Semi join – Well semi, from the Latin verb, means half, but how can we have half a join!?

 

Well you see what a semi join does is actually very simple, what it tells SQL Server is “I just need to know there’s data there, but I don’t care what that data is”. So to deal with this request SQL Server returns an empty row for every match on the join predicate (in out case the EmailAddress field). In doing so it correctly identifies that a matching row does indeed exist without the need to transfers the actual data for that row.

So going back to our break down, what the right anti semi join is saying is, “Give me all rows from the contact table (right) where there is no match (anti) from the email table (semi join)”, simples!

Now there are variants on this as below:

  • Left Anti Semi Join – All rows from the top operator where no match exists in the bottom operator
  • Left Semi Join – All rows from the top operator where a match does exist in the bottom operator
  • Right Anti Semi Join – All rows from the top operator where no match exists in the bottom operator
    • NB: this is the one we just looked at
  • Right Semi Join – All rows from the bottom operator where a match does exist in the top operator

 

Now in the query plan shown above, SQL Server has decided to use a nested loop join to physically implement this logical operator, however it can also be physically implemented by merge join and hash join as well.

 

Well I hope this has been informative, if you have any questions pop them in the comments section below (Nice comments about how great this post is are also always welcome Winking smile) or drop me a message on twitter (@TSQLNinja)

Enjoy!

 

Dave