Join Me – Semi Joins

Today we are going to be looking at “semi joins”. Semi joins are a logical operator (like join & left outer join for example) but aren’t a join we can explicitly tell SQL Server to use.

A bit of background before we begin, In my SQLBits X session, I spoke about the difference between logical and physical operators. In essence logical operators are things like JOIN, LEFT OUTER JOIN, SEMI JOIN etc and physical operators are how SQL Server actually physically executes that operation. So for example a LEFT OUTER JOIN may be physically executed by a NESTED LOOP JOIN, MERGE JOIN or HASH JOIN etc depending on the data volumes and cardinality / selectivity etc.

So back to the main story, semi joins are one of the ways that SQL Server can choose to implement certain operations. So lets take a look an example:

--lets borrow some data from adventureworks
SELECT 
    C.FirstName 
    ,C.LastName
    ,C.EmailAddress
INTO Contact
FROM AdventureWorks.Person.Contact AS C

GO

--get a subset of the email addresses
SELECT TOP 10000
    C.EmailAddress
INTO Emails
FROM AdventureWorks.Person.Contact AS C
GROUP BY C.EmailAddress
ORDER BY NEWID()

GO

ALTER TABLE dbo.Emails ALTER COLUMN EmailAddress NVARCHAR(50) NOT NULL
ALTER TABLE dbo.Contact ALTER COLUMN EmailAddress NVARCHAR(50) NOT NULL

So what we’re doing here is just building a quick table of all contact information from adventure works and ten thousand random(ish) Smile email addresses from that same table.

Then finally I’m just adding some constraints to keep the query plan tidy (try running the below query without adding these and see if you can figure out what’s going on, 10 TSQLNinja points for anyone that gets it correct Winking smile)

 

Now lets run some queries:

SELECT  C.FirstName ,
       
C.LastName ,
       
C.EmailAddress
FROMdbo.Contact ASC
WHEREC.EmailAddress NOT IN (SELECT E.EmailAddress
                         FROMdbo.Emails ASE)

 

This is asking for all the contact rows where their email address didn’t make it into the email list.  What you can see is we are using a NOT IN which is using a non-correlated sub query.

What we are effectively doing here is a join, we’re saying join these two data sets together, on the EmailAddress field from both tables, and give me the rows from contact where there’s no match.

Now obviously NOT IN isn’t a proper logical operator, so what SQL Server has to do is determine the best type of logical join to use and in some cases a semi join is the best choice.

Lets look at the query plan for the above query:

IMAGE1

 

As you can see, SQL Server has decided to use a “right anti semi join” (bit of a mouthful I know!). So lets break that term down logically:

  • Right – meaning return data from the bottom (right) operator, in this case the contact table
  • Anti – meaning not
  • Semi join – Well semi, from the Latin verb, means half, but how can we have half a join!?

 

Well you see what a semi join does is actually very simple, what it tells SQL Server is “I just need to know there’s data there, but I don’t care what that data is”. So to deal with this request SQL Server returns an empty row for every match on the join predicate (in out case the EmailAddress field). In doing so it correctly identifies that a matching row does indeed exist without the need to transfers the actual data for that row.

So going back to our break down, what the right anti semi join is saying is, “Give me all rows from the contact table (right) where there is no match (anti) from the email table (semi join)”, simples!

Now there are variants on this as below:

  • Left Anti Semi Join – All rows from the top operator where no match exists in the bottom operator
  • Left Semi Join – All rows from the top operator where a match does exist in the bottom operator
  • Right Anti Semi Join – All rows from the top operator where no match exists in the bottom operator
    • NB: this is the one we just looked at
  • Right Semi Join – All rows from the bottom operator where a match does exist in the top operator

 

Now in the query plan shown above, SQL Server has decided to use a nested loop join to physically implement this logical operator, however it can also be physically implemented by merge join and hash join as well.

 

Well I hope this has been informative, if you have any questions pop them in the comments section below (Nice comments about how great this post is are also always welcome Winking smile) or drop me a message on twitter (@TSQLNinja)

Enjoy!

 

Dave

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s