Today we are going to be looking at “semi joins”. Semi joins are a logical operator (like join & left outer join for example) but aren’t a join we can explicitly tell SQL Server to use.
A bit of background before we begin, In my SQLBits X session, I spoke about the difference between logical and physical operators. In essence logical operators are things like JOIN, LEFT OUTER JOIN, SEMI JOIN etc and physical operators are how SQL Server actually physically executes that operation. So for example a LEFT OUTER JOIN may be physically executed by a NESTED LOOP JOIN, MERGE JOIN or HASH JOIN etc depending on the data volumes and cardinality / selectivity etc.
So back to the main story, semi joins are one of the ways that SQL Server can choose to implement certain operations. So lets take a look an example:
--lets borrow some data from adventureworks SELECT C.FirstName ,C.LastName ,C.EmailAddress INTO Contact FROM AdventureWorks.Person.Contact AS C GO --get a subset of the email addresses SELECT TOP 10000 C.EmailAddress INTO Emails FROM AdventureWorks.Person.Contact AS C GROUP BY C.EmailAddress ORDER BY NEWID()
ALTER TABLE dbo.Emails ALTER COLUMN EmailAddress NVARCHAR(50) NOT NULL
ALTER TABLE dbo.Contact ALTER COLUMN EmailAddress NVARCHAR(50) NOT NULL
So what we’re doing here is just building a quick table of all contact information from adventure works and ten thousand random(ish) email addresses from that same table.
Then finally I’m just adding some constraints to keep the query plan tidy (try running the below query without adding these and see if you can figure out what’s going on, 10 TSQLNinja points for anyone that gets it correct )
Now lets run some queries:
SELECT C.FirstName ,
WHEREC.EmailAddress NOT IN (SELECT E.EmailAddress
This is asking for all the contact rows where their email address didn’t make it into the email list. What you can see is we are using a NOT IN which is using a non-correlated sub query.
What we are effectively doing here is a join, we’re saying join these two data sets together, on the EmailAddress field from both tables, and give me the rows from contact where there’s no match.
Now obviously NOT IN isn’t a proper logical operator, so what SQL Server has to do is determine the best type of logical join to use and in some cases a semi join is the best choice.
Lets look at the query plan for the above query:
As you can see, SQL Server has decided to use a “right anti semi join” (bit of a mouthful I know!). So lets break that term down logically:
- Right – meaning return data from the bottom (right) operator, in this case the contact table
- Anti – meaning not
- Semi join – Well semi, from the Latin verb, means half, but how can we have half a join!?
Well you see what a semi join does is actually very simple, what it tells SQL Server is “I just need to know there’s data there, but I don’t care what that data is”. So to deal with this request SQL Server returns an empty row for every match on the join predicate (in out case the EmailAddress field). In doing so it correctly identifies that a matching row does indeed exist without the need to transfers the actual data for that row.
So going back to our break down, what the right anti semi join is saying is, “Give me all rows from the contact table (right) where there is no match (anti) from the email table (semi join)”, simples!
Now there are variants on this as below:
- Left Anti Semi Join – All rows from the top operator where no match exists in the bottom operator
- Left Semi Join – All rows from the top operator where a match does exist in the bottom operator
- Right Anti Semi Join – All rows from the top operator where no match exists in the bottom operator
- NB: this is the one we just looked at
- Right Semi Join – All rows from the bottom operator where a match does exist in the top operator
Now in the query plan shown above, SQL Server has decided to use a nested loop join to physically implement this logical operator, however it can also be physically implemented by merge join and hash join as well.
Well I hope this has been informative, if you have any questions pop them in the comments section below (Nice comments about how great this post is are also always welcome ) or drop me a message on twitter (@TSQLNinja)