Need help with a Not Exists sub query please

Question

Post reply

Need help with a Not Exists sub query please

mrwillcostello

SSC Eights!

Points: 925
More actions
December 5, 2011 at 4:25 pm

#247322

List the empid, lastname, and firstname of all employees who did not take an order in the

month of February, 2007. Write a correlated subquery that uses NOT EXISTS.

SELECT DISTINCT O.empid, HR.firstname, HR.lastname

FROM Sales.Orders AS O

JOIN HR.Employees AS HR

ON O.empid = HR.empid

WHERE NOT EXISTS

(SELECT empid

FROM Sales.Orders

WHERE YEAR (orderdate) = '2007' AND

MONTH (orderdate) = '02')

What am I doing wrong?

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

SQLALX SSC Eights! Points: 956 More actions · Answer 1

SQLALX

SSC Eights!

Points: 956

December 5, 2011 at 4:52 pm

#1417075

Year() and Month() function returns int data.

.....

WHERE NOT EXISTS

(SELECT empid

FROM Sales.Orders

WHERE YEAR (orderdate) = 2007 AND <==

MONTH (orderdate) = 02) <==

mrwillcostello SSC Eights! Points: 925 More actions · Answer 2

mrwillcostello

SSC Eights!

Points: 925

December 5, 2011 at 5:00 pm

#1417077

Thank you,

SELECT DISTINCT O.empid, HR.firstname, HR.lastname

FROM Sales.Orders AS O

JOIN HR.Employees AS HR

ON O.empid = HR.empid

WHERE NOT EXISTS

(SELECT DISTINCT empid

FROM Sales.Orders

WHERE YEAR (orderdate) = 2007 AND

MONTH (orderdate) = 02)

I still get no results

David Moutray SSCertifiable Points: 5704 More actions · Answer 3

Correct me if I am wrong, but this reads like a classroom problem. I'm happy to help you with it, but only if you promise to read through my entire answer and learn something, Junior! 🙂

You are misusing the WHERE NOT EXISTS clause. For one thing, it is not in a correlated sub-query. A correlated sub-query is a SELECT statement nested inside another T-SQL statement containing a reference to a column (or columns) in the outer query. The correlated (or inner) portion of the query is executed once for each record in the outer portion of the query.

The reason I say this must be a classroom problem is that correlated sub-queries are notoriously inefficient. They are very CPU-Intensive. Often, cursors will perform better than correlated sub-queries.

Let's take a look at how I would solve this problem at work, and then we'll answer the class question.

At work I would solve this problem with an Outer Join, like so:

WITH [SalesOrders]



AS

( SELECT   EmpID

  FROM	   Sales.Orders

  WHERE    OrderDate >= '2007-02-01'

    AND    OrderDate < '2007-03-01'

  GROUP BY EmpID

 )

SELECT     Employees.EmpID

          ,Employees.FirstName

          ,Employees.LastName

FROM       HR.Employees AS Employees

LEFT JOIN  SalesOrders

       ON  Employees.EmpID = SalesOrders.EmpID

WHERE SalesOrders.EmpID IS NULL;

This illustrates a classic problem in T-SQL programming - find the missing record. We want all of the employees who didn't sell something in the month of February, 2007. That is just the sort of question management would want to answer. So, you've got a list of sales orders for the month of February with the employee ID's and another list of all employees.

This is just the sort of thing that outer joins were designed for. Outer joins come in 3 flavors: RIGHT, LEFT & FULL. The LEFT JOIN that I have used here returns all of the records from the left set (Employees) and all matching records in the right set (Sales Orders). We want the records where there is not a match - that is the Sales Order record is missing (IS NULL).

Now, you will have noticed that I've made a few other changes to your query. Let me explain those. I took the sub-query out of the middle of the statement and put it at the top as a Common Table Expression (CTE). That is the WITH [SalesOrders] AS clause. I like to use CTE's because they are easier to read, but it works just like a sub-query.

You'll also note that I put a GROUP BY clause in the sub-query. We don't need a list of all of the sales that were made in February 2007. We just need a list of employees who made a sale in February. GROUP BY eliminates the duplicates. I also just return the fields I need (in this case Employee ID.) This makes for a more efficient query. SELECT DISTINCT is equivalent to saying GROUP BY every field in the SELECT list. DISTINCT is also a notoriously CPU-Intensive T-SQL construct. (At least it can be. I think it is best avoided, if possible.)

You will also note that I changed the way you are checking the order dates. First of all, your use of the YEAR and MONTH functions is wrong in that both of these functions return integers and you are comparing them to strings. This requires a completely unnecessary implicit type conversion.

More importantly, using functions in the WHERE clause like this renders it non-SARGable. That's an invented T-SQL word. (SARG stands for Search ARGument.) If there is an index on the Order Date field (and there probably would be) it can't be used in your original query, because you aren't searching on Order Date. You are searching on the results of a function. So, the query optimizer has no choice but to do a scan of the entire table, running these two functions on the Order Date column and then checking one by one to see if they match the values you have indicated.

My WHERE clause, on the other hand, compares the search criteria to the Order Date column itself, directly. Now the index can be used. Either a more efficient Index Seek will be used, or the Index Scan will be done against a limited range of the index. Either way, it is more efficient.

It is very important to remember when you are writing T-SQL code: write the code as if it will run against a table with 10,000,000 records, because one day it will be running against a table that large. You are probably developing against a table with a few hundred records. What works well there will not work well in production.

OK, now lets solve the class problem. This is the part you are looking for, but I hope you learned something from the previous discussion as well! 🙂

The problem is the same: find all of the employees who didn't sell something during February 2007. This time, however, we have to use a correlated sub-query (yuck). A correlated sub-query is a SELECT statement nested inside another T-SQL statement containing a reference to a column (or columns) in the outer query. As I stated previously, the correlated subquery will run once for each record selected by the outer query - making it hideously inefficient.

The original problem also directed us to use a WHERE NOT EXISTS clause. So, to restate the problem in these terms, we want to find all employees in the Employee table WHERE an employee ID does NOT EXIST in the Sales Order table for February 2007. So, now let's write this awful query as directed:

SELECT HR.EmpID



      ,HR.FirstName

      ,HR.LastName

FROM   HR.Employees AS HR

WHERE  NOT EXISTS ( SELECT O.EmpID

                    FROM   Sales.Orders AS O

                    WHERE  O.EmpID = HR.EmpID

                      AND  YEAR(O.OrderDate) = 2007

                      AND  MONTH(O.OrderDate) = 2

);

mrwillcostello SSC Eights! Points: 925 More actions · Answer 4

mrwillcostello

SSC Eights!

Points: 925

December 5, 2011 at 5:43 pm

#1417081

Thanks for your help. I figured it out

kauainurse Valued Member Points: 55 More actions · Answer 5

kauainurse

Valued Member

Points: 55

January 30, 2012 at 8:48 pm

#1440330

Thanks for a valuable lesson between JOINs and the uncorrelated subqueries. Its is very kind of you to take the time to thoroughly explain how it works and differs. I wish teachers would actually take the time to TEACH instead of just dump information on us newbies.

Aloha!

-V

Gail Shaw SSC Guru Points: 1004494 More actions · Answer 6

Late I know, but I didn't see this when it was posted

David Moutray (12/5/2011)

The reason I say this must be a classroom problem is that correlated sub-queries are notoriously inefficient. They are very CPU-Intensive. Often, cursors will perform better than correlated sub-queries.

That unfortunately is a myth, nothing more. The only time correlated subqueries are a performance problem is when the have a TOP 1 ... ORDER BY construct or when the correlation condition is an inequality. Otherwise they are very efficient, SQL can quite easily convert them into a join, semi-join or anti-semi-join and produce an optimal execution plan.

At work I would solve this problem with an Outer Join, like so:

The use of an outer join to find the missing rows is less efficient than using NOT EXISTS. Not much less efficient, but it is slightly less efficient (higher CPU, higher duration than a comparable NOT EXISTS)

http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/

http://sqlinthewild.co.za/index.php/2010/04/27/in-exists-and-join-a-roundup/

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Paul White SSC Guru Points: 150457 More actions · Answer 7

Expressing this sort of requirement using a logical (anti-) semi-join using EXISTS, INTERSECT, or EXCEPT seems more natural to me than introducing NULLs with an outer join, and then filtering them out. A common misconception is that logical semi-joins are always implemented using nested loops, this is not so:

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

TIMPARROTT Old Hand Points: 303 More actions · Answer 8

Thanks for the useful information.

I would have written it as below as I am just getting familiar with the "With" syntax"

SELECT

Employees.EmpID

,Employees.FirstName

,Employees.LastName

FROM HR.Employees AS Employees

LEFT OUTER JOIN

(SELECT

EmpID

FROM Sales.Orders

WHERE

OrderDate >= convert(DateTime,'20070201',112)

AND OrderDate < convert(DateTime,'20070301',112)

GROUP BY EmpID) as SalesOrders

ON

Employees.EmpID = SalesOrders.EmpID

WHERE SalesOrders.EmpID IS NULL;

Ignoring any typos, how does this compare to the suggested best solution?

Gail Shaw SSC Guru Points: 1004494 More actions · Answer 9

http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass