Efficiency of IN vs. other ways to filter rows

  • Knights,

    I just wonder whether I should use IN or other better ways to filter rows based on a constraint set.

    I have tables and their contents look like:

    declare @YearStart varchar(50), @companyID int, @SYTermID int

    set @companyID = 1370

    set @SYTermID = 8

    set @YearStart = dbo.fn_GetSchoolStartYearOfATerm (@companyID, @SYTermID)

    --1/ @YearStart for selected year

    select @YearStart as '@YearStart'

    @YearStart

    --------------------------------------------------

    2005

    (1 row(s) affected)

    --2/ table 'Term' and select only termID belonging to @YearStart

    declare @tt table (companyID int, termID int, description varchar(50), StartDate datetime, EndDate datetime)

    insert @tt

    select companyID, termID, description, StartDate,EndDate

    from term where companyID=@companyID

    and dbo.fn_GetSchoolStartYearOfATerm (@companyID, termID)=@YearStart

    --> 2A/ check @tt

    select * from @tt

    companyID termID description StartDate EndDate

    ----------- ----------- -------------------------------------------------- ------------------------------------------------------ ------------------------------------------------------

    1370 8 2005 - 2006 2005-08-26 00:00:00.000 2006-08-15 00:00:00.000

    1370 9 Sem 1 05/06 2005-08-26 00:00:00.000 2006-01-14 00:00:00.000

    1370 11 Q1 05-06 2005-08-26 00:00:00.000 2005-10-21 00:00:00.000

    1370 12 Q2 05-06 2005-10-24 00:00:00.000 2006-01-14 00:00:00.000

    1370 13 Q3 05-06 2006-01-16 00:00:00.000 2006-03-17 00:00:00.000

    1370 14 Sem 2 05/06 2006-01-16 00:00:00.000 2006-08-15 00:00:00.000

    1370 15 Q4 05-06 2006-03-27 00:00:00.000 2006-08-15 00:00:00.000

    1370 26 Progress Q1 2005-08-26 00:00:00.000 2005-10-21 00:00:00.000

    (8 row(s) affected)

    --> 2B/ and termIDs belonging to @YearStart of '2005'

    select companyID, termID from @tt

    companyID termID

    ----------- -----------

    1370 8

    1370 9

    1370 11

    1370 12

    1370 13

    1370 14

    1370 15

    1370 26

    (8 row(s) affected)

    -- 3/ table 'Class' in general (not filetered by termIDs of @tt yet)

    select * from Class where companyID=@companyID

    companyID classID termID CourseID

    ----------- ----------- ----------- -----------

    1370 14 1 25

    1370 15 1 26

    1370 16 1 27

    1370 17 1 18

    1370 18 1 19

    1370 19 1 20

    1370 45 1 17

    1370 46 1 29

    1370 272 8 60

    1370 47 1 18

    1370 48 5 22

    1370 49 2 21

    1370 50 2 38

    1370 275 8 50

    1370 276 9 57

    1370 277 14 107

    1370 278 9 54

    -- 4/ table 'Class' and select only class belonging to entried of table @tt.

    -- The results I want as follows using IN

    select companyID, classID, TermID, CourseID from Class

    where companyID=@companyID and termID=@SYTermID

    and termID IN (select termID from @tt)

    1370 272 8 60

    1370 275 8 50

    1370 276 9 57

    1370 277 14 107

    1370 278 9 54

    My questions:

    A/ I should use IN or some other BETTER ways for query? runing time? Advantages?...

    B/ Any ideas about the drawbacks of IN?

    C/ In which situations and scenarios, IN should be used in case other ways cannot be applied?

    Thanks so much for any ideas, discussions.

  • Just my opinion, based on experience in our environment:

    - IN is way to go if you want to name a few values : WHERE user.login IN ('tom','bob','mike')

    - IN is absolutely horrible solution in cases where the number of values goes into several hundreds or thousands, if you list the values explicitly (not as SELECT value FROM table).. this can happen if the code is not in SQL, but part of application and the final construct containing several hundred/thousand values is then sent to SQL server. Terrible performance, and I wouldn't want to maintain and troubleshoot such code.

    - about other situations - better test it in your environment, but any time you can find another working and well performing solution based on JOIN to a table, I would prefer it over IN. However, in some cases we found that IN performs quite well if the values are supplied as a SELECT statement. Also be aware that if you are writing a code that will be used by several clients in different environment, some of them can have problems while other are OK... in my opinion, JOIN is much safer.

    /*EDIT*/

    As to your example, I would write the query like this:

    SELECT Class.companyID, Class.classID, Class.TermID, Class.CourseID 

    FROM Class

    JOIN @tt ON @tt.termID = Class.TermID

    WHERE companyID=@companyID AND termID=@SYTermID

  • Sometimes an EXISTS clause can give the desired results.

    Let us suppose that we have a sales territory broken down into

    Country - > Region - > Area.

    The marketing dept are forever buggering around with realigning the various levels of the hierarchy but the application must not list a region that does not have areas.

    You may select Regions WHERE IN (Select RegionID FROM dbo.Area AS A WHERE A.RegionID = R.RegionID)

    The quicker method would be

    WHERE EXISTS (Select RegionID FROM dbo.Area AS A WHERE A.RegionID = R.RegionID)

  • Knights,

    I applied the 3 ways to do the query, in order, IN, EXISTS, INNER JOIN. They give me the same DTS results

    -- 4/A table 'Class' and select only class belonging to entried of table @tt

    -- IN

    select companyID, classID, TermID, CourseID from Class

    where companyID=@companyID

    and termID IN (select termID from @tt)

    -- 4B/ using Exists

    -- Exists

    select companyID, classID, TermID, CourseID from Class cl

    where cl.companyID=@companyID

    and exists (select termID from @tt where termID=cl.termID)

    -- 4C/ using inner join

    -- INNER JOIN

    select cl.companyID, cl.classID, t.TermID, cl.CourseID from Class cl

    inner join @tt t on t.termID=cl.termID

    where cl.companyID=@companyID

    But when I refer to the Execution Plan, I get the overheads for the 3 method, in order:

    Query 9: Query Cost (relative the bath): 19.88 %

    Query 10: Query Cost (relative the bath): 19.88 %

    Query 10: Query Cost (relative the bath): 20.42 %

    Any ideas, dicussions?

    Thanks in advance.

  • Couple of questions & points:

    Use DBCC DropCleanBuffers before each test so that you're running with a 'cold' SQL data cache each time. You want perf differences to be related to the queries, not dependent on which tables are already read from disk and cached.

    Is this the entire contents of a DTS exec SQL task ? If so, is @tt used for anything else ? If not, just make it a derived table and join to it.

    Are the data volumes you've posted typical and representative ? i.e. will @tt typically have a small number of rows like 8, and not orders of magnitude more ? If yes, I don't think you need to worry about the details of how you express this. The differences between EXISTS, IN or JOIN'ing won't really become apparent until you have significantly more data than this.

     

  • The Exists clause is a lot faster than the IN clause for a very simple reason. With an IN clause, SQL Server gets the entire dataset returned by the query and processes it for matches. With an Exists clause, it processes the matches as part of the query and doesn't need to consider the resultset as whole individual. The same is true with an Inner Join.

    Inner Join can be as fast or nearly as fast or nearly as fast as an Exists clause or sometimes even faster. Ocasionally, an Inner Join can be a lot slower because there are multiple matches for each record and the resulting recordset is a lot larger than it would be with an Exists clause. Also, if you are merely  doing "Select *" the row size will be larger with an Inner Join and this will slow down the actual returning of data.

    Your Exists clause can be sped up even more. When doing a subquery in an Exists clause, use Select 1 instead of Select * or Select FieldName. This is faster as SQL Server recognizes that it doesn't need to send back any data, just a constant value.

    select companyID, classID, TermID, CourseID from Class cl

    where cl.companyID=@companyID

    and exists (select 1 from @tt where termID=cl.termID)


    My blog: SQL Soldier[/url]
    SQL Server Best Practices:
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring[/url]
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management[/url]

  • >> When doing a subquery in an Exists clause, use Select 1 instead of Select * or Select FieldName. This is faster as SQL Server recognizes that it doesn't need to send back any data, just a constant value.

    That was true in version 6.5 and earlier, due to a bug/feature of the optimizer as inherited from Sybase.

    It hasn't made a difference for quite a while, the query plan will be exactly the same. The optimizer now recognizes that what is being selected is irrelevant to determining existence.

     

  • I don't know if it is still true in SQL 2005 or not, but it is still true in SQL 2000.


    My blog: SQL Soldier[/url]
    SQL Server Best Practices:
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring[/url]
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management[/url]

  • *shrug*

    Using the Pubs database on SQL2K, all 3 of these have an identical query plan:

    select *

    from authors as a1

    where exists (select * from authors as a2

                  where a1.au_lname = a2.au_lname

                  and a1.au_id <> a2.au_id)

    select *

    from authors as a1

    where exists (select 1 from authors as a2

                  where a1.au_lname = a2.au_lname

                  and a1.au_id <> a2.au_id)

    select *

    from authors as a1

    where exists (select au_lname from authors as a2

                  where a1.au_lname = a2.au_lname

                  and a1.au_id <> a2.au_id)

    I stopped using the "Select 1" trick years ago because it's no longer necessary.

    [Edit] Note, the key point is that in all cases, an index seek is used on a2. We don't care what the data is, just whether 1 row exists.

    The old optimizer bug that was fixed in ver 7.0, was that the query plan would change with "Select *" or "Select ColName", and sometimes the better performing index seek would be abandoned in exchange for a different plan to get data pages, pages which were subsequently discarded because the subquery doesn't care about data, just true/false.

     

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply