picking the "best" row out of a set

  • Does anyone have a better solution for this problem?

    In the table there are one or more rows per person (pid). Each row is an aspect of the pid with other columns (a, b, c) defining the aspect. Each row as unique identifier (i). I need a query to pick the "best" row for a particular person.

    1) The "best" row is the row with the lowest value for a.

    2) If multiple rows have the same value for a, pick the row with the highest value for b.

    3) If multiple rows have the same value for b, pick the row with the lowest value for c.

    4) If multiple rows have the same value for c, pick the row with the lowest value for i.

    The query below returns the correct result; but it is unwieldily. In the real world, there are four additional columns of attributes and they are a mix of int's, varchar's, and datetime's, and some are nullable. Does anyone have a different solution to this problem that may be eaiser to manage, understand, code, or with better performance?

    create table #tbl (pid int, a int, b int, c int, i int identity)

    insert #tbl (pid, a, b, c) values (1, 1, 1, 1)

    insert #tbl (pid, a, b, c) values (1, 1, 2, 1)

    insert #tbl (pid, a, b, c) values (1, 1, 2, 2)

    insert #tbl (pid, a, b, c) values (2, 2, 2, 2)

    insert #tbl (pid, a, b, c) values (3, 1, 2, 1)

    insert #tbl (pid, a, b, c) values (3, 2, 2, 1)

    insert #tbl (pid, a, b, c) values (3, 2, 2, 2)

    select *

    from #tbl best

    where (pid = 1)

    and not exists(

       select *

       from #tbl bad

       where

          (bad.pid = best.pid)

          and (bad.i <> best.i)

          and (

             (bad.a < best.a)
             or (
                (bad.a = best.a)
                and (
                   (bad.b > best.b)

                   or (

                      (bad.b = best.b)

                      and (

                         (bad.c < best.c)
                         or (
                            (bad.c = best.c)
                            and (bad.i < best.i)
                            )
                         )
                      )
                   )
                )
             )
       )

    pid a b c i
    --- - - - -
    1   1 2 1 2

    pid a b c i
    --- - - - -
    2   2 2 2 4

    pid a b c i
    --- - - - -
    3   1 2 1 5

  • Select top 1 *

    From #tbl

    Where pid = 1

    Order By a asc, b desc, c asc, i asc

    pid         a           b           c           i          

    ----------- ----------- ----------- ----------- -----------

    1           1           2           1           2

    pid         a           b           c           i          

    ----------- ----------- ----------- ----------- -----------

    2           2           2           2           4

    pid         a           b           c           i          

    ----------- ----------- ----------- ----------- -----------

    3           1           2           1           5

     

  • Worth noting that you can get a list of everyone's best rows with:

    select * from #tbl t

    where t.i =

    (

    select top 1 t1.i FROM #tbl t1

    where t1.pid = t.pid

    order by a asc, b desc, c desc, i asc

    )

    Rob Farley
    LobsterPot Solutions & Adelaide SQL Server User Group
    Company: http://www.lobsterpot.com.au
    Blog: http://blogs.lobsterpot.com.au

  • Rob's answer is the best with one exception, you will need to make the sort for "c" to be ascending instead of descending.

    Joe, I'm curious why you said you never use an IDENTITY column in an RDBMS?  They do have their uses.

    hawg

    ----------------------------------------------------------

    01010011010100010100110000100000010100110110010101110010011101100110010101110010
    001000000101001001101111011000110110101101110011

  • If I can speak for Joe I think his school of thought is that you should be able to define a PK from a set of column(s) that are naturally unique otherwise you have not done you job in designing the table.

  • I will agree that it is best to build your database where there is a logical PK but I have had multiple instances where I simply needed a unique ID for my records to use as a primary key when there is no good option and an IDENTITY field provides what I need.

    For example, I have built a system to help with my SQL administration and one of the things I do is list all of the SQL Servers in a table.  I don't want to use the SQL Server name as the key because I prefer to use an integer as my key in this system so I use an IDENTITY field to automatically create a server ID that I then use for all my foreign key constraints throughout the system.

    ----------------------------------------------------------

    01010011010100010100110000100000010100110110010101110010011101100110010101110010
    001000000101001001101111011000110110101101110011

  • Calm down Joe before you blow a gasket. 

    I've never claimed to be a complete SQL expert and this was just a quick example.  Regardless of what the definition is I still stand by my statement that IDENTITY [columns] have their place in an RDBMS.

    hawg

    ----------------------------------------------------------

    01010011010100010100110000100000010100110110010101110010011101100110010101110010
    001000000101001001101111011000110110101101110011

  • Back to the original question...

    Column a is more important than b, and b is more important than c. A row with min(a) and max(b) and min(c) may not exist. Unfortunately, my sample data did not reflect that. Joe Celko's solution will not work.

    This additional sample data illustrates the issue.

    insert #tbl (pid, a, b, c) values (4, 1, 2, 1)

    insert #tbl (pid, a, b, c) values (4, 2, 3, 1)

  • Rob's answer still works only with changing the "C" sort from desc to asc like the following:

    select * from #tbl t

    where t.i =

    (

    select top 1 t1.i FROM #tbl t1

    where t1.pid = t.pid

    order by a asc, b desc, c asc, i asc

    )

     

    ----------------------------------------------------------

    01010011010100010100110000100000010100110110010101110010011101100110010101110010
    001000000101001001101111011000110110101101110011

  • Yes, (4, 1, 2, 1) is the row the query should return. However, your query does not return any row for pid=4.

  • If the values for a, b, and c are all positive numbers under ten:

    If it were only so simple!

  • Using TOP with ORDER BY is a great idea. So simple.

    However, I simplified my situation a little too much. There are actually other columns that govern the sort order - different orders for past, current, and future rows. I'm trying to figure a way to use Rob's solution, anyway. I know expressions including CASE may be used in an ORDER BY clause; maybe that's the key. I will create a new thread if I can adequately describe the situation. Thanks all.

  • Sorry - it was just a typo, putting 'desc' instead of 'asc'.

    If you can describe what your actual problem is, it should be fairly easy to get a solution using similar technique.

    As for the 'i' field... just use whatever is appropriate for the key. Of course, you might have a bit of fun if you have a multi-column key.

    RobF

    Rob Farley
    LobsterPot Solutions & Adelaide SQL Server User Group
    Company: http://www.lobsterpot.com.au
    Blog: http://blogs.lobsterpot.com.au

  • Just a thought: TOP is a reserved word in T-SQL and is not part of ANSI-99-SQL. If for some unexplained and undoubtedly ridiculous reason your company changes databases or you work in a   multiple db environment SELECT statements using TOP will break.

    The issue of portability should always be considered if only to reach the conclusion that it is not an issue because you company will never change no matter what happens.

    Mike

  • Point well taken.  But what does that mean when it comes to using stored procedures?  Do we not use them?

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply