Conditional Group by?

  • I am looking for a way to exclude rows that have a null value in a column if the are other rows that have a value in that column for the same name. if there are only nulls in that column i would like it returned with the nulls.

    create table #tmp(name varchar(10),

    email varchar(50),

    code varchar(5),

    location varchar(50))

    insert into #tmp(name,email,code,location)

    values('bob','bob@bob.com','ABCDE','USA')

    insert into #tmp(name,email,code)

    values('bob','bob@bob.com','ZYXWV')

    insert into #tmp(name,email)

    values('joe','joe@bob.com')

    insert into #tmp(name,email,code,location)

    values('joe','joe@bob.com','ABCDE','USA')

    insert into #tmp(name,email)

    values('dan','dan@bob.com')

    select name,email,ISNULL(code,'') AS code,MAX(isnull(location,'')) as location from #tmp

    group by name,email,ISNULL(code,'')

    desired result

    name email code location

    bob bob@bob.com ABCDE USA

    bob bob@bob.com ZYXWV

    dan dan@bob.com

    joe joe@bob.com ABCDE USA

    I am looking for something that would theoretically be accomplished by the following code, although this code does not work.

    select name,

    email,

    case

    when code is null then (select isnull(code,'') from #tmp b where code IS not null and email = a.email)

    else code

    end as code,

    MAX(isnull(location,'')) as location

    from #tmp a

    group by name,email,case

    when code is null then (select isnull(code,'') from #tmp b where code IS not null and email = a.email)

    else code

    end

    Bob
    -----------------------------------------------------------------------------
    How to post to get the best help[/url]

  • How about something like this:

    SELECT * FROM #tmp A

    WHERE code IS NOT NULL

    OR location IS NOT NULL

    OR EXISTS (SELECT COUNT(*) FROM #tmp B WHERE A.name=B.name AND A.email=B.email GROUP BY name,email HAVING COUNT(*)=1)

    HTH,

    Supriya

  • Here is one other way:SELECT

    name,

    email,

    code,

    location

    FROM

    (

    select

    name,

    email,

    ISNULL(code,'') AS code,

    MAX(isnull(location,'')) as location,

    ROW_NUMBER() OVER

    (

    PARTITION BY

    Name,

    Email

    ORDER BY

    CASE WHEN code IS NULL THEN 1 ELSE 0 END,

    code

    ) AS RowNum

    from

    #tmp

    group by

    name,

    email,

    code

    ) AS T

    WHERE

    Code <> ''

    OR (RowNum = 1 AND Code = '')

  • Thank for the replies.

    Supriya, I was able to use your example to get the results I needed.

    Bob
    -----------------------------------------------------------------------------
    How to post to get the best help[/url]

  • @Lamprey13: Nice trick using Row_Number(); going in my file of "SQL tips and tricks" 🙂 It had one issue though, your script ignores the row say when code is NULL but the location is not NULL. How do you take care of that?

    @bob-2: glad it worked out for you.

    Thanks,

    Supriya

  • Supriya.S (11/12/2009)


    @Lamprey13: Nice trick using Row_Number(); going in my file of "SQL tips and tricks" 🙂 It had one issue though, your script ignores the row say when code is NULL but the location is not NULL. How do you take care of that?

    I just was messing around with a query to get the expected results based on the sample data. At the time I was looking at the expected output no one had come up with a solution. While, my code produces the expected output, it probably doesn't cover everything. If I ahve some time I'll see if I can tweak it.

  • Here's another solution based on ROW_NUMBER.

    It covers the issue of having code is null and location is not null.

    Comparing the two row_number solutions will result in almost identical execution plans.

    The "Exists approach" does look a lot faster when just looking at the execution plan but when comparing statistics, it will be behind the two row_number solutions, for IO, CPU and execution time. (should be verified using larger data volume though...)

    ; WITH cte AS

    (

    SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,

    name,

    email,

    code,

    location

    FROM #tmp

    )

    SELECT name, email, ISNULL(code,'') AS code, isnull(location,'') AS location

    FROM cte

    WHERE row = 1 OR code IS NOT NULL OR location IS NOT NULL

    ORDER BY name,code



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • If you change my WHERE cluase to:WHERE

    RowNum = 1 OR Code <> '' OR Location <> ''I think that'll work.

    lmu92's solution and mine differ slightly because of there the GROUPing happens. So, the method one would choose would depend on the actual results desired...

    Cheers!

  • lmu92 (11/12/2009)


    Here's another solution based on ROW_NUMBER.

    It covers the issue of having code is null and location is not null.

    Comparing the two row_number solutions will result in almost identical execution plans.

    The "Exists approach" does look a lot faster when just looking at the execution plan but when comparing statistics, it will be behind the two row_number solutions, for IO, CPU and execution time. (should be verified using larger data volume though...)

    ; WITH cte AS

    (

    SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,

    name,

    email,

    code,

    location

    FROM #tmp

    )

    SELECT name, email, ISNULL(code,'') AS code, isnull(location,'') AS location

    FROM cte

    WHERE row = 1 OR code IS NOT NULL OR location IS NOT NULL

    ORDER BY name,code

    Lutz, thanks for posting the query. That goes into my "SQL tips and tricks" file too. :-). Anyway I was wondering if you can please explain why my "Exists approach" will be behind in terms of IO, CPU and execution time. Will adding a non-clustered index on [name] and help? Comparing all three queries I saw the query cost of the exists approach is almost 6% while the other two came at 47%. Even checked the Subtree, CPU and IO costs on the execution plan and all are less for the exists approach. What am I missing here?

    Sorry, if I am asking too many questions. 😀

    Thanks,

    Supriya

  • Supriya.S (11/12/2009)


    Anyway I was wondering if you can please explain why my "Exists approach" will be behind in terms of IO, CPU and execution time. Will adding a non-clustered index on [name] and help? Comparing all three queries I saw the query cost of the exists approach is almost 6% while the other two came at 47%. Even checked the Subtree, CPU and IO costs on the execution plan and all are less for the exists approach. What am I missing here?

    Like I said: The "Exists approach" does look a lot faster when just looking at the execution plan. So we both figured the same...

    After that I created the following "test scenario":

    create and fill sample data

    SET STATISTICS TIME ON

    SET STATISTICS TIME ON

    print 'row_number cte'

    my query

    print 'Exists approach'

    your query

    print 'row_number sub query'

    Lamprey13' query

    SET STATISTICS TIME OFF

    SET STATISTICS TIME OFF

    The results I got showed that the 'Exists' query took longer than each

    Row_number query and had more IO's and CPU time.

    Since we're not talking about a large number of rows I recommended testing it against a larger data volume and see which one of the three would perform best.

    Will adding a non-clustered index on [name] and help?

    Seems like your query would benefit most from this index due to the join (a table scan on one side and an index seek for #tmp B. The row_number solutions probably still will perform a table scan since code and location are not part of the index (would require bookmark lookup which most probably would be too expensive).

    Again, testing against a larger data volume would help eliminate my guessing (I'm by far no expert on performance tuning compared to some of the folks around here...).

    Edit:

    For testing purposes you probably should change my ROW_Number statement from

    SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,

    to

    SELECT row_number() OVER (partition BY name,email ORDER BY code desc, location desc) AS row,

    in order to use the same assumptions like both of you did...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Thanks for taking the time out to explain Lutz, appreciate it.

    Thanks,

    Supriya

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply