Conditional Group by?

Question

Post reply

Conditional Group by?

Robert Klimes

SSCoach

Points: 17865
More actions
November 12, 2009 at 9:36 am

#135240

I am looking for a way to exclude rows that have a null value in a column if the are other rows that have a value in that column for the same name. if there are only nulls in that column i would like it returned with the nulls.

create table #tmp(name varchar(10),
email varchar(50), code varchar(5), location varchar(50)) insert into #tmp(name,email,code,location) values('bob','bob@bob.com','ABCDE','USA') insert into #tmp(name,email,code) values('bob','bob@bob.com','ZYXWV') insert into #tmp(name,email) values('joe','joe@bob.com') insert into #tmp(name,email,code,location) values('joe','joe@bob.com','ABCDE','USA') insert into #tmp(name,email) values('dan','dan@bob.com') select name,email,ISNULL(code,'') AS code,MAX(isnull(location,'')) as location from #tmp
group by name,email,ISNULL(code,'')

desired result

name email code location bob bob@bob.com ABCDE USA bob bob@bob.com ZYXWV dan dan@bob.com joe joe@bob.com ABCDE USA

I am looking for something that would theoretically be accomplished by the following code, although this code does not work.

select name, email, case when code is null then (select isnull(code,'') from #tmp b where code IS not null and email = a.email) else code end as code, MAX(isnull(location,'')) as location from #tmp a group by name,email,case when code is null then (select isnull(code,'') from #tmp b where code IS not null and email = a.email) else code end

Bob
-----------------------------------------------------------------------------
How to post to get the best help[/url]

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Supriya.S Ten Centuries Points: 1131 More actions · Answer 1

Supriya.S

Ten Centuries

Points: 1131

November 12, 2009 at 10:10 am

#1078462

How about something like this:

SELECT * FROM #tmp A



WHERE code IS NOT NULL

         OR location IS NOT NULL

         OR EXISTS (SELECT COUNT(*) FROM #tmp B WHERE A.name=B.name AND A.email=B.email GROUP BY name,email HAVING COUNT(*)=1)

HTH,

Supriya

Lamprey13 SSCertifiable Points: 6190 More actions · Answer 2

Here is one other way:SELECT



	name,

	email,

	code,

	location

FROM

	(

		select

			name,

			email,

			ISNULL(code,'') AS code,

			MAX(isnull(location,'')) as location,

			ROW_NUMBER() OVER

				(

					PARTITION BY

						Name,

						Email

					ORDER BY

						CASE WHEN code IS NULL THEN 1 ELSE 0 END,

						code

				) AS RowNum

		from

			#tmp

		group by

			name,

			email,

			code

	) AS T

WHERE

	Code <> ''

OR (RowNum = 1 AND Code = '')

Robert Klimes SSCoach Points: 17865 More actions · Answer 3

Robert Klimes

SSCoach

Points: 17865

November 12, 2009 at 12:50 pm

#1078551

Thank for the replies.

Supriya, I was able to use your example to get the results I needed.

Bob
-----------------------------------------------------------------------------
How to post to get the best help[/url]

Supriya.S Ten Centuries Points: 1131 More actions · Answer 4

Supriya.S

Ten Centuries

Points: 1131

November 12, 2009 at 1:09 pm

#1078567

@Lamprey13: Nice trick using Row_Number(); going in my file of "SQL tips and tricks" 🙂 It had one issue though, your script ignores the row say when code is NULL but the location is not NULL. How do you take care of that?

@bob-2: glad it worked out for you.

Thanks,

Supriya

Lamprey13 SSCertifiable Points: 6190 More actions · Answer 5

Supriya.S (11/12/2009)

@Lamprey13: Nice trick using Row_Number(); going in my file of "SQL tips and tricks" 🙂 It had one issue though, your script ignores the row say when code is NULL but the location is not NULL. How do you take care of that?

I just was messing around with a query to get the expected results based on the sample data. At the time I was looking at the expected output no one had come up with a solution. While, my code produces the expected output, it probably doesn't cover everything. If I ahve some time I'll see if I can tweak it.

LutzM SSC Guru Points: 107049 More actions · Answer 6

Here's another solution based on ROW_NUMBER.

It covers the issue of having code is null and location is not null.

Comparing the two row_number solutions will result in almost identical execution plans.

The "Exists approach" does look a lot faster when just looking at the execution plan but when comparing statistics, it will be behind the two row_number solutions, for IO, CPU and execution time. (should be verified using larger data volume though...)

; WITH cte AS



(

	SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,

		name,

		email,

		code,

		location

	FROM #tmp

)

SELECT name, email, ISNULL(code,'') AS code, isnull(location,'') AS location

FROM cte

WHERE row = 1 OR code IS NOT NULL OR location IS NOT NULL

ORDER BY name,code

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Lamprey13 SSCertifiable Points: 6190 More actions · Answer 7

Lamprey13

SSCertifiable

Points: 6190

November 12, 2009 at 2:28 pm

#1078613

If you change my WHERE cluase to:WHERE

RowNum = 1 OR Code <> '' OR Location <> ''I think that'll work.

lmu92's solution and mine differ slightly because of there the GROUPing happens. So, the method one would choose would depend on the actual results desired...

Cheers!

Supriya.S Ten Centuries Points: 1131 More actions · Answer 8

lmu92 (11/12/2009)

Here's another solution based on ROW_NUMBER.

It covers the issue of having code is null and location is not null.

Comparing the two row_number solutions will result in almost identical execution plans.

The "Exists approach" does look a lot faster when just looking at the execution plan but when comparing statistics, it will be behind the two row_number solutions, for IO, CPU and execution time. (should be verified using larger data volume though...)

; WITH cte AS
( SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row, name, email, code, location FROM #tmp ) SELECT name, email, ISNULL(code,'') AS code, isnull(location,'') AS location FROM cte WHERE row = 1 OR code IS NOT NULL OR location IS NOT NULL
ORDER BY name,code

Lutz, thanks for posting the query. That goes into my "SQL tips and tricks" file too. :-). Anyway I was wondering if you can please explain why my "Exists approach" will be behind in terms of IO, CPU and execution time. Will adding a non-clustered index on [name] and help? Comparing all three queries I saw the query cost of the exists approach is almost 6% while the other two came at 47%. Even checked the Subtree, CPU and IO costs on the execution plan and all are less for the exists approach. What am I missing here?

Sorry, if I am asking too many questions. 😀

Thanks,

Supriya

LutzM SSC Guru Points: 107049 More actions · Answer 9

Supriya.S (11/12/2009)

Anyway I was wondering if you can please explain why my "Exists approach" will be behind in terms of IO, CPU and execution time. Will adding a non-clustered index on [name] and help? Comparing all three queries I saw the query cost of the exists approach is almost 6% while the other two came at 47%. Even checked the Subtree, CPU and IO costs on the execution plan and all are less for the exists approach. What am I missing here?

Like I said: The "Exists approach" does look a lot faster when just looking at the execution plan. So we both figured the same...

After that I created the following "test scenario":

create and fill sample data

SET STATISTICS TIME ON

print 'row_number cte'

my query

print 'Exists approach'

your query

print 'row_number sub query'

Lamprey13' query

SET STATISTICS TIME OFF

The results I got showed that the 'Exists' query took longer than each

Row_number query and had more IO's and CPU time.

Since we're not talking about a large number of rows I recommended testing it against a larger data volume and see which one of the three would perform best.

Will adding a non-clustered index on [name] and help?

Seems like your query would benefit most from this index due to the join (a table scan on one side and an index seek for #tmp B. The row_number solutions probably still will perform a table scan since code and location are not part of the index (would require bookmark lookup which most probably would be too expensive).

Again, testing against a larger data volume would help eliminate my guessing (I'm by far no expert on performance tuning compared to some of the folks around here...).

Edit:

For testing purposes you probably should change my ROW_Number statement from

SELECT row_number() OVER (partition BY name ORDER BY code desc, location desc) AS row,

to



SELECT row_number() OVER (partition BY name,email ORDER BY code desc, location desc) AS row,

in order to use the same assumptions like both of you did...

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Supriya.S Ten Centuries Points: 1131 More actions · Answer 10

Supriya.S

Ten Centuries

Points: 1131

November 13, 2009 at 1:58 pm

#1079265

Thanks for taking the time out to explain Lutz, appreciate it.

Thanks,

Supriya