Just For Fun: An Impossible Delete

Question

Post reply

Just For Fun: An Impossible Delete

Viewing 15 posts - 76 through 90 (of 156 total)

You must be logged in to reply to this topic. Login to reply

Andrew in WV SSCommitted Points: 1655 More actions · Answer 1

Andrew in WV

SSCommitted

Points: 1655

August 6, 2008 at 10:26 am

#853979

You weren't the only one to make either of those mistakes. I see several contributors stuck in the same tar pits over the last 7 pages of comments.

Andrew

--Andrew

john.arnott SSChampion Points: 11882 More actions · Answer 2

RBarry,

Thanks for a great article and wonderful thread of comments. Others have said it, but I'll gladly reinforce the idea that the techniques and generalized information are the point here, and that goal has been more than met. Yes, the use of a tally table pushes up next to the constraint of no temp table, which may be read as no supplemental table, but so what? And the use of a RBAR solution that grinds and grinds forever in a multi-GB table would maybe still be OK in the 70's because in 1970 who knew what a megabyte was, never mind a gigabyte?

Speaking of memory constraints, this reminds me of a few years ago hearing one of the old-timers who in the 1950's helped design the MICR (Magnetic Ink Character Recognition) cheque encoding system at Stanford Research Institute and Bank of America. He said they had to take great care to be efficient in their hand-built machine code so everything would fit into 8K. Oh, and the "K" was for KiloBits, not bytes.

RBarryYoung SSC Guru Points: 143329 More actions · Answer 3

RBarryYoung

SSC Guru

Points: 143329

August 6, 2008 at 1:04 pm

#854079

Thanks for the feedback, John.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

john.arnott SSChampion Points: 11882 More actions · Answer 4

OK. So I couldn't let go and just had to try to come up with a query to do this. [p] To restate the problem, we want a list of the second and subsequent dup's for each row. All others, including singlets would be deleted. The overall process would be pretty much as Barry proposed:

--Use the given table as temp-in-place to hold the desired final result set, with each row somehow distinguished from the original data. I chose to make the age negative, ignoring the possibility of age=0.

--Remove all of the original data. (I don't show that code here)

--Flip the distinguishing factor (the sign of "age") so that the result set has the characteristics of the original data. (I don't show that code here either)

[/p]

[p]Of course, the core question is in that first step how to create a list of the duplicates, not including the "original". If we have a Tally table, then this code will do the trick by allowing the db engine to hold the intermediate result internally as a subquery rather than in un-used bits in the original table. (I used the create/insert code from another post above -- this thread is a collaboration, right?):[/p]



create table #tab (lname varchar(5), age smallint, sex char(1))

insert into #tab (lname, age, sex) select 'ABC', 24, 'M'

insert into #tab (lname, age, sex) select 'ABC', 24, 'M'

insert into #tab (lname, age, sex) select 'LMN', 27, 'M'

insert into #tab (lname, age, sex) select 'LMN', 27, 'M'

insert into #tab (lname, age, sex) select 'LMN', 27, 'M'

insert into #tab (lname, age, sex) select 'PQRS', 25, 'F'

insert into #tab (lname, age, sex) select 'XYZ', 24, 'M'

insert into #tab (lname, age, sex) select 'XYZ', 25, 'M'

select  lname

         ,age * -1 as negAge

         ,sex

from

(select T.lname

	,T.age

	,T.sex

	,COUNT(*) as Cnt

  from #TAB T

 group by T.lname

	,T.Age

	,T.sex

 having count(*) > 1) as CountList

join TALLY

on CountList.Cnt -1 >= Tally.SerialNumber

Now I'll confess this is the first time I've tried using the Tally table method for anything, so just made one in my sandbox db, but without an index of any sort, so I suspect a 3,000,000 row table scan is why this query seems to take 8 seconds no matter how many rows I've poured into the #TAB. And one more thing: this was a wonderful learning experience. I wasn't trying so much to improve on anyone's solution, but found that just trying to find an alternative approach helped me understand the concept of a tally table. The hour or so spent was well worth it just for that.

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 5

john.arnott (8/6/2008)

I wasn't trying so much to improve on anyone's solution, but found that just trying to find an alternative approach helped me understand the concept of a tally table. The hour or so spent was well worth it just for that.

Gotta love it when some unitended good comes about. 🙂

I don't know if you need more on the Tally table, but if you do, I try to explain a bit in the following article. And, nice job using the Tally table for this solution!

http://qa.sqlservercentral.com/articles/TSQL/62867/

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Marek Skotnica Newbie Points: 3 More actions · Answer 6

Marek Skotnica

Newbie

Points: 3

August 7, 2008 at 4:44 am

#854346

Hallo, to the original joke, why not to return desired result directly like this?

-- insert into SOURCE

select [Name], Age, case(Sex) when 'M' then 'm' else 'f' end

from (

select * , Count(*) as cnt from SOURCE

Group By Name, Age, Sex

having 1 < Count(*)

) x

join Tally

on cnt > Tally.iValue

-- Marek

RBarryYoung SSC Guru Points: 143329 More actions · Answer 7

Marek Skotnica (8/7/2008)

Hallo, to the original joke, why not to return desired result directly like this?

-- insert into SOURCE

select [Name], Age, case(Sex) when 'M' then 'm' else 'f' end

from (

select * , Count(*) as cnt from SOURCE

Group By Name, Age, Sex

having 1 < Count(*)

) x

join Tally

on cnt > Tally.iValue

Because that was not the problem. The problem was to change the existing table's contents, which is considerably harder than just returning a dataset.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Andy DBA SSCommitted Points: 1776 More actions · Answer 8

john.arnott (8/6/2008)

Now I'll confess this is the first time I've tried using the Tally table method for anything, so just made one in my sandbox db, but without an index of any sort, so I suspect a 3,000,000 row table scan is why this query seems to take 8 seconds no matter how many rows I've poured into the #TAB.

John,

In the original article, RBarry used a filter in the select:

Where Tally.Number Between 2 and 127

Adding that and putting a clustered index on your tally table might improve your response time. (Next chance I get, I'm going to re-read Jeff's article on tally tables.) 🙂

RBarry,

I think Marek was just suggesting some minor tweaks to John's subselect:

1) Changing case on gender instead of flipping the sign of age, and

2) joining on count > tally.number instead of count-1 >= tally.number.

I'll say it again: Great article!

I'll bet you never thought it would generate so much discussion!

RBarryYoung SSC Guru Points: 143329 More actions · Answer 9

Andy DBA (8/7/2008)

RBarry,

...

I'll say it again: Great article!

I'll bet you never thought it would generate so much discussion!

Thanks, Andy. And yeah, I thought that hardly anyone would even be interested in this, let alone want to discuss it.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

john.arnott SSChampion Points: 11882 More actions · Answer 10

john.arnott

SSChampion

Points: 11882

August 14, 2008 at 11:36 am

#858272

Andy DBA,

Adding the clustered index to my TALLY table did the trick. As I suspected it would, the query returns instantly now without bothering with a range filter. Shame on me for not taking the 30 seconds to do that originally.

John R. Hanson SSC-Addicted Points: 441 More actions · Answer 11

These are Important Ideas even more so today, as Enterprise Data is growing. My concern with the solution is that you are inserting rows to the table, and I don't see the difference between this and using an intermediate table.

I believe you should be able to do this with out doing the insert if you just update the available bits instead.

There are many times with a disk full condition that you would not be able to do the insert, but need to remove the dupes.

Cheers,

John R. Hanson

Chief Applications Architect

DesignMind Business Solutions

Shaun McGuile SSCarpal Tunnel Points: 4111 More actions · Answer 12

Shaun McGuile

SSCarpal Tunnel

Points: 4111

August 22, 2008 at 1:16 am

#861677

John - you missed the whole point - its to keep the duplicates 🙂

Hiding under a desk from SSIS Implemenation Work :crazy:

RBarryYoung SSC Guru Points: 143329 More actions · Answer 13

John R. Hanson (8/21/2008)

These are Important Ideas even more so today, as Enterprise Data is growing. My concern with the solution is that you are inserting rows to the table, and I don't see the difference between this and using an intermediate table.

As I said in the article, the rules posed by the original poster did seem a little arbitrary. I freely admit that direct application of this puzzle to everyday situations today, might strain credulity a bit. In the (remote?) past it was somewhat different, space issues were very important. but structural issues were absolutely overwhelming. It was much easier to temporarily expand and then shrink an existing table (the shrink did have to be done off-line) than to create a whole new table in a managed database environment.

I believe you should be able to do this with out doing the insert if you just update the available bits instead.

I could not figure out how to do this John, but I would be very interested in seeing such a solution.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

John R. Hanson SSC-Addicted Points: 441 More actions · Answer 14

To Shaun,

I don't believe I missed the whole point.

I may have mistyped.

the general idea is that in a constrained environment, there are ways at updating the extra bits of data, rather than adding additional bits, rows or tables.

after using the updated extra bits to help update / delete the target records in a solution set, one would also have to delete the the extra bits of information to return the table to its proper state.

I am suggesting that one can use only manipulation of the extra bits available and not resort to adding additional rows to arrive at a solution.

In the same way the author (who got my point) had to delete added rows, one would have delete any added meta data in unused bits as addressed above.

Cheers,

John R. Hanson

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 15

John R. Hanson (8/21/2008)

These are Important Ideas even more so today, as Enterprise Data is growing. My concern with the solution is that you are inserting rows to the table, and I don't see the difference between this and using an intermediate table.

There are many times with a disk full condition that you would not be able to do the insert, but need to remove the dupes.

I absolutely agree...

I believe you should be able to do this with out doing the insert if you just update the available bits instead.

Would absolutely love to see that code... got any ideas on how you might be able to pull that off? 🙂

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions