Just For Fun: An Impossible Delete

  • Aaron Cabrele (10/31/2009)


    It says no additional identity column so that doesn't outrule the hidden one that already exists... As for adding a guid it depends on how strictly you say what is an identity column - is that simply a column with the identity property, therefor any other column like uniqueidentifier can be added?...!

    I love the way, once most or all possibilities are exhausted, we try and re-interpret the question, bend

    the rules...

    LOL. Well, the objective is to find the simplest solution within the rules given. As with any game or puzzle, a large part of the solution is knowing what the boundaries really are.

  • Barry, someone sent me this link again, although I'd read it the last time around, and your description of the problem got me to reminiscing. It takes me back to the days of fixed file extents on disk. Probably the primary reason why you had to do an update in place was that the rest of the disk was spoken for. If things were really tight though, you might even have written a loop to de-dupe one set of identical rows at a time. There were so many hardware limitations to work around, and I'd all but forgotten them. Thanks for taking me back to the "bad old days".

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Maybe I'm missing something, but the query the original poster was looking for is simple with the appropriate GROUP BY and HAVING clauses.

    SELECT Name, Age, Sex

    FROM Source

    GROUP BY Name, Age, Sex

    HAVING COUNT(*) > 1

    This ensures that the output from the query includes only a row for duplicate rows. Note no row will be output for a unique row.

  • This would select (or delete) everything that is duplicated. The requirement was to delete only the first occurrence of a duplicate, or the only occurrence of any unique rows.

    The "first" is obviously irrelevant given there is no order and they are duplicates so I read that as "one of"

    JLSSCH (11/1/2009)


    Maybe I'm missing something, but the query the original poster was looking for is simple with the appropriate GROUP BY and HAVING clauses.

    SELECT Name, Age, Sex

    FROM Source

    GROUP BY Name, Age, Sex

    HAVING COUNT(*) > 1

    This ensures that the output from the query includes only a row for duplicate rows. Note no row will be output for a unique row.

  • I have not seen this solution in the above list. Not fancy, but performs the initial request to delete duplicate rows leaving the initial entry, with minimal resource impact (especialy disk space) and no cursors, additional table/column, while not modifying the data?????:

    DECLARE @NM VARCHAR(50), @Age TINYINT, @Sex CHAR(1);

    SELECT @NM=Name, @Age = Age, @Sex = Sex

    FROM dbo.xSource

    GROUP BY Name, Age, Sex

    HAVING COUNT(*) > 1

    WHILE @NM IS NOT NULL

    BEGIN

    DELETE TOP (1)

    FROM dbo.xSource

    WHERE Name = @NM AND Age = @Age AND Sex = @Sex

    SELECT @NM = NULL

    SELECT @NM=Name, @Age = Age, @Sex = Sex

    FROM dbo.xSource

    GROUP BY Name, Age, Sex

    HAVING COUNT(*) > 1

    END

  • Great article. I really liked the 'go back to the 70' part.

    Sometimes when I see how stuff is done just because the computer can handle the load it drives me nuts.

    ---------------------------------------------
    [font="Verdana"]Nothing is impossible.
    It is just a matter of time and money.[/font]

  • Great one....

  • Thanks for your comments. The following modified solution works when you use the row_number() function to number the rows in an output query.

    For the following query, the output for "row_number() over (order by name) as ident" will produce a column with the name of "ident" and a monotonically increasing integer value starting with 1. This is functionally the same as an identity column with a seed of 1!

    select name, age, sex, row_number() over (order by name) as ident

    from #temp

    Here is the complete script:

    create table #temp

    (

    name varchar(4),

    age int,

    sex varchar(1)

    )

    insert #temp

    select 'ABC',24,'M'

    union all

    select 'ABC',24,'M'

    union all

    select 'LMN', 27, 'M'

    union all

    select 'LMN', 27, 'M'

    union all

    select 'LMN', 27, 'M'

    union all

    select 'PQRS', 25, 'F'

    union all

    select 'XYZ', 24, 'M'

    union all

    select 'XYZ', 25, 'M'

    select a.name, a.age, a.sex, a.ident

    from

    (

    select name, age, sex, row_number() over (order by name) as ident

    from #temp

    ) a

    inner join

    (

    select b.name, b.age, b.sex, min(b.ident) as ident

    from

    (

    select name, age, sex, row_number() over (order by name) as ident

    from #temp

    ) b

    group by b.name, b.age, b.sex

    having count(*) > 1

    ) c on a.name = c.name and a.age = c.age and a.sex = c.sex and a.ident > c.ident

    drop table #temp

    --***************************

    The desired output that is obtained from executing this script is:

    ABC 24 M 2

    LMN 27 M 4

    LMN 27 M 5

    I hope that this helps.

  • One final update to my solution. Using CTE's, the output query is much easier to understand.

    Here is the complete script:

    create table #temp

    (

    name varchar(4),

    age int,

    sex varchar(1)

    )

    insert #temp

    select 'ABC',24,'M'

    union all

    select 'ABC',24,'M'

    union all

    select 'LMN', 27, 'M'

    union all

    select 'LMN', 27, 'M'

    union all

    select 'LMN', 27, 'M'

    union all

    select 'PQRS', 25, 'F'

    union all

    select 'XYZ', 24, 'M'

    union all

    select 'XYZ', 25, 'M';

    with tmp

    as

    (

    select name, age, sex, row_number() over (order by name) as ident

    from #temp

    )

    select a.name, a.age, a.sex, a.ident

    from tmp a inner join

    (

    select b.name, b.age, b.sex, min(b.ident) as ident

    from tmp b

    group by b.name, b.age, b.sex

    having count(*) > 1

    ) c on a.name = c.name and a.age = c.age and a.sex = c.sex and a.ident > c.ident

    drop table #temp

    --***************************

    The desired output that is obtained from executing this script is:

    ABC 24 M 2

    LMN 27 M 4

    LMN 27 M 5

  • hmi (10/30/2009)


    Good brain training but are the restrictions as mentioned realistic? Which company would definitely not want to use a temp table or table variable?

    Quite right, it's really just brain training. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • honza.mf (10/30/2009)


    Phil Factor (8/5/2008)


    Surely this is the simplest solution? (be warned of a nasty catch when creating these 'quirky updates'. the order of execution of the update is -variable assignments first then column assignments-both left to right!)

    Very nice solution.

    I Just prefer little change, it's more readable

    [font="Courier New"]--create the sample table

    DECLARE @Sample TABLE (TheName VARCHAR(4),Identifier INT, sex VARCHAR(1))

    INSERT INTO @Sample (TheName,Identifier,Sex)

    SELECT 'ABC', 24, 'M' UNION ALL

    SELECT 'ABC', 24,'M' UNION ALL

    SELECT 'LMN', 27, 'M' UNION ALL

    SELECT 'LMN', 27, 'M' UNION ALL

    SELECT 'LMN', 27, 'M' UNION ALL

    SELECT 'PQRS', 25, 'F' UNION ALL

    SELECT 'XYZ', 24, 'M' UNION ALL

    SELECT 'XYZ', 25, 'M'

    DECLARE @hash VARCHAR(80), @sex VARCHAR(1)

    UPDATE @sample

    SET

    @sex=CASE WHEN COALESCE(@hash,'')

    <>TheName+CONVERT(VARCHAR(5),Identifier)+sex

    THEN 'd' ELSE SEX END,

    @hash= TheName+CONVERT(VARCHAR(5),Identifier)+sex,

    Sex = @sex

    DELETE FROM @sample WHERE sex='d'

    SELECT * FROM @sample[/font]

    The quirky update trick isn't really reliable without a Clustered index. For instance, when I execute you code, thi sis what I get:

    TheName Identifier sex

    ------- ----------- ----

    ABC 24 M

    LMN 27 M

    LMN 27 M

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Stephen.Richardson (10/30/2009)


    ...

    -- Step 1 Get list of only DUPLICATE rows

    INSERT INTO xSource (Name, Age, Sex)

    SELECT '~'+Name, Age, Sex FROM xSource GROUP BY Name, Age, Sex Having COUNT(*) > 1

    The problem is that this will fail if someone's name is already using all 50 characters (not unheard of). Otherwise I would have used this approach myself. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • skjoldtc (10/30/2009)


    Thanks for the memories. A time when storage and memory was rare and expensive and you had to really know how the guts of the operating system, disk system, and memory allocation worked. Some of the old techniques are still useful today.

    Great article and a great solution. 😎

    Thanks!

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • jghali (10/30/2009)


    Funny how this article just came out today... I was actually on a similar thread last week and someone came up with a great solution...

    My personal opinion the article is interesting but ... hmmm...

    How about these solutions from this thread...

    http://qa.sqlservercentral.com/Forums/Topic793765-145-1.aspx

    There's an undocumented identity key for every row of any table that can be used... in one simple delete statement you can delete the duplicate rows...

    I was amazed...

    Check it out.

    Thanks

    I am familiar with that thread (in fact I think that I am posted there), however, 1) it came out long after this article was originally published (15 months ago) and 2) those internal IDs are a lot hard to use than you might think, esp. given the restrictions. For instance, how would you propose using them to solve this problem?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • arms.dan (10/30/2009)


    Your off-hand comment about naming the field "Gender" rather than "Sex" piqued my curiosity. Why? As I understand the words, sex means biological differences: chromosomes and sexual organs. Gender refers to the characteristics a society or culture delineates as masculine or feminine. There's some blurring of the definitions, but I think they generally hold. So why do you prefer Gender for the column label?

    In English "Sex" can also be a verb, thus leading to the joke about filling forms where is says "Sex[_]" with "Y". The PC way to stop this is to change the field name to "Gender[_]".

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 15 posts - 136 through 150 (of 156 total)

You must be logged in to reply to this topic. Login to reply