OUTPUT clause returning the wrong row?

  • I'm looking at various methods for deleting duplicate rows. Among the alternatives, one works just fine but gives me results that make me go, "huh?".

    Consider this script:

    declare @t table (a int, b int, c int, d int, e int)

    insert into @t (a, b, c, d, e) values

    (1, 2, 3, 4, 5),

    (3, 4, 2, 3, 4),

    (1, 2, 3, 4, 5)

    select a,b,c,d,e, rn = row_number() over (

    partition by a,b,c,d,e

    order by a,b,c,d,e)

    from @t;

    with cte as (

    select a,b,c,d,e, rn = row_number() over (

    partition by a,b,c,d,e

    order by a,b,c,d,e)

    from @t

    )

    delete

    from cte

    output deleted.*

    where rn > 1;

    The code works -- that is, the duplicate row is deleted. However the output clause returns:

    a b c d e rn

    1 2 3 4 5 1

    So....why? Why does the output clause show that the row with rn=1 was deleted, when the where clause stipulates rn > 1?

    Gerald Britton, Pluralsight courses

  • I presume this is just an 'out of interest' type question, because when you're deleting exact duplicates from a heap, there is no real row number? You could not use this information to relate back to the original data and match the duplicate rows one to one.

    If you add another dupe row, you'll see that the deleted row numbers are 1 and 2. So it seems that the SQL engine removes the correct rows and then somehow loses track of the row which remains. Looks a bit strange, but not really a problem.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin (11/6/2014)


    I presume this is just an 'out of interest' type question, because when you're deleting exact duplicates from a heap, there is no real row number?

    The row number is generated by the ROW_NUMBER() function and the sort order is specified. AFAIK there is no "real" row number in a clustered index either. The closest you get is the CI key which maps to RID which is just a file/page/rowid.

    Even if I change the table definition to:

    declare @t table (id int identity(1,1) primary key, a int, b int, c int, d int, e int)

    to make it a CI, the output clause still returns incorrect results.

    Gerald Britton, Pluralsight courses

  • g.britton (11/6/2014)


    Phil Parkin (11/6/2014)


    I presume this is just an 'out of interest' type question, because when you're deleting exact duplicates from a heap, there is no real row number?

    The row number is generated by the ROW_NUMBER() function and the sort order is specified. AFAIK there is no "real" row number in a clustered index either. The closest you get is the CI key which maps to RID which is just a file/page/rowid.

    Even if I change the table definition to:

    declare @t table (id int identity(1,1) primary key, a int, b int, c int, d int, e int)

    to make it a CI, the output clause still returns incorrect results.

    Well, using your revised table definition, the delete should be done like this:

    delete t

    output deleted.*

    from @t t

    join cte on t.Id = cte.id

    where rn > 1;

    And everything works.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • My guess is it's just some quirk with the output clause and ROW_NUMBER(), because the expected row gets kept even when the OUTPUT seems to indicate otherwise:

    declare @t table (a int, b int, c int, d int, e int, f int)

    insert into @t (a, b, c, d, e, f) values

    (1, 2, 3, 4, 5, 99),

    (3, 4, 2, 3, 4, 98),

    (1, 2, 3, 4, 5, 96),

    (1, 2, 3, 4, 5, 97)

    select a,b,c,d,e, rn = row_number() over (

    partition by a,b,c,d,e

    order by a,b,c,d,e)

    from @t;

    with cte as (

    select a,b,c,d,e, rn = row_number() over (

    partition by a,b,c,d,e

    order by a,b,c,d,e,f)

    from @t

    )

    delete

    from cte

    output deleted.*

    where rn > 1;

    select * from @t

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • well, isn't that interesting!

    Don't know about the "should be done" bit though. My version doesn't need a self-join. It's just weird that SQL loses track of what it just deleted.

    Gerald Britton, Pluralsight courses

  • g.britton (11/6/2014)


    well, isn't that interesting!

    Don't know about the "should be done" bit though. My version doesn't need a self-join. It's just weird that SQL loses track of what it just deleted.

    You're missing something.

    The issue has to do with using the row_number() against the DELETED "magic table". If you were to add in the ID number like you mentioned and control which one to delete you will see it's wiping out the expected row. That said the ROW_NUMBER value isn't a physical value in the table, so DELETED doesn't have it to return. As a result the query attempt to provide based on what it knows from the CTE, but only find one row (in the delete), ergo RN is 1.

    Confirm the results this way:

    declare @t table (a int, b int, c int, d int, e int, id int identity(1,1))--<-- adding in the identity

    insert into @t (a, b, c, d, e) values

    (1, 2, 3, 4, 5),

    (3, 4, 2, 3, 4),

    (1, 2, 3, 4, 5)

    select a,b,c,d,e,ID, row_number() over (

    partition by a,b,c,d,e

    order by a,b,c,d,e, ID) as RN --<-- making SURE that the "second" duplicate gets deleted

    from @t;

    with cte as (

    select a,b,c,d,e,ID, row_number() over (

    partition by a,b,c,d,e

    order by a,b,c,d,e, ID) as RN

    from @t

    )

    delete

    from cte

    output deleted.*

    where rn > 1;

    You will see that ID = 3 is the one that is gone. The ID is reliable since it's a physical column in the table which is what exists in the DELETED pseudo table

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Quick thought, there are actually two row_number instances in the plan, one used for the delete and another for the output. As only one row is outputted, the row number is of course 1.

    😎

    [Table Scan]-->[Sort]--[Segment]-->[Sequence Project]-->[Filter]-->[Table Delete]-->[Sort]-->[Segment]-->[Sequence Project]-->[DELETE]

    |<---------- 1st row number -------->| delete |<--------- 2nd row number ---------->| output

  • Good observation, though the output becomes misleading. I never doubted that the "right" row was deleted. I just found the results from the output clause to be confusing. I've lowered that to "misleading" 🙂

    Gerald Britton, Pluralsight courses

  • I actually encountered this exact behaviour last weekend. Got me a bit worried!

    My explanation would be that the CTE is never materialised as such, so the rn value is never stored, and the ROW_NUMBER() clause is re-evaluated in the deleted table.

  • Gazareth (11/6/2014)


    I actually encountered this exact behaviour last weekend. Got me a bit worried!

    My explanation would be that the CTE is never materialised as such, so the rn value is never stored, and the ROW_NUMBER() clause is re-evaluated in the deleted table.

    That's in a way correct, there are two row number (segment-sequence project) operators in the execution plan, one for the delete and one for the output.

    😎

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply