Detect duplicate rows - with a twist

  • I just read Part 2 of RBarryYoung's article "There Must Be 15 Ways To Lose Your Cursors" (http://qa.sqlservercentral.com/articles/T-SQL/66494/) in which he claims that, as of SQL Server 2005 and later, cursors are "no longer necessary" for SQL development and are in fact "significantly harmful".

    While I would completely agree that cursors are "significantly harmful", I'm not yet convinced that they are "no longer necessary". I have pasted a stored proc below that detects "duplicate" rows in a table (really just duplicate values in a particular column). There are certainly set-based methods of detecting and eliminating duplicate rows in a table, but this particular procedure has a twist: a "duplicate" row can only be removed if its "billed" column = 0.

    Here's the real-world problem that this procedure solves:

    We receive flat files from 3rd parties that contain millions of records on a regular basis (daily, weekly, monthly, it depends). We parse/analyze the data using a Windows Forms application and then Bulk Insert the rows into SQL Server. Every so often (usually monthly) we aggregate the rows and generate a bill for them. At that time, the rows that are used to create an invoice are marked as "billed = 1". We never, ever, want to remove a row that has "billed = 1". Those rows represent billing detail and are necessary for us to justify the invoices we send out.

    The problem, then, is that these 3rd parties sometimes send us duplicate rows. My first thought was to make the "RawData" column of the table (which stores the entire unparsed fixed-length record) "unique". That, of course, destroys insert performance and causes other problems, like an entire Bulk Insert of 5000 rows failing because it contains one duplicate. I decided that it was faster to allow the duplicate rows in and then try to remove them later (I'm open to suggestions to the contrary). So consider the following scenarios:

    1. Duplicate rows where all of them are "unbilled". All duplicates should be removed and one row should remain.

    2. Duplicate rows where one (or more) of them is "billed" and the others are "unbilled". All of the "unbilled" duplicate rows should be removed.

    3. Duplicate rows where all of them are "billed" (shouldn't happen, but what if?). No rows should be removed.

    Here is the RBAR stored proc that accomplishes this task. I'd like suggestions on how to eliminate the cursor and (more importantly) improve performance.

    SET ANSI_NULLS OFF
    GO
    SET QUOTED_IDENTIFIER OFF
    GO

    ALTER PROCEDURE [dbo].[sbs_remove_duplicate_calls]
    /*
        Procedure:   sbs_remove_duplicate_calls
        Author:      Scott Roberts
        Created:     1/4/2008
        Parameters:  None
        Description: This procedure removes duplicate rows from the CallDetailRecords table.
                     Rows are considered duplicate if the "rawdata" columns match exactly.
        Revision History:
        01/04/2008 - sgr - Copied from sbs_remove_dup_emidata
    */
    AS
    set nocount ON

    -- Get all rows with duplicate "rawdata".
    print('Finding duplicate raw data...')

    select RawData, COUNT(*) as cnt
    into #cdrdupes
    from CallDetailRecords
    group by RawData
    having COUNT(*) > 1

    -- Table for holding ids of duplicate rows.
    CREATE TABLE #delids (id uniqueidentifier NOT NULL)

    -- Step through the duplicate row ids keeping only one row.
    print('Deciding which rows to remove...')

    declare @keepraw char(300), @rawdata char(300), @cdr_id UNIQUEIDENTIFIER, @billed BIT

    declare tmpcur cursor
    for
        select rawdata, id, billed
        from CallDetailRecords
        where rawdata in (select rawdata from [#cdrdupes])
        order by rawdata, billed desc

    open tmpcur

    set @keepraw = ''

    fetch tmpcur into @rawdata, @cdr_id, @billed
    while (@@FETCH_STATUS = 0) BEGIN
        if (@keepraw <> @rawdata) BEGIN
            -- Keep the first row we encounter (should be the "billed" one, if there is a "billed" one).
            set @keepraw = @rawdata
        end
        ELSE IF (@billed <> 1) BEGIN
            -- Mark any unbilled duplicates for deletion.
            insert into #delids (id) values (@cdr_id)
        end

        fetch tmpcur into @rawdata, @cdr_id, @billed
    end

    close tmpcur
    deallocate tmpcur

    update CallDetailRecords SET Rated = 0, ErrorText = 'Duplicate Call' where id in (select id from #delids)

    set nocount OFF
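
    For reference, a minimal set-based sketch of the same three rules (SQL Server 2005 and later) using ROW_NUMBER() instead of the cursor; it assumes the CallDetailRecords columns shown later in the thread and is only a sketch, not the procedure above:

    -- Number the rows in each RawData group with billed rows first, then flag every
    -- unbilled row after the first in its group as a duplicate (billed rows are never touched).
    ;with ranked as
    (
        select Rated, ErrorText, Billed,
               ROW_NUMBER() OVER (PARTITION BY RawData
                                  ORDER BY Billed DESC, Id) as rn
        from dbo.CallDetailRecords
    )
    update ranked
    set Rated = 0, ErrorText = 'Duplicate Call'
    where rn > 1
      and Billed = 0;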

  • It would sure be helpful to have table structures and some test data. Plus, you get some tested code back, instead of "this should work". Please see the link in my signature for what helpers on this site would be looking for.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Sure, no problem.

    There is only one table and only a few pertinent columns:

    /****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO

    CREATE TABLE [dbo].[CallDetailRecords](
        [Id] [uniqueidentifier] NOT NULL,
        [RawData] [varchar](5000) NOT NULL,
        [ErrorText] [varchar](200) NULL,
        [Rated] [bit] NOT NULL,
        [Billed] [bit] NULL
    ) ON [PRIMARY]
    GO

    SET ANSI_PADDING OFF

    And here is some sample data:

    -- Duplicate Unbilled Calls (one should be removed)
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 )
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 )

    -- Duplicate Billed Calls (neither one should be removed)
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 )
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 )

    -- Mixed duplicates. All unbilled calls should be removed.
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 )
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 )
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 )

    And finally, the desired results:

    Id                                    RawData     ErrorText       Rated  Billed
    B37131CE-5F51-4AD4-A368-1B5B54600529  1234567890  NULL            1      0
    2D3E2A5A-D511-4883-BA6C-F47B7616E7C7  1234567890  Duplicate Call  0      0
    F7E3FA78-25C1-4F4B-8359-9E9B16A3A744  9876543210  NULL            1      1
    C925F42C-F568-4AEA-B325-7285C491403F  9876543210  NULL            1      1
    D53DCE66-BDB7-4730-BC96-75DC1D13AD11  5555555555  NULL            1      1
    1C7DDD68-C22D-432C-8C3F-D8F141E9DC8A  5555555555  Duplicate Call  0      0
    422EED15-4BA6-40CA-AC24-2445C8CAAD87  5555555555  Duplicate Call  0      0

    Thanks in advance!

  • This should work for you. I just revised it to break the duplicate RawData out into a separate temp table and index it. The original version using just the CTE was going to scan your input table multiple times; this version scans it once to build #DupRawData and once to delete rows. Depending on the ratio of duplicates to non-duplicates, and billed to unbilled, it might be faster to just pull the rows that are going to be saved rather than delete rows (a rough sketch of that idea follows the code below). Less logging and index maintenance; just something to consider. An index over RawData on the input table might also buy you some speed in the long run.

    Please let me know if you have any questions about the logic below.

    Bob

    --
    -- In this example "rawdata" represents the set of columns that you would consider duplicates,
    -- but we will just use the three scenarios above to separate them.
    --
    declare @sample table (rowID int identity, rawdata varchar(30), billed tinyint)

    insert into @sample
    select '1: All UNbilled',0 union all
    select '1: All UNbilled',0 union all
    select '1: All UNbilled',0 union all
    select '1: All UNbilled',0 union all
    select '2: One or more billed',0 union all
    select '2: One or more billed',1 union all
    select '2: One or more billed',0 union all
    select '2: One or more billed',1 union all
    select '2: One or more billed',0 union all
    select '3: All billed',1 union all
    select '3: All billed',1 union all
    select '3: All billed',1 union all
    select '3: All billed',1 union all
    select '4: No duplicate',0 union all
    select '5: No duplicate',1
    --
    select * from @sample
    --
    set statistics io on;

    ;with dupRawData as -- identify duplicate groups
    (
        select rawdata
              ,count(*) as rowsCount
              ,SUM(case when billed = 0 then 1 else 0 end) as unbilledCount
              ,MIN(case when billed = 0 then rowid else null end) as keeper
        from @sample
        group by rawdata
        having COUNT(*) > 1
           and SUM(case when billed = 0 then 1 else 0 end) > 0 -- must have SOME unbilled
    )
    select *
    into #dupRawData
    from dupRawData
    --
    create unique clustered index #pk_dupRawData on #duprawdata (rawdata) with(fillfactor = 100)
    --
    delete @sample
    from @sample s
    join #dupRawData d on s.rawdata = d.rawdata
    where (d.rowsCount > d.unbilledCount and s.billed = 0)       -- if some are billed, flush the unbilled
       or (d.rowsCount = d.unbilledCount and s.rowID <> d.keeper) -- if all unbilled, keep the first 1
    --
    set statistics io off;

    select * from @sample

    drop table #dupRawData
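
    A rough sketch of the "pull the rows to be saved" idea mentioned above (same @sample layout, run in place of the delete; a sketch only, not the tested solution):

    ;with g as   -- summarize each rawdata group once
    (
        select rawdata,
               count(*) as rowsCount,
               sum(case when billed = 0 then 1 else 0 end) as unbilledCount,
               min(case when billed = 0 then rowid else null end) as keeper
        from @sample
        group by rawdata
    )
    select s.*
    into #keepers            -- survivors only; swap this set back in instead of deleting
    from @sample s
    join g on s.rawdata = g.rawdata
    where g.rowsCount = 1                                         -- no duplicates at all
       or g.unbilledCount = 0                                     -- all billed: keep everything
       or (g.rowsCount > g.unbilledCount and s.billed = 1)        -- mixed: keep only the billed rows
       or (g.rowsCount = g.unbilledCount and s.rowID = g.keeper)  -- all unbilled: keep the first one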

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Scott, please note that I revised my original solution to cut down on the number of scans of your input table. It now creates a temp table (#dupRawData) that tells us whether or not a group is scenario 1 or scenario 2 and also preserves the rowID to be saved, all in one pass. The deletion pass then joins to this table on the rawdata.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thanks Bob. I noticed that your original solution generated 4 table scans on the main table (that has millions of records) but I was reserving judgment until I ran some actual comparisons. My RBAR solution is running now, and I'll try your new solution once it finishes.

  • Oops. The RowId column (or "Id" column) is a uniqueidentifier so "MIN(rowid)" doesn't work. I don't see a "First()" function for SQL Server. Any other ideas?

  • Tsk... I'll never understand what is wrong with a simple integer.

    cast(min(cast(ID as varchar(50))) as uniqueidentifier)

    I can give you a cross apply solution, but nothing will build #dupRawData as quickly as a MIN() would.
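
    For what it's worth, one possible shape of that CROSS APPLY alternative, sketched against the real CallDetailRecords table (an illustration only, not tested code): it picks one unbilled keeper Id per duplicate RawData group without aggregating the uniqueidentifier.

    select g.RawData, k.Id as keeper
    from (select RawData                       -- duplicate groups
          from dbo.CallDetailRecords
          group by RawData
          having count(*) > 1) g
    cross apply (select top (1) c.Id           -- first unbilled row in the group, by Id
                 from dbo.CallDetailRecords c
                 where c.RawData = g.RawData
                   and c.Billed = 0
                 order by c.Id) k;
    -- Groups with no unbilled rows simply drop out, which is fine: nothing in them needs removing.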

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • How about this?

    EDIT: Code posted here was removed as it failed!

    I noticed that your table def had RawData defined as varchar(5000) but your cursor function had it defined as varchar(300). Based on that I changed the table def.

  • Lynn, doesn't this assume that all the duplicates are going to fall together in sequence?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (4/27/2009)


    Lynn, doesn't this assume that all the duplicates are going to fall together in sequence?

    Possibly, but based on the trivial data, table structure, and expected results provided, it works.

    For a better answer, perhaps a more detailed description of the requirements would be helpful as would the DDL of the tables, a more representative sampling of data, and (of course) the expected results based on that data.

  • Bob Hovious (4/27/2009)


    Tsk... I'll never understand what is wrong with a simple integer.

    I'm certain that you (and everyone else in these forums) are keenly aware of the pros and cons of each player in the never-ending "int vs guid" debate. 😀

    Bob Hovious (4/27/2009)


    cast(min(cast(ID as varchar(50))) as uniqueidentifier)

    That seems to work fine.

    Thanks again.

  • Actually, it's broken; something was missing. Ignore the current code, I need to rework it a bit.

  • Lynn Pettis (4/27/2009)


    Possibly, but based on the trivial data, table structure, and expected results provided, it works.

    Actually, it doesn't.

    Check your results against the "expected results" in post #3 again. Your solution marks both rows with RawData = '1234567890' as "Duplicates", but only one of them is a duplicate.

  • Scott Roberts (4/27/2009)


    Lynn Pettis (4/27/2009)


    Possibly, but based on the trivial data, table structure, and expected results provided, it works.

    Actually, it doesn't.

    Check your results against the "expected results" in post #3 again. Your solution marks both rows with RawData = '1234567890' as "Duplicates", but only one of them is a duplicate.

    Actually, if you check the post just before yours, you'll see that I noticed. I have, however, fixed it:

    /****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO

    CREATE TABLE [dbo].[CallDetailRecords](
        [Id] [uniqueidentifier] NOT NULL,
        [RawData] [varchar](300) NOT NULL,
        [ErrorText] [varchar](200) NULL,
        [Rated] [bit] NOT NULL,
        [Billed] [bit] NULL
    ) ON [PRIMARY]
    GO

    CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](
        [RawData]
    ) ON [PRIMARY];
    GO

    -- Duplicate Unbilled Calls (one should be removed)
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );

    -- Duplicate Billed Calls (neither one should be removed)
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );

    -- Mixed duplicates. All unbilled calls should be removed.
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );

    select *
    from dbo.CallDetailRecords;

    declare @RawData varchar(300),
            @ErrorText varchar(200);

    set @RawData = '';

    -- Ordered-update trick: the index hint walks the rows in clustered index (RawData) order,
    -- carrying the previous row's RawData in @RawData. Any row whose RawData matches the
    -- previous row's and whose Billed = 0 gets flagged as a duplicate.
    update dbo.CallDetailRecords set
        @ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0
                                      then 'Duplicate Call'
                                      else cdr.ErrorText end,
        @RawData = cdr.RawData
    from
        dbo.CallDetailRecords cdr with (Index = 1);

    select *
    from dbo.CallDetailRecords;

    drop table dbo.CallDetailRecords;
