Detect duplicate rows - with a twist

  • I just read Part 2 of RBarryYoung's article "There Must Be 15 Ways To Lose Your Cursors" (http://qa.sqlservercentral.com/articles/T-SQL/66494/) in which he claims that, as of SQL Server 2005 and later, cursors are "no longer necessary" for SQL development and are in fact "significantly harmful".

    While I would completely agree that cursors are "significantly harmful", I'm not yet convinced that they are "no longer necessary". I have pasted a stored proc below that detects "duplicate" rows in a table (really just duplicate values in a particular column). There are certainly set-based methods of detecting and eliminating duplicate rows in a table, but this particular procedure has a twist: a "duplicate" row can only be removed if its "billed" column = 0.

    Here's the real-world problem that this procedure solves:

    We receive flat files from 3rd parties that contain millions of records on a regular basis (daily, weekly, monthly, it depends). We parse/analyze the data using a Windows Forms application and then Bulk Insert the rows into SQL Server. Every so often (usually monthly) we aggregate the rows and generate a bill for them. At that time, the rows that are used to create an invoice are marked as "billed = 1". We never, ever, want to remove a row that has "billed = 1". Those rows represent billing detail and are necessary for us to justify the invoices we send out.

    The problem, then, is that these 3rd parties sometimes send us duplicate rows. My first thought was to make the "RawData" column of the table (which stores the entire unparsed fixed-length record) "unique". That, of course, destroys insert performance and causes other problems, like an entire Bulk Insert of 5000 rows failing because it contains one duplicate. I decided that it was faster to allow the duplicate rows in and then try to remove them later (I'm open to suggestions to the contrary). So consider the following scenarios:

    1. Duplicate rows where all of them are "unbilled". All duplicates should be removed and one row should remain.

    2. Duplicate rows where one (or more) of them is "billed" and the others are "unbilled". All of the "unbilled" duplicate rows should be removed.

    3. Duplicate rows where all of them are "billed" (shouldn't happen, but what if?). No rows should be removed.

    Here is the RBAR stored proc that accomplishes this task. I'd like suggestions on how to eliminate the cursor and (more importantly) improve performance.

    SET ANSI_NULLS OFF
    GO
    SET QUOTED_IDENTIFIER OFF
    GO

    ALTER PROCEDURE [dbo].[sbs_remove_duplicate_calls]
    /*
        Procedure:   sbs_remove_duplicate_calls
        Author:      Scott Roberts
        Created:     1/4/2008
        Parameters:  None
        Description: This procedure removes duplicate rows from the CallDetailRecords table.
                     Rows are considered duplicate if the "rawdata" columns match exactly.
        Revision History:
        01/04/2008 - sgr - Copied from sbs_remove_dup_emidata
    */
    AS
    set nocount ON

    -- Get all rows with duplicate "rawdata".
    print('Finding duplicate raw data...')

    select RawData, COUNT(*) as cnt
    into #cdrdupes
    from CallDetailRecords
    group by RawData
    having COUNT(*) > 1

    -- Table for holding ids of duplicate rows.
    CREATE TABLE #delids (id uniqueidentifier NOT NULL)

    -- Step through the duplicate row ids keeping only one row.
    print('Deciding which rows to remove...')

    declare @keepraw char(300), @rawdata char(300), @cdr_id UNIQUEIDENTIFIER, @billed BIT

    declare tmpcur cursor
    for
        select rawdata, id, billed
        from CallDetailRecords
        where rawdata in (select rawdata from [#cdrdupes])
        order by rawdata, billed desc

    open tmpcur

    set @keepraw = ''

    fetch tmpcur into @rawdata, @cdr_id, @billed
    while (@@FETCH_STATUS = 0) BEGIN
        if (@keepraw <> @rawdata) BEGIN
            -- Keep the first row we encounter (should be the "billed" one, if there is a "billed" one).
            set @keepraw = @rawdata
        end
        ELSE IF (@billed <> 1) BEGIN
            -- Mark any unbilled duplicates for deletion.
            insert into #delids (id) values (@cdr_id)
        end

        fetch tmpcur into @rawdata, @cdr_id, @billed
    end

    close tmpcur
    deallocate tmpcur

    update CallDetailRecords SET Rated = 0, ErrorText = 'Duplicate Call' where id in (select id from #delids)

    set nocount OFF
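
    For reference, a minimal set-based sketch of the same three rules (SQL Server 2005 and later) using ROW_NUMBER() instead of the cursor; it assumes the CallDetailRecords columns shown later in the thread and is only a sketch, not the procedure above:

    -- Number the rows in each RawData group with billed rows first, then flag every
    -- unbilled row after the first in its group as a duplicate (billed rows are never touched).
    ;with ranked as
    (
        select Rated, ErrorText, Billed,
               ROW_NUMBER() OVER (PARTITION BY RawData
                                  ORDER BY Billed DESC, Id) as rn
        from dbo.CallDetailRecords
    )
    update ranked
    set Rated = 0, ErrorText = 'Duplicate Call'
    where rn > 1
      and Billed = 0;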

  • It would sure be helpful to have table structures and some test data. Plus, you get some tested code back, instead of "this should work". Please see the link in my signature for what helpers on this site would be looking for.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Sure, no problem.

    There is only one table and only a few pertinent columns:

    /****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO

    CREATE TABLE [dbo].[CallDetailRecords](
        [Id] [uniqueidentifier] NOT NULL,
        [RawData] [varchar](5000) NOT NULL,
        [ErrorText] [varchar](200) NULL,
        [Rated] [bit] NOT NULL,
        [Billed] [bit] NULL
    ) ON [PRIMARY]
    GO

    SET ANSI_PADDING OFF

    And here is some sample data:

    -- Duplicate Unbilled Calls (one should be removed)
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 )
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 )

    -- Duplicate Billed Calls (neither one should be removed)
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 )
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 )

    -- Mixed duplicates. All unbilled calls should be removed.
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 )
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 )
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 )

    And finally, the desired results:

    Id                                    RawData     ErrorText       Rated  Billed
    B37131CE-5F51-4AD4-A368-1B5B54600529  1234567890  NULL            1      0
    2D3E2A5A-D511-4883-BA6C-F47B7616E7C7  1234567890  Duplicate Call  0      0
    F7E3FA78-25C1-4F4B-8359-9E9B16A3A744  9876543210  NULL            1      1
    C925F42C-F568-4AEA-B325-7285C491403F  9876543210  NULL            1      1
    D53DCE66-BDB7-4730-BC96-75DC1D13AD11  5555555555  NULL            1      1
    1C7DDD68-C22D-432C-8C3F-D8F141E9DC8A  5555555555  Duplicate Call  0      0
    422EED15-4BA6-40CA-AC24-2445C8CAAD87  5555555555  Duplicate Call  0      0

    Thanks in advance!

  • This should work for you. I just revised it to break the duplicate RawData out into a separate temp table and index it. The original version using just the CTE was going to scan your input table multiple times; this version scans it once to build #DupRawData and once to delete rows. Depending on the ratio of duplicates to non-duplicates, and billed to unbilled, it might be faster to just pull the rows that are going to be saved rather than delete rows (a rough sketch of that idea follows the code below). Less logging and index maintenance; just something to consider. An index over RawData on the input table might also buy you some speed in the long run.

    Please let me know if you have any questions about the logic below.

    Bob

    --
    -- In this example "rawdata" represents the set of columns that you would consider duplicates,
    -- but we will just use the three scenarios above to separate them.
    --
    declare @sample table (rowID int identity, rawdata varchar(30), billed tinyint)

    insert into @sample
    select '1: All UNbilled',0 union all
    select '1: All UNbilled',0 union all
    select '1: All UNbilled',0 union all
    select '1: All UNbilled',0 union all
    select '2: One or more billed',0 union all
    select '2: One or more billed',1 union all
    select '2: One or more billed',0 union all
    select '2: One or more billed',1 union all
    select '2: One or more billed',0 union all
    select '3: All billed',1 union all
    select '3: All billed',1 union all
    select '3: All billed',1 union all
    select '3: All billed',1 union all
    select '4: No duplicate',0 union all
    select '5: No duplicate',1
    --
    select * from @sample
    --
    set statistics io on;

    ;with dupRawData as -- identify duplicate groups
    (
        select rawdata
              ,count(*) as rowsCount
              ,SUM(case when billed = 0 then 1 else 0 end) as unbilledCount
              ,MIN(case when billed = 0 then rowid else null end) as keeper
        from @sample
        group by rawdata
        having COUNT(*) > 1
           and SUM(case when billed = 0 then 1 else 0 end) > 0 -- must have SOME unbilled
    )
    select *
    into #dupRawData
    from dupRawData
    --
    create unique clustered index #pk_dupRawData on #duprawdata (rawdata) with(fillfactor = 100)
    --
    delete @sample
    from @sample s
    join #dupRawData d on s.rawdata = d.rawdata
    where (d.rowsCount > d.unbilledCount and s.billed = 0)       -- if some are billed, flush the unbilled
       or (d.rowsCount = d.unbilledCount and s.rowID <> d.keeper) -- if all unbilled, keep the first 1
    --
    set statistics io off;

    select * from @sample

    drop table #dupRawData
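
    A rough sketch of the "pull the rows to be saved" idea mentioned above (same @sample layout, run in place of the delete; a sketch only, not the tested solution):

    ;with g as   -- summarize each rawdata group once
    (
        select rawdata,
               count(*) as rowsCount,
               sum(case when billed = 0 then 1 else 0 end) as unbilledCount,
               min(case when billed = 0 then rowid else null end) as keeper
        from @sample
        group by rawdata
    )
    select s.*
    into #keepers            -- survivors only; swap this set back in instead of deleting
    from @sample s
    join g on s.rawdata = g.rawdata
    where g.rowsCount = 1                                         -- no duplicates at all
       or g.unbilledCount = 0                                     -- all billed: keep everything
       or (g.rowsCount > g.unbilledCount and s.billed = 1)        -- mixed: keep only the billed rows
       or (g.rowsCount = g.unbilledCount and s.rowID = g.keeper)  -- all unbilled: keep the first one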

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Scott, please note that I revised my original solution to cut down on the number of scans of your input table. It now creates a temp table (#dupRawData) that tells us whether or not a group is scenario 1 or scenario 2 and also preserves the rowID to be saved, all in one pass. The deletion pass then joins to this table on the rawdata.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thanks Bob. I noticed that your original solution generated 4 table scans on the main table (that has millions of records) but I was reserving judgment until I ran some actual comparisons. My RBAR solution is running now, and I'll try your new solution once it finishes.

  • Oops. The RowId column (or "Id" column) is a uniqueidentifier so "MIN(rowid)" doesn't work. I don't see a "First()" function for SQL Server. Any other ideas?

  • Tsk... I'll never understand what is wrong with a simple integer.

    cast(min(cast(ID as varchar(50))) as uniqueidentifier)

    I can give you a cross apply solution, but nothing will build #dupRawData as quickly as a MIN() would.
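
    For what it's worth, one possible shape of that CROSS APPLY alternative, sketched against the real CallDetailRecords table (an illustration only, not tested code): it picks one unbilled keeper Id per duplicate RawData group without aggregating the uniqueidentifier.

    select g.RawData, k.Id as keeper
    from (select RawData                       -- duplicate groups
          from dbo.CallDetailRecords
          group by RawData
          having count(*) > 1) g
    cross apply (select top (1) c.Id           -- first unbilled row in the group, by Id
                 from dbo.CallDetailRecords c
                 where c.RawData = g.RawData
                   and c.Billed = 0
                 order by c.Id) k;
    -- Groups with no unbilled rows simply drop out, which is fine: nothing in them needs removing.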

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • How about this?

    EDIT: Code posted here was removed as it failed!

    I noticed that your table def had RawData defined as varchar(5000) but your cursor function had it defined as varchar(300). Based on that I changed the table def.

  • Lynn, doesn't this assume that all the duplicates are going to fall together in sequence?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (4/27/2009)


    Lynn, doesn't this assume that all the duplicates are going to fall together in sequence?

    Possibly, but based on the trivial data, table structure, and expected results provided, it works.

    For a better answer, perhaps a more detailed description of the requirements would be helpful as would the DDL of the tables, a more representative sampling of data, and (of course) the expected results based on that data.

  • Bob Hovious (4/27/2009)


    Tsk... I'll never understand what is wrong with a simple integer.

    I'm certain that you (and everyone else in these forums) are keenly aware of the pros and cons of each player in the never-ending "int vs guid" debate. 😀

    Bob Hovious (4/27/2009)


    cast(min(cast(ID as varchar(50))) as uniqueidentifier)

    That seems to work fine.

    Thanks again.

  • Actually, it's broken; something was missing. Ignore the current code, I need to rework it a bit.

  • Lynn Pettis (4/27/2009)


    Possibly, but based on the trivial data, table structure, and expected results provided, it works.

    Actually, it doesn't.

    Check your results against the "expected results" in post #3 again. Your solution marks both rows with RawData = '1234567890' as "Duplicates", but only one of them is a duplicate.

  • Scott Roberts (4/27/2009)


    Lynn Pettis (4/27/2009)


    Possibly, but based on the trivial data, table structure, and expected results provided, it works.

    Actually, it doesn't.

    Check your results against the "expected results" in post #3 again. Your solution marks both rows with RawData = '1234567890' as "Duplicates", but only one of them is a duplicate.

    Actually, if you check the post just before yours, you'll see that I noticed. I have, however, fixed it:

    /****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO

    CREATE TABLE [dbo].[CallDetailRecords](
        [Id] [uniqueidentifier] NOT NULL,
        [RawData] [varchar](300) NOT NULL,
        [ErrorText] [varchar](200) NULL,
        [Rated] [bit] NOT NULL,
        [Billed] [bit] NULL
    ) ON [PRIMARY]
    GO

    CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](
        [RawData]
    ) ON [PRIMARY];
    GO

    -- Duplicate Unbilled Calls (one should be removed)
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );

    -- Duplicate Billed Calls (neither one should be removed)
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );

    -- Mixed duplicates. All unbilled calls should be removed.
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );
    INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );

    select *
    from dbo.CallDetailRecords;

    declare @RawData varchar(300),
            @ErrorText varchar(200);

    set @RawData = '';

    -- Ordered-update trick: the index hint walks the rows in clustered index (RawData) order,
    -- carrying the previous row's RawData in @RawData. Any row whose RawData matches the
    -- previous row's and whose Billed = 0 gets flagged as a duplicate.
    update dbo.CallDetailRecords set
        @ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0
                                      then 'Duplicate Call'
                                      else cdr.ErrorText end,
        @RawData = cdr.RawData
    from
        dbo.CallDetailRecords cdr with (Index = 1);

    select *
    from dbo.CallDetailRecords;

    drop table dbo.CallDetailRecords;
