April 27, 2009 at 9:49 am
I just read Part 2 of RBarryYoung's article "There Must Be 15 Ways To Lose Your Cursors" (http://qa.sqlservercentral.com/articles/T-SQL/66494/) in which he claims that "by SQL Server 2005 and later, cursors were no longer necessary to SQL development and in fact were significantly harmful".
While I would completely agree that cursors are "significantly harmful", I'm not yet convinced that they are "no longer necessary". I have pasted a stored proc below that detects "duplicate" rows in a table (really just duplicate values in a particular column). There are certainly set-based methods of detecting and eliminating duplicate rows in a table, but this particular procedure has a twist: a "dupicate" row can only be removed if the "billed" column = 0.
Here's the real-world problem that this procedure solves:
We receive flat files from 3rd parties that contain millions of records on a regular basis (daily, weekly, monthly, it depends). We parse/analyze the data using a Windows Forms application then Bulk Insert the rows into SQL Server. Every so often (usually monthly) we aggregate the rows and generate a bill for them. At that time, the rows that are used to create an invoice are marked as "billed = 1". We never, ever, want to remove a row that has "billed = 1". Those rows represent billing detail are necessary for us to justify the invoices we send out.
The problem, then, is that these 3rd parties sometimes send us duplicate rows. My first thought was to make the "RawData" column of the table (which stores the entire unparsed fixed-length record) "unique". This of course destroys performance on inserted rows and causes other problems - like the entire Bulk Insert of 5000 rows failing because there is one duplicate in there. I decided that it was faster to allow the duplicate rows to go in and then try to remove them later (I'm open to suggestions to the contrary). So consider the following scenarios:
1. Duplicate rows where all of them are "unbilled". All duplicates should be removed and one row should remain.
2. Duplicate rows where one (or more) of them is "billed" and the others are "unbilled". All of the "unbilled" duplicate rows should be removed.
3. Duplicate rows where all of them are "billed" (shouldn't happen, but what if?). No rows should be removed.
Here is the RBAR stored proc that accomplishes this task. I'd like suggestions on how to eliminate the cursor and (more importantly) improve performance.
SET ANSI_NULLS OFF
GO
SET QUOTED_IDENTIFIER OFF
GO
ALTER PROCEDURE [dbo].[sbs_remove_duplicate_calls]
/*
Procedure: sbs_remove_duplicate_calls
Author: Scott Roberts
Created: 1/4/2008
Parameters: None
Description:
This procedure removes duplicate rows from the CallDetailRecords table. Rows are considered duplicate if the "rawdata" columns match exactly.
Revision History:
01/04/2008 - sgr - Copied from sbs_remove_dup_emidata
*/
AS
set nocount ON
-- Get all rows with duplicate "rawdata".
print('Finding duplicate raw data...')
select RawData, COUNT(*) as cnt
into #cdrdupes
from CallDetailRecords
group by RawData
having COUNT(*) > 1
-- Table for holding ids of duplicate rows.
CREATE TABLE #delids (id uniqueidentifier NOT NULL )
-- Step through the duplicate row ids keeping only one row.
print('Deciding which rows to remove...')
declare @keepraw char(300), @rawdata char(300), @cdr_id UNIQUEIDENTIFIER, @billed BIT
declare tmpcur cursor
for
select rawdata, id, billed
from CallDetailRecords
where rawdata in (select rawdata from [#cdrdupes])
order by rawdata, billed desc
open tmpcur
set @keepraw = ''
fetch tmpcur into @rawdata, @cdr_id, @billed
while (@@FETCH_STATUS = 0) BEGIN
if (@keepraw <> @rawdata) BEGIN
-- Keep the first row we encounter (should be the "billed" one, if there is a "billed" one).
set @keepraw = @rawdata
end
ELSE IF (@billed <> 1) BEGIN
-- Mark any unbilled duplicates for deletion.
insert into #delids (id) values (@cdr_id)
end
fetch tmpcur into @rawdata, @cdr_id, @billed
end
close tmpcur
deallocate tmpcur
update CallDetailRecords SET Rated = 0, ErrorText = 'Duplicate Call' where id in (select id from #delids)
set nocount OFF
April 27, 2009 at 10:22 am
It would sure be helpful to have table structures and some test data. Plus, you get some tested code back, instead of "this should work". Please see the link in my signature for what helpers on this site would be looking for.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
April 27, 2009 at 11:25 am
Sure, no problem.
There is only one table and only a few pertinent columns:
/****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[CallDetailRecords](
[Id] [uniqueidentifier] NOT NULL,
[RawData] [varchar](5000) NOT NULL,
[ErrorText] [varchar](200) NULL,
[Rated] [bit] NOT NULL,
[Billed] [bit] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
And here is some sample data:
-- Duplicate Unbilled Calls (one should be removed)
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 )
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 )
-- Duplicate Billed Calls (neither one should be removed)
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 )
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 )
-- Mixed duplicates. All unbilled calls should be removed.
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 )
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 )
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 )
And finally, the desired results:
[font="Courier New"]
Id RawData ErrorText Rated Billed
B37131CE-5F51-4AD4-A368-1B5B54600529 1234567890 NULL 1 0
2D3E2A5A-D511-4883-BA6C-F47B7616E7C7 1234567890 Duplicate Call 0 0
F7E3FA78-25C1-4F4B-8359-9E9B16A3A744 9876543210 NULL 1 1
C925F42C-F568-4AEA-B325-7285C491403F 9876543210 NULL 1 1
D53DCE66-BDB7-4730-BC96-75DC1D13AD11 5555555555 NULL 1 1
1C7DDD68-C22D-432C-8C3F-D8F141E9DC8A 5555555555 Duplicate Call 0 0
422EED15-4BA6-40CA-AC24-2445C8CAAD87 5555555555 Duplicate Call 0 0[/font]
Thanks in advance!
April 27, 2009 at 11:25 am
This should work for you. I just revised this to break duplicate RawData into a separate temp table and index it. The original version using just the CTE was going to scan your input table multiple times. This version does it once to build #DupRawData and once to delete rows. Depending on the ratio of duplicates to non duplicates, and billed to unbilled, it might be faster to just pull the rows that are going to be saved rather than delete rows. Less logging, and index maintenance, just something to consider. An index over rawData on the input file might also buy you some speed in the long run.
Please let me know if you have any questions about the logic below.
Bob
--
-- in this example "rawdata" represents a set of columns that you would consider duplicates.
-- but we will just use the three scenarios above to separate them.
--
declare @sample table (rowID int identity, rawdata varchar(30), billed tinyint)
insert into @sample
select '1: All UNbilled',0 union all
select '1: All UNbilled',0 union all
select '1: All UNbilled',0 union all
select '1: All UNbilled',0 union all
select '2: One or more billed',0 union all
select '2: One or more billed',1 union all
select '2: One or more billed',0 union all
select '2: One or more billed',1 union all
select '2: One or more billed',0 union all
select '3: All billed',1 union all
select '3: All billed',1 union all
select '3: All billed',1 union all
select '3: All billed',1 union all
select '4: No duplicate',0 union all
select '5: No duplicate',1
--
select * from @sample
--
set statistics io on;
;with dupRawData as -- identify duplicate groups
(select rawdata
,count(*) as rowsCount
,SUM(case when billed = 0 then 1 else 0 end) as unbilledCount
,MIN(case when billed = 0 then rowid else null end) as keeper
from @sample
group by rawdata
having COUNT(*) > 1
and SUM(case when billed = 0 then 1 else 0 end) > 0 -- must have SOME unbilled
)
select *
into #dupRawData
from dupRawData
--
create unique clustered index #pk_dupRawData on #duprawdata (rawdata) with(fillfactor = 100)
--
delete @sample
from @sample s
join #dupRawData d on s.rawdata = d.rawdata
where (d.rowsCount > d.unbilledCount and s.billed = 0) -- if some are billed, flush the unbilled
or (d.rowsCount = d.unbilledCount and s.rowID keeper) -- if all unbilled, keep the first 1
--
set statistics io off;
select * from @sample
drop table #dupRawData
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 27, 2009 at 12:02 pm
Scott, please note that I revised my original solution to cut down on the number of scans of your input table. It now creates a temp table (#dupRawData) that tells us whether or not a group is scenario 1 or scenario 2 and also preserves the rowID to be saved, all in one pass. The deletion pass then joins to this table on the rawdata.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 27, 2009 at 12:14 pm
Thanks Bob. I noticed that your original solution generated 4 table scans on the main table (that has millions of records) but I was reserving judgment until I ran some actual comparisons. My RBAR solution is running now, and I'll try your new solution once it finishes.
April 27, 2009 at 12:32 pm
Oops. The RowId column (or "Id" column) is a uniqueidentifier so "MIN(rowid)" doesn't work. I don't see a "First()" function for SQL Server. Any other ideas?
April 27, 2009 at 12:40 pm
Tsk... I'll never understand what is wrong with a simple integer.
cast(min(cast(ID as varchar(50))) as uniqueidentifier)
I can give you a cross apply solution, but nothing will build #dupRawData as quickly as a MIN() would.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 27, 2009 at 12:59 pm
How about this?
EDIT: Code posted here was removed as it failed!
I noticed that your table def had RawData defined as varchar(5000) but your cursor function had it defined as varchar(300). Based on that I changed the table def.
April 27, 2009 at 1:02 pm
Lynn, doesn't this assume that all the duplicates are going to fall together in sequence?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 27, 2009 at 1:09 pm
Bob Hovious (4/27/2009)
Lynn, doesn't this assume that all the duplicates are going to fall together in sequence?
Possibly, but based on the trivial data, table structure, and expected results provided, it works.
For a better answer, perhaps a more detailed description of the requirements would be helpful as would the DDL of the tables, a more representative sampling of data, and (of course) the expected results based on that data.
April 27, 2009 at 1:09 pm
Bob Hovious (4/27/2009)
Tsk... I'll never understand what is wrong with a simple integer.
I'm certain that you (and everyone else in these forums) is keenly aware of the pros and cons of each player in the never-ending "int vs guid" debate. 😀
Bob Hovious (4/27/2009)
cast(min(cast(ID as varchar(50))) as uniqueidentifier)
That seems to work fine.
Thanks again.
April 27, 2009 at 1:13 pm
Actually, its broke, something was missing. Ignore the current code, i need to rework it a bit.
April 27, 2009 at 1:14 pm
Lynn Pettis (4/27/2009)
Possibly, but based on the trivial data, table structure, and expected results provided, it works.
Actually, it doesn't.
Check your results against the "expected results" in post #3 again. Your solution marks both rows with RawData = '1234567890' as "Duplicates", but only one of them is a duplicate.
April 27, 2009 at 1:33 pm
Scott Roberts (4/27/2009)
Lynn Pettis (4/27/2009)
Possibly, but based on the trivial data, table structure, and expected results provided, it works.Actually, it doesn't.
Check your results against the "expected results" in post #3 again. Your solution marks both rows with RawData = '1234567890' as "Duplicates", but only one of them is a duplicate.
Actually, if you check the post just before yours, I noticed. I have, however, fixed it:
/****** Object: Table [dbo].[CallDetailRecords] Script Date: 04/27/2009 12:10:02 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[CallDetailRecords](
[Id] [uniqueidentifier] NOT NULL,
[RawData] [varchar](300) NOT NULL,
[ErrorText] [varchar](200) NULL,
[Rated] [bit] NOT NULL,
[Billed] [bit] NULL
) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX IX_CallDetailRecords1 ON [dbo].[CallDetailRecords](
[RawData]
) ON [PRIMARY]
;
GO
-- Duplicate Unbilled Calls (one should be removed)
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'1234567890',NULL,1,0 );
-- Duplicate Billed Calls (neither one should be removed)
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'9876543210',NULL,1,1 );
-- Mixed duplicates. All unbilled calls should be removed.
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,1 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );
INSERT INTO [CallDetailRecords] ( [Id],[RawData],[ErrorText],[Rated],[Billed] ) VALUES ( NEWID(),'5555555555',NULL,1,0 );
select
*
from
dbo.CallDetailRecords
;
declare @RawData varchar(300),
@ErrorText varchar(200);
set @RawData = '';
update dbo.CallDetailRecords set
@ErrorText = ErrorText = case when @RawData = cdr.RawData and cdr.Billed = 0 then 'Duplicate Call' else cdr.ErrorText end,
@RawData = cdr.RawData
from
dbo.CallDetailRecords cdr with (Index = 1)
;
select
*
from
dbo.CallDetailRecords
;
drop table dbo.CallDetailRecords
;
Viewing 15 posts - 1 through 15 (of 40 total)
You must be logged in to reply to this topic. Login to reply