SQLServerCentral Article

Dealing with Incomplete Data - A T-SQL Puzzle 1

,

Introduction

As DBA's and developers, we come across situations where the data to be processed needs to have a number of required elements. In a large set of data, some of the data may not be a complete unit and hence cannot be processed. For example, an employee record may not be complete without a LastName, a geographic record may not be complete without longitude and so on.

Once bad records are identified, they may need to be eliminated and good ones may need to be processed in bulk. Depending on the format and volume of the overall data set, identifying such bad records may be an arduous task. This article demonstrates the use of Common Table Expression and the OUTPUT clause in identifying bad records for a specific scenario that I recently encountered. 

Building up the Scenario

A company that organizes music events receives data on prospective performers from one or more vendors. A vendor sends details of musicians available to perform at such events for the company as part of the company's orchestra.

Data is received from the vendor in a specific format in a flat file. If the flat file doesn't contain the rate-quote for a specific musician, then the entire record for that musician is considered bad and is sent back to the vendor for correction. It's only the complete records that are sent to the company's management team for approval. Once the management approves the quotes, the musicians are hired by the company to perform at a company-sponsored event.

Then the vendor gets paid by the company for all musicians that have been hired. The payment to the vendor is as per the rate quoted by the vendor for each musician hired by the company. So the data received from the vendor must contain the rate-quote (hourly rate) for each musician.

The situation would have been simple had the vendor been sending data for all musicians in columns as shown below:

Name   Address                  Primary Attribute   Rate Quote
-----  ---------------------    -----------------   ------------
John   X City, Y State, Z Zip   Drummer             $500
Mary   A City, B State, C Zip   Flutist             null
...

In such a case, the rows where the Rate Quote value was null or blank could have been deleted.

The data from the vendor is received in rows. A set of rows may be considered a record for one musician. There are several such records for musicians in the same flat file: A sample record for a musician is shown below:

Row#1  '*********************************************'
Row#2  Name
Row#3  'John'
Row#4  Address
Row#5  'X City, Y State, Z Zip'
Row#6  Primary Attribute
Row#7  'Drummer'
Row#8  RateQuote
Row#9  '$500'
Row#10 '--------------------------------------------'

The above data is considered one good record for the musician, John. Data for musicians can contain more than ten rows in cases where they have a secondary attribute like Flute or Guitar. In short, the number of rows for each musician may vary. However, it is mandatory to have a value for the RateQuote.

A sample bad record is shown below. It is bad because the RateQuote doesn't have a value:

Row#1  '*********************************************'
Row#2  Name 
Row#3  'Pierre'
Row#4  Address
Row#5  'F City, G State, H Zip'
Row#6  Primary Attribute
Row#7  'Flutist'
Row#8  Secondary Attribute
Row#9  'Violinist'
Row#10 RateQuote
Row#11 '      '
Row#12 '--------------------------------------------'

Deleting a bad record means deleting all the rows from Row#1 through Row#12 for this musician named Pierre.

The goal of this article is to lay out a method to identify the bad records. It is not the aim of this article to show how good records are processed down the line.

Setting up the situation

The data from the vendor is stored in a table called dbo.Test_Table. This table has an identity column Seq_No; the second column CData hosts the actual vendor data. The script below creates the table, dbo.Test_Table:

if exists( select 1 from sys.tables
 where type = 'U'
 and name = 'Test_Table'
 and USER_NAME(schema_id) = 'dbo')
drop table dbo.Test_Table 
go
create table dbo.Test_Table 
(Seq_No int identity(1,1) Primary Key clustered
,CData varchar(50) not null)
go

The T-SQL script below populates the table, dbo.Test_Table with sample data. Here six records are being inserted, one per musician. The script below is compatible with Microsoft SQL Server 2008 but not with earlier versions. It may be noted that two of the records (for musicians Bob and Spencer) are bad as they do not have a value for the RateQuote:

insert into dbo.Test_Table (CData)
values ('*********************************************')
 ,('Name')
 ,('James')
 ,('Address')
 ,('123 Lane1, City1 ST 12341')
 ,('Primary Attribute')
 ,('Pianist')
 ,('Secondaty Attribute')
 ,('Flutist')
 ,('RateQuote')
 ,('$500')
 ,('--------------------------------------------')
 ,('*********************************************')
 ,('Name')
 ,('Veronica')
 ,('Address')
 ,('321 Lane2, City2 ST 12342')
 ,('Primary Attribute')
 ,('Violinist')
 ,('RateQuote')
 ,('$600') 
 ,('--------------------------------------------')
 ,('*********************************************')
 ,('Name')
 ,('Bob')
 ,('Address')
 ,('213 Lane3, City3 ST 12343')
 ,('Primary Attribute')
 ,('Drummer')
 ,('RateQuote')
 ,(' ') 
 ,('--------------------------------------------')
 ,('*********************************************')
 ,('Name')
 ,('Harry')
 ,('Address')
 ,('312 Lane4, City4 ST 12344')
 ,('Primary Attribute')
 ,('Guitarist')
 ,('RateQuote')
 ,('$500') 
 ,('--------------------------------------------')
 ,('*********************************************')
 ,('Name')
 ,('Spencer')
 ,('Address')
 ,('456 Lane5, City5 ST 12345')
 ,('Primary Attribute')
 ,('Saxophonist')
 ,('RateQuote')
 ,(' ') 
 ,('--------------------------------------------')
 ,('*********************************************')
 ,('Name')
 ,('Jessica')
 ,('Address')
 ,('456 Lane6, City6 ST 12346')
 ,('Primary Attribute')
 ,('Flutist')
 ,('RateQuote')
 ,('$600') 
 ,('--------------------------------------------')
 ,('*********************************************')

Note the data:

select * from dbo.Test_Table

The bad data (marked in red below) is within the Seq_No values 23 to 32 and 43 to 52.

Seq_No  CData 
------  -----------------------------------------------------
1       *********************************************
2       Name
3       James
4       Address
5       123 Lane1, City1 ST 12341
6       Primary Attribute
7       Pianist
8       Secondaty Attribute
9       Flutist
10      RateQuote
11      $500
12      --------------------------------------------
13      *********************************************
14      Name
15      Veronica
16      Address
17      321 Lane2, City2 ST 12342
18      Primary Attribute
19      Violinist
20      RateQuote
21      $600
22      --------------------------------------------
23      *********************************************
24      Name
25      Bob
26      Address
27      213 Lane3, City3 ST 12343
28      Primary Attribute
29      Drummer
30      RateQuote
31 
32      --------------------------------------------
33      *********************************************
34      Name
35      Harry
36      Address
37      312 Lane4, City4 ST 12344
38      Primary Attribute
39      Guitarist
40      RateQuote
41      $500
42      --------------------------------------------
43      *********************************************
44      Name
45      Spencer
46      Address
47      456 Lane5, City5 ST 12345
48      Primary Attribute
49      Saxophonist
50      RateQuote
51 
52      --------------------------------------------
53      *********************************************
54      Name
55      Jessica
56      Address
57      456 Lane6, City6 ST 12346
58      Primary Attribute
59      Flutist
60      RateQuote
61      $600
62      --------------------------------------------
63      *********************************************

As is apparent, each record for a musician starts with stars ('******') and ends with '--------...'. There could be multiple entries (rows) between the starting row and the ending row for a musician's record. The number of rows for each musician may not be the same. Also, the final entry in the flat file is always stars ('******') as if it were the beginning of a new record.

Please note that I have changed the domain to music and the values as well to bring out the problem. In our practical situation, we have around 16 million records in a single flat file from the vendor and the domain is finance.

Removing the Bad Data

The script below identifies the bad records, deletes them from the table, dbo.Test_Table and then inserts those deleted bad records into the table, dbo.Test_Table_Bad_Records So the table, dbo.Test_Table_Bad_Records would ultimately host the bad records that have been identified and would be sent back to the vendor.

I have used the method below because it sounded simple to me, executes fast for a few tens of millions of records and is short. It also brings out the usage of CTE (Common Table Expression) and the OUTPUT clause. Please execute the following T-SQL code to remove bad records:

if exists( select 1 from sys.tables
           where type = 'U'
           and name   = 'Test_Table_Bad_Records'
           and USER_NAME(schema_id) = 'dbo')
drop table dbo.Test_Table_Bad_Records
go
create table dbo.Test_Table_Bad_Records
(Seq_No int not null primary key clustered 
,CData varchar(50) not null);
With CTE_Find_Start_Rows (CData, Seq_No, Finder)
as
(select CData, Seq_No, Finder = case
 when ltrim(rtrim(CData)) like '*********************************************' then 1
 else 0
 end
 from dbo.Test_Table)
,CTE_Last_Row_To_Delete (CData, Seq_No, Finder)
as
(select CData, Seq_No, Finder from CTE_Find_Start_Rows a
 where Finder = 0
 and ltrim(rtrim(CData)) = '--------------------------------------------'
 and exists (select 1 from CTE_Find_Start_Rows b
             where b.Finder = 1
             and   a.Seq_No + 1 = b.Seq_No)
 and exists (select 1 from CTE_Find_Start_Rows c
             where c.Finder = 0
             and a.Seq_No - 1 = c.Seq_No
             and ltrim(rtrim(c.CData)) = ''))
,CTE_Range_to_Delete (Delete_Start, Delete_End)
as
(select max(a.Seq_No) Delete_Start, b.Seq_No Delete_End from   
 CTE_Find_Start_Rows a inner join CTE_Last_Row_To_Delete b
 on a.Seq_No < b.Seq_No
 where a.Finder = 1
 group by b.Seq_No)
delete from a
output deleted.* into dbo.Test_Table_Bad_Records
--select a.*
from dbo.Test_Table a inner join CTE_Range_to_Delete b
on a.Seq_No between b.Delete_Start and b.Delete_End
go

Explanation of the code

First the table, dbo.Test_Table_Bad_Records is being created for storing bad records that have been deleted from the table, dbo.Test_Table.

The purpose of the common table expression, CTE_Find_Start_Rows is to identify the starting row in a record for each musician. This is accomplished through the field called Finder. Finder is assigned a value of 1 for those rows that have all stars ('******') in them. For all other rows, the value of Finder would be 0.

The purpose of the common table expression, CTE_Last_Row_To_Delete is to identify the ending row in a bad record for each musician. The identity of such a record is established by three facts. The record needs to have only dashes ('---------'), the next record after that must have all stars ('********') and the prior record must be blank or null. The value of Seq_No from this CTE would then be used in the next CTE to find the starting and ending positions (Seq_No) of bad records.

Having found the starting position of each musician's record in CTE_Find_Start_Rows and the ending position of each bad record in CTE_Last_Row_To_Delete, it is now time to find the starting position of each bad record and ending position of each bad record by joining the first two CTE's. This is accomplished in the common table expression, CTE_Range_to_Delete. Since the Seq_No value of the ending row of a bad record is known from CTE_Last_Row_To_Delete, all that needs to done is to find the maximum value of a Seq_No value with all starts ('*******') in the CData column that is less that the Seq_no value of the ending row of a bad record. This gives us the range of Seq_No values that need to be deleted. Please note that we need this range to delete the entire bad record no matter how many rows it contains.

The final step in the code is to delete all rows falling within the start and end of a bad record. In other words, all Seq_No's are deleted from our main table, dbo.Test_Table that fall within the Delete_Start and Delete_End values from the common Table Expression, CTE_Range_To_Delete. At the same time, OUTPUT clause is being used within the DELETE statement to insert the bad deleted records into the table, dbo.Test_Table_Bad_Records. The contents of the table, dbo.Test_Table_Bad_Records would be sent back to the vendor for corrections in RateQuote for the musicians Bob and Spencer.

Let's verify the contents of the table, dbo.Test_Table_Bad_Records by executing the following TSQL code:

select * from dbo.Test_Table_Bad_Records   ---Bad records
select * from dbo.Test_Table               ---Good records   

Please note that the records for musicians Bob (Seq_No from 23 to 32) and Spencer (Seq_No from 43 to 52) have been deleted from dbo.Test_Table and have been stored in the table, dbo.Test_Table_Bad_Records.

Conclusion

CTE's (common table expressions) are a nice way of navigating through different data sets constructed out of one data set. They help analytics by looking at the same data from multiple different angles in the same piece of code. OUTPUT clause is a nice option for auditing altered data as shown in this example. In the next article, we would use recursive CTE's to solve another TSQL puzzle.

Rate

2.19 (21)

You rated this post out of 5. Change rating

Share

Share

Rate

2.19 (21)

You rated this post out of 5. Change rating