Overview
A common requirement in ETL scenarios is to extract data from a single file that has multiple record types with those records having parent-child relationships or a single record that can have one or more detail records that relate to it. SQL Server Integration Services (SSIS) offers a number of solutions for this scenario.
In this article, I will demonstrate a data flow solution for this scenario through the use of conditional split, script component (using VB.NET) and merge join. The input file is a comma separated sales report. The parent records detail an item bought and it's attributes, while the child records contain information about customers who have purchased that item in a single field and information regarding their purchases (e.g. price, quantity) The methodology will be to identify "good records" and allow those through the data flow. Next, a script component will be created that will assign a parent-child key to each record and direct each row to it's respective output. After each row has been directed to a separate output, I will need to scrub the customer address field using another script component. Finally, both the customer and product records will be sorted by their parent-child key and merged together using a merge join transformation and output into a SQL table.
Implementation
First, let's take a look at the sample file to be loaded
Figure 1 - Sample file
As you can see, there are two header records that contain column names: product and customer respectively. The third row has product information, while the fourth and fifth rows have customer information. The sixth row contains summary information that is not needed for this transformation.
The next step will be setting up a connection manager to extract this file. Notice that I have put double quotes (") in as a text qualifier. This will ensure that all of the customer information is captured in a single field and not delimited by 'City, State'.
Figure 2 - Flat File Connection Manager Editor
To extract only non-blank records into the actual data flow, I will add a conditional split to the transformation. The condition will check the first column to see if there is any data contained in that column. If there is data in that column the transformation sends the record to the output 'GoodRecs'.
Figure 3 - Conditional Split Transformation- excludes blank records
Once the conditional split is complete, drag and drop a 'Script Component' transformation into the data flow and choose type 'Transformation'. Select all columns under Input Columns then choose the 'Inputs and Outputs' view. Here, add 3 Outputs (Product, Customer & Summary) and add an output column to both the Product (ProdKey) and Customer (CustKey) Outputs. For the common properties (right side of the editor) of each output, you will need to change the field 'ExclusionGroup' to match the exclusion group of the Input (typically 1) and choose 'input "YourInput" (001)' as the 'SynchronousInputID'.
Figure 4 - Script Transformation Editor - splitting records by type
Next, choose the 'Script' view and click 'Design Script' at the bottom right hand corner of this box. Copy the below script into the script task. I have created variables CustPattern and SumPattern that identify patterns unique to each record type. When directing a customer record to the customer output, this variable will be called to validate that the record is, in fact, a customer record.
Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper Public Class ScriptMain Inherits UserComponent Dim counter As Integer 'Record Pattern for Customer Records Dim CustPattern As String = "*;*" 'Record Pattern for Summary Records Dim SumPattern As String = "*(Entries*" 'Record Pattern for Product Records left blank Public Overrides Sub PreExecute() counter = 0 'Initialize counter = 0 End Sub Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer) 'Increment counter by 1 on each pass of blank second column If Row.Column1 <> "" Then counter += 1 End If counter = counter 'Set keys = counter value Row.CustKey = counter.ToString Row.ProdKey = counter.ToString 'Output records based on record patterns If Row.Column0 Like CustPattern Then Row.DirectRowToCustomerRecord() ElseIf Row.Column0 Like SumPattern Then Row.DirectRowToSummaryRecord() Else : Row.DirectRowToProductRecord() End If End Sub End Class
Now your data flow should look like Figure 5 below.
Figure 5 - Data Flow
Now, we need to scrub those nasty customer records with another script component. As before, create a 'Script Component' and choose type 'Transformation'. Select all columns under Input Columns then choose the 'Inputs and Outputs' view. Next, create an output called ScrubbedOuput and add output columns Zip, State, City, Address, CustomerName, CustomerNbr. For the common properties of each output, you will need to change the field 'ExclusionGroup' to match the exclusion group of the Input (typically 1) and choose 'input "YourInput" (001)' as the 'SynchronousInputID' just like before.
Figure 6 - Script Transformation Editor - scrub customer records
Next, choose the 'Script' view and click 'Design Script' at the bottom right hand corner. Copy the below script into your script task. Here, I have created the variable 'Scrub', just to make the script easier to read and modify.
Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper Public Class ScriptMain Inherits UserComponent Dim Scrub As String Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer) ' Scrub = Row.Column0 Row.CustomerNbr = Scrub.Substring(0, 8) Row.CustomerName = Scrub.Substring(8, Scrub.IndexOf(";", 1) - 8) Row.Address = Scrub.Substring(Scrub.IndexOf(";", 1) + 2, (Scrub.LastIndexOf(";")) - (Scrub.IndexOf(";", 1)) - 2) Row.City = Scrub.Substring(Scrub.LastIndexOf(";") + 2, (Scrub.LastIndexOf(",")) - (Scrub.LastIndexOf(";")) - 2) Row.State = Scrub.Substring(Scrub.LastIndexOf(",") + 2, 2) Row.Zip = Scrub.Substring(Scrub.LastIndexOf(",") + 6, 5) Row.DirectRowToScrubbedOutput() ' End Sub End Class
Now your data flow should look like Figure 7 below.
Figure 7 - Data Flow Task
Before we can merge these two record types together, we need to add 'Sort' transformations to both the ProductRecord flow and the CustomerRecord flow on the ProdKey and CustKey fields. Once that is complete, we can add a 'Merge Join' transformation joining the two record types on ProdKey = CustKey.
Figure 8 - Merge Join Transformation Editor - join on ProdKey = CustKey
Once the join is complete, add your destination. Now your data flow should look like Figure 9 below. With the package configuration completed, you can execute the package and watch the data flow to multiple record types and then back together to create a single flattened record.
Figure 9 - Final Data Flow Task
Conclusion
In this article, I've demonstrated how to use SQL Server Integration Services (SSIS) to create a data flow task that can handle multiple record types with a parent-child relationship. The transformation extracts non-blank records, creates a key for each record and splits that record by type and then merges the records back together to create a flattened record with a 1-to-1 relationship.