SSIS 2008R2 - Data Quality

  • OK,

so I have a flat file. I have set all the columns in the flat file connection manager to Varchar (a lot of them). This should mean I get very few errors loading into the staging tables.

I catch any errors on loading the file to staging and, if all is OK, go on to loading into the warehouse. Now my question is this: when going from staging to warehouse I will need to do transformations on a lot of the columns to get the data into the right format.

    Is there a best practice for doing this?

I am planning to use a SQL statement for the source, so that may do a lot of the work for me. Following that, a derived column and a conditional split to catch any data items that are an illegal size.
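For example, the source query could attempt the casts itself and return NULL wherever a value won't convert, so the conditional split only has to look for NULLs. This is just a sketch with made-up table and column names, and it uses the CASE-before-CAST pattern because 2008 R2 has no TRY_CONVERT:

    SELECT
        -- digits only and at most nine of them, otherwise NULL
        -- (CASE guarantees the check runs before the CAST)
        CAST(CASE WHEN OrderId NOT LIKE '%[^0-9]%' AND LEN(OrderId) BETWEEN 1 AND 9
                  THEN OrderId END AS INT)              AS OrderId,
        -- style 112 = YYYYMMDD; ISDATE guards against things like 20080230
        CASE WHEN ISDATE(OrderDate) = 1
             THEN CONVERT(DATETIME, OrderDate, 112) END AS OrderDate,
        -- trim and truncate to the warehouse column size
        LEFT(LTRIM(RTRIM(CustomerName)), 50)            AS CustomerName
    FROM dbo.StagingOrders;

Anything that fails its check comes through as NULL, and the conditional split can route those rows to an error output.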

    Thanks

    Ells

    😛

Wait, you write the data into a stage table as varchar and then pull it back out to analyze it and write it back to another SQL table?

    I'm not a fan of best practice questions, but I would not do it this way.

I chose to import the text files as very big varchar fields. This gives the file import the best chance of success, and once the data is in staging I am in the best position to deal with any issues. If there is a problem with a field I have various options: some fields I may truncate if too big; for other fields with issues I may reject the row or the whole file.

Once the data is in, at least I have a chance to deal with it. If the file itself is rejected for wrong size/format of data then I can't really help with the diagnosis.

What I am looking for is the best way to ensure that, post transformation, the fields match criteria such as fitting into an int data type, YYYYMMDD for dates, or varchar(n). It looks like either a big conditional split or a heavily coded stored proc that works through each column and gets the data formatting definition from a table to enforce the data type. I cannot use the column definitions of the data store table to drive the data types, as it has to have different column definitions.
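For what it's worth, on 2008 R2 (no TRY_CONVERT until 2012) those per-column checks can be written as a single validation pass over the staging table, something like this sketch; all table and column names here are invented:

    SELECT
        s.RowId,
        -- fits into an int: non-empty, digits only, at most nine digits
        CASE WHEN s.Quantity LIKE '%[^0-9]%' OR LEN(s.Quantity) NOT BETWEEN 1 AND 9
             THEN 'Quantity is not a valid int' END          AS QuantityError,
        -- YYYYMMDD: exactly eight digits and a real calendar date
        CASE WHEN s.OrderDate NOT LIKE '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]'
                  OR ISDATE(s.OrderDate) = 0
             THEN 'OrderDate is not YYYYMMDD' END            AS OrderDateError,
        -- varchar(n): simple length check against the target column
        CASE WHEN LEN(s.CustomerName) > 50
             THEN 'CustomerName longer than varchar(50)' END AS CustomerNameError
    FROM dbo.StagingOrders AS s;

Rows where every error column is NULL are clean; the rest carry a usable error description. The length limits and patterns could equally come from a formatting-definition table and be applied via dynamic SQL, which would be the metadata-driven version of the same idea.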

    Still looking for a way forward

    M

    😎

I'm not going to say there is anything wrong with that; I'm just going to offer an alternative that has worked for me. I set all of the source data types to strings on my source connection, then send this to a Derived Column, which attempts to cast the data to the correct SQL data type. If a cast fails for whatever reason (invalid data, truncation), I write those rows to an error table (or file) that stores all the values as large string fields. I then send the remaining (valid) rows to the database destination, confident that the values will fit into whatever data type each column happens to be.
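To make that concrete, the Derived Column expressions are just plain SSIS casts; the column names here are invented, and the source columns are assumed to be strings:

    (DT_I4)Quantity
    (DT_DBDATE)(SUBSTRING(OrderDate,1,4) + "-" + SUBSTRING(OrderDate,5,2) + "-" + SUBSTRING(OrderDate,7,2))

On the component's Configure Error Output dialog, set both Error and Truncation to "Redirect row" and point the red error path at the error table. The error path also carries ErrorCode and ErrorColumn, which help with diagnosis later.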

  • Thanks,

I will have a look at that. I am currently playing with a conditional split, then derived columns, then a union all to put it back into the error table. That way I get to keep the row, and a good error description. Would prefer something less hectic on the design screen.

    Cheers

    Mark
