Clean data in a column and use an IF ELSE LOGIC

  • Hi I have a column (Driver) that has dirty data but there is info in the column which is usable.

    What I would like is to create two adjacent columns DriverDefault and DriverFix.

    In DriverDefault I would like to extract the clean data and default any spelling errors or random data to 'UNKNOWN'. There is a list of Clean data in the column I want to search for which includes;

    IN ('INSURED', 'REGULAR DRIVER', 'BROTHER', 'SISTER', 'SPOUSE', 'PARENT', 'SIBLING', 'DAUGHTER', 'FATHER', 'MOTHER','COUSIN','PARKED','FRIEND')

    For this column DriverDefault I'm assuming I would have to use some IF ELSE Logic such as: If Driver = 'INSURED' THEN 'INSURED' ELSIF Driver = PARENT THEN PARENT etc.....ELSE 'UNKNOWN'

    In the last column DriverFix I would like to perform some sort of data clean up. I have heard of algorithms that can clean for instance the misspelled word 'INSURRED' to 'INSURED' and so on.

    The algorithm must only clean once again

    IN ('INSURED', 'REGULAR DRIVER', 'BROTHER', 'SISTER', 'SPOUSE', 'PARENT', 'SIBLING', 'DAUGHTER', 'FATHER', 'MOTHER','COUSIN','PARKED','FRIEND')

    and the other random data needs to be defaulted to 'UNKNOWN'

    I have provided some sample data to clarify what I would like to do.

    Thanks in advance.

    --DROP TABLE #Driver

    CREATE TABLE #Driver (Driver varchar(max) , DriverDefault varchar(max),DriverFix varchar(max))

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSSURED','UNKNOWN','INSURED')

    INSERT INTO #Driver VALUES ('10138','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('PARKED','PARKED','PARKED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVER','REGULAR DRIVER','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('SAM','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('NULL','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('MOTHER - 5703230003083 (WILMA VERMAAK)','UNKNOWN','MOTHER')

    INSERT INTO #Driver VALUES ('MR LUCAS KRUGER-ID NUMBER ','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('MPHO','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('MR','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVER - PETER','UNKNOWN','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('PARKD','UNKNOWN','PARKED')

    INSERT INTO #Driver VALUES ('SPOUSE','SPOUSE','SPOUSE')

    INSERT INTO #Driver VALUES ('PARENT','PARENT','PARENT')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVVER','REGULAR DRIVVER','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED - MR JOHN','UNKNOWN','INSURED')

    INSERT INTO #Driver VALUES ('INNSURRED','UNKNOWN','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('SPOUSE','SPOUSE','SPOUSE')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('FATTHER','UNKNOWN','FATHER')

    Select * from #Driver

  • mic.con87 (2/29/2012)


    Hi I have a column (Driver) that has dirty data but there is info in the column which is usable.

    What I would like is to create two adjacent columns DriverDefault and DriverFix.

    In DriverDefault I would like to extract the clean data and default any spelling errors or random data to 'UNKNOWN'. There is a list of Clean data in the column I want to search for which includes;

    IN ('INSURED', 'REGULAR DRIVER', 'BROTHER', 'SISTER', 'SPOUSE', 'PARENT', 'SIBLING', 'DAUGHTER', 'FATHER', 'MOTHER','COUSIN','PARKED','FRIEND')

    For this column DriverDefault I'm assuming I would have to use some IF ELSE Logic such as: If Driver = 'INSURED' THEN 'INSURED' ELSIF Driver = PARENT THEN PARENT etc.....ELSE 'UNKNOWN'

    In the last column DriverFix I would like to perform some sort of data clean up. I have heard of algorithms that can clean for instance the misspelled word 'INSURRED' to 'INSURED' and so on.

    The algorithm must only clean once again

    IN ('INSURED', 'REGULAR DRIVER', 'BROTHER', 'SISTER', 'SPOUSE', 'PARENT', 'SIBLING', 'DAUGHTER', 'FATHER', 'MOTHER','COUSIN','PARKED','FRIEND')

    and the other random data needs to be defaulted to 'UNKNOWN'

    I have provided some sample data to clarify what I would like to do.

    Thanks in advance.

    --DROP TABLE #Driver

    CREATE TABLE #Driver (Driver varchar(max) , DriverDefault varchar(max),DriverFix varchar(max))

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSSURED','UNKNOWN','INSURED')

    INSERT INTO #Driver VALUES ('10138','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('PARKED','PARKED','PARKED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVER','REGULAR DRIVER','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('SAM','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('NULL','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('MOTHER - 5703230003083 (WILMA VERMAAK)','UNKNOWN','MOTHER')

    INSERT INTO #Driver VALUES ('MR LUCAS KRUGER-ID NUMBER ','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('MPHO','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('MR','UNKNOWN','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVER - PETER','UNKNOWN','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('PARKD','UNKNOWN','PARKED')

    INSERT INTO #Driver VALUES ('SPOUSE','SPOUSE','SPOUSE')

    INSERT INTO #Driver VALUES ('PARENT','PARENT','PARENT')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVVER','REGULAR DRIVVER','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED - MR JOHN','UNKNOWN','INSURED')

    INSERT INTO #Driver VALUES ('INNSURRED','UNKNOWN','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('SPOUSE','SPOUSE','SPOUSE')

    INSERT INTO #Driver VALUES ('INSURED','INSURED','INSURED')

    INSERT INTO #Driver VALUES ('FATTHER','UNKNOWN','FATHER')

    Select * from #Driver

    Well, the DriverDefault column is easy. Add the column to the table if doesn't already exist. Then

    UPDATE #Driver

    set DriverDefault = CASE WHEN Driver IN (<your list of acceptable values>) THEN Driver ELSE 'UNKNOWN' END

    The second part is much trickier. The conventional wisdom, with which I mainly agree, is that the kind of pattern matching and text replacement you need to do is more efficiently done in a CLR language. You may be able to find commercially available software that will do this, but a competent C# programmer should be able to do it with reasonable effort. The actual algorithms you would use depend on your use case, but there are many out there for identfying strings that match words but for transposed characters, extra characters/missing characters, common letter substitutions, etc. I have seen this done in T-SQL to identify strings that differ from words by only one character more or less or a single character transposition, but I think anything more sophisticated that that would be extremely clunky and practically useless in any high-volume processes.

    Jason Wolfkill

  • It seems to me that there is a parsing question here.

    When using the IN for string comparision, it will be using the database collation.

    Default Collation of the SQL Server installation SQL_Latin1_General_CP1_CI_AS is not case sensitive.

    If this is not the case, then leave the UPPER() words in the case statement below.

    I would use the LIKE, not IN for the search since "MOTHER ..." will not match.

    If this is a one time process, then run a update to clear your data. If data is being added to the table on a constant basis, then use a trigger to categorize the data when added.

    The sample select for a Driver Default is below. Convert to UPDATE statement after you verify it does what you want.

    //

    // Sample select

    //

    SELECT

    Driver,

    CASE

    WHEN UPPER(Driver) LIKE '%INSURED%' THEN 'INSURED'

    WHEN UPPER(Driver) LIKE '%REGULAR DRIVER%' THEN 'REGULAR DRIVER'

    WHEN UPPER(Driver) LIKE '%BROTHER%' THEN 'BROTHER'

    WHEN UPPER(Driver) LIKE '%SISTER%' THEN 'SISTER'

    WHEN UPPER(Driver) LIKE '%SPOUSE%' THEN 'SPOUSE'

    WHEN UPPER(Driver) LIKE '%PARENT%' THEN 'PARENT'

    WHEN UPPER(Driver) LIKE '%SIBLING%' THEN 'SIBLING'

    WHEN UPPER(Driver) LIKE '%DAUGHTER%' THEN 'DAUGHTER'

    WHEN UPPER(Driver) LIKE '%FATHER%' THEN 'FATHER'

    WHEN UPPER(Driver) LIKE '%MOTHER%' THEN 'MOTHER'

    WHEN UPPER(Driver) LIKE '%COUSIN%' THEN 'COUSIN'

    WHEN UPPER(Driver) LIKE '%PARKED%' THEN 'PARKED'

    WHEN UPPER(Driver) LIKE '%FRIEND%' THEN 'FRIEND'

    ELSE

    'UNKNOWN'

    END AS DriverDefault

    FROM

    #Driver

    Check out my blog for FULL TEXT Searching. The FTS engine is another way to go.

    http://craftydba.com/?p=1664

    But if you are just categorizing data when added or one time, a trigger will do the trick.

    Good luck.

    John Miner

    http://www.craftydba.com

    John Miner
    Crafty DBA
    www.craftydba.com

  • Thanks for the help and advice.

    John when I run CREATE CLUSTERED INDEX IDX_incident_driver ON MyTable (incident_driver);

    I receive this error:

    Warning! The maximum key length is 900 bytes. The index 'IDX_incident_driver' has maximum length of 2000 bytes. For some combination of large values, the insert/update operation will fail.

    Msg 1946, Level 16, State 3, Line 1

    Operation failed. The index entry of length 901 bytes for the index 'IDX_ncident_driver' exceeds the maximum length of 900 bytes.

    I tried to change the data type to varchar 900(Think this is what the error message was saying) but received this error:

    Msg 8152, Level 16, State 14, Line 1

    String or binary data would be truncated.

    Any suggestions?

  • Since you are parsing the column, adding a clustered index will not help since the optimizer will probably do a full table scan during that operation.

    You can add an identity column as a sorrogate key. Then either use a full text index or do categorization. The categorization can be one time with a trigger to maintain the data going forward. These two options are covered in my blog.

    I guess my question to you is why are you adding a clustered index?

    John Miner

    http://www.craftydba.com

    John Miner
    Crafty DBA
    www.craftydba.com

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply