Data Cleaning

  • I know this should be easy, but I can't seem to get it. I have a database with a table that has 800 million rows of data that contains any number of extra spaces and non-ASCII characters which are causing my front-end application to have issues. I need an SP that I can run that goes through replaces non-ASCII characters with spaces and reduces multi-blank spaces with single blank spaces. It needs to run search through each column of the table.

    On the non-ASCII characters: I have no idea how to search and replace for them. I would think that somewhere someone has a functions/SP that will do it, but not that I can find.

    On the blank spaces: This functions works great.

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    ALTER FUNCTION [dbo].[cleanString] (@s varchar(50))

    RETURNS varchar(50)

    AS

    BEGIN

    declare @i int, @l int

    declare @t varchar(50), @d varchar(50)

    select @S=LTRIM(RTRIM(@s))

    select @i=CHARINDEX(' ',@s) -1

    if @i<1 return @S

    set @t=''

    while @i>0

    begin

    set @d=LEFT(@s,@i)

    set @t = @t + ' ' + @d

    set @l=LEN(@s)

    set @S=LTRIM(right(@s,@l-@i-2))

    set @i=CHARINDEX(' ',@s)-1

    end

    RETURN ( ltrim(@t + ' ' + @S) )

    END

    However, It will only work on one column at a time. I need it to run across the entire row before moving on.

    In short, I need an SP that will take a row of data, replace non-ASCII characters with spaces, and then remove extra blank spaces.

    Thanks

  • --First, lets build some test data

    DECLARE @table TABLE(

    data VARCHAR(20));

    DECLARE @cnt INT

    --2,500,000 rows

    SET @cnt=2500000

    WHILE @cnt > 0

    BEGIN

    SET @cnt=@cnt - 1

    INSERT INTO @table

    SELECT Upper(LEFT(Newid(), 3)) + ' ' + '$' + Upper(LEFT(Newid(), 1)) +

    Lower

    (

    LEFT(Newid(), 12))

    END

    --Now the query

    ;WITH num1 (n)

    AS (SELECT 1

    UNION ALL

    SELECT 1),

    num2 (n)

    AS (SELECT 1

    FROM num1 AS data,

    num1 AS data2),

    num3 (n)

    AS (SELECT 1

    FROM num2 AS data,

    num2 AS data2),

    num4 (n)

    AS (SELECT 1

    FROM num3 AS data,

    num3 AS data2),

    nums (n)

    AS (SELECT Row_number() OVER(ORDER BY n)

    FROM num4),

    cleaner

    AS (SELECT data,

    (SELECT CASE

    WHEN Substring(data, n, 1) NOT LIKE '%[^a-zA-Z0-9_]%'

    THEN

    Substring(data, n, 1)

    ELSE ''

    END + ''

    FROM nums

    WHERE n <= Len(data)

    FOR XML PATH('')) AS new_data

    FROM @table)

    SELECT data,

    new_data

    FROM cleaner;

    How's that?

    /*Output

    data new_data

    -------------------- --------------------

    3B6 $063761ccd-1ae 3B6063761ccd1ae

    649 $99aa55943-14c 64999aa5594314c

    BFA $36426bb2d-0dd BFA36426bb2d0dd

    D45 $40f3b40c7-5cf D4540f3b40c75cf

    812 $0f8b0e0e7-cef 8120f8b0e0e7cef

    B7A $06006ba34-52f B7A06006ba3452f

    BCD $F6f27b65a-703 BCDF6f27b65a703

    824 $B0ea2c21a-021 824B0ea2c21a021

    A3A $Fdcac2ed5-306 A3AFdcac2ed5306

    9A4 $9e812cb39-c26 9A49e812cb39c26

    3CF $9cb7feeeb-18f 3CF9cb7feeeb18f

    221 $8c78a1fca-d0c 2218c78a1fcad0c

    D24 $Bdbac8ffa-8f3 D24Bdbac8ffa8f3

    064 $F7c34b3c0-dd8 064F7c34b3c0dd8

    154 $91a737a45-118 15491a737a45118

    */

    *EDIT*

    Sorry, I misread the question as "can I remove non-alphanumberic character". I'll have a look either later on or tomorrow if someone else hasn't already given you a query.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Right, modified. Not too happy with it, but it works.

    --First, lets build some test data

    DECLARE @table TABLE(

    data VARCHAR(20));

    DECLARE @cnt INT

    --2,500,000 rows

    SET @cnt=2500000

    WHILE @cnt > 0

    BEGIN

    SET @cnt=@cnt - 1

    INSERT INTO @table

    SELECT Upper(LEFT(Newid(), 3)) + ' ' + '$' + Upper(LEFT(Newid(), 1)) +

    Lower (LEFT(Newid(), 12))

    END

    --Now the query

    ;WITH num1 (n)

    AS (SELECT 1

    UNION ALL

    SELECT 1),

    num2 (n)

    AS (SELECT 1

    FROM num1 AS data,

    num1 AS data2),

    num3 (n)

    AS (SELECT 1

    FROM num2 AS data,

    num2 AS data2),

    num4 (n)

    AS (SELECT 1

    FROM num3 AS data,

    num3 AS data2),

    nums (n)

    AS (SELECT Row_number() OVER(ORDER BY n)

    FROM num4),

    cleaner

    AS (SELECT data,

    (SELECT CASE

    WHEN Substring(data, n, 1) LIKE '%[a-zA-Z0-9]%' THEN

    Substring(data, n, 1)

    ELSE ' '

    END + '~FAKE~'

    FROM nums

    WHERE Len(data) >= n

    FOR XML PATH('')) AS new_data

    FROM @table)

    SELECT data,

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(new_data, '%~FAKE~ ', ''),

    '~FAKE~', '')

    , ' ',

    '<>'), '><', ''), '<>', ' ') AS new_data

    FROM cleaner;

    /*OUTPUT

    data new_data

    -------------------- -----------------------

    CC1 $66b8c36dd-152 CC1 66b8c36dd 152

    752 $54fa19f95-227 752 54fa19f95 227

    743 $6b2d27a15-da6 743 6b2d27a15 da6

    79D $7206ec154-c00 79D 7206ec154 c00

    C3A $E03edea79-f7b C3A E03edea79 f7b

    5CD $9f354f659-177 5CD 9f354f659 177

    FE8 $22407f934-ef2 FE8 22407f934 ef2

    A5C $Dda09ace8-8e2 A5C Dda09ace8 8e2

    E50 $42bc1c44d-23c E50 42bc1c44d 23c

    5C3 $1ec89932d-d0f 5C3 1ec89932d d0f

    605 $F6afded0e-ecd 605 F6afded0e ecd

    68D $F3a5e446f-ef7 68D F3a5e446f ef7

    020 $E12e651bc-fbf 020 E12e651bc fbf

    7B3 $3b3f8db13-4ab 7B3 3b3f8db13 4ab

    562 $46108cf5e-785 562 46108cf5e 785

    */

    *EDIT*

    Spotted a bug with it 10mins after posting. Fixed now, should be working full.

    Let me know if that's right for you! :hehe:


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply