Data cleanup - compress/replace chars in a field

  • We have a SQL 2000 stored procedure which will remove spaces, punctuation characters etc from a specified field (i.e. to cleanup/standardize business names temporarily before matching) but the performance is TERRIBLE.  It uses a cursor and steps through one record at a time then looks at one character at a time using PATINDEX and removing characters and updating the final result before going to the next record.  I have not been able to figure out a way to convert it to a batch process to speed it up.  I have 950,000 rows to process and using the stored procedure only completed 40,000 rows in 15 hours!  Does anyone have or know of a more efficient compress function or script.  The only option I can think of at this time is to export to text and use the SAS compress function on it then upload back to SQL Server.  Thanks, Larry Hilyard


    larry.hilyard@bcbsfl.com

  • Have you searched the script section here?

    Anyway, it would also help if you could post the procedure along with sample data and the desired output.

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • I did search the posts in the last year in the programming forum for any related items with no applicable result.  I am not familiar with the site layout so have not yet found a separate section with scripts (I only see a link to recent scripts on the opening site page).    Here is the desired input and output:

    BusNm (Input)                          (Desired Output)

    BRINK'S CO                              BRINKSCO

    SONNY'S REAL PIT BAR-B-Q        SONNYSREALPITBARBQ

    R J GATOR'S                             RJGATORS

    Here is a SAS statement that will yield the desired output:

    /* Update work tables in PCSAS */

    data temp;

    set Work.GdbDump;

    BusNm = compress(BusNm,' ."<>!#&,'); /* remove spaces and punctuation except sngl quote */

    BusNm = compress(BusNm,''''); /* remove single quotes */

    run;

    Here is the SQL Server stored procedure:

    CREATE PROCEDURE sp_compress2 @StringToCompress varchar(255), @ReturnVal varchar(255) OUTPUT

    AS

    /*********************************************************************

    sp_compress

    This stored procedure removes certain characters from a string and

    returns the result.

    Usage:

    DECLARE @FullName

    DECLARE @CompressedFullName

    EXECUTE sp_compress @FullName, @CompressedFullName OUTPUT

    If @FullName="DOE, JOHN M", then @CompressedFullName="DOEJOHNM"

    MODIFICATIONS HISTORY

    --------------------- 

    2000-03-13 PC  Creation

    2005-03-02 LH  Modify to remove single quote also

    *********************************************************************/

    SET NOCOUNT ON

    SET QUOTED IDENTIFIER OFF

    declare @ResultTx varchar(255)

    declare @CompressPosNb tinyint

    select @ResultTx = @StringToCompress

    /* Compress ' ' (spaces) */

    select @CompressPosNb = PATINDEX ('% %', @ResultTx)

    while @CompressPosNb <> 0

    begin

     select @ResultTx = substring(@ResultTx, 1, @CompressPosNb - 1) + substring(@ResultTx, @CompressPosNb + 1, len(@ResultTx))

     select @CompressPosNb = PATINDEX ('% %', @ResultTx)

    end

    /* Compress '-' */

    select @CompressPosNb = PATINDEX ('%-%', @ResultTx)

    while @CompressPosNb <> 0 /* and @CompressPosNb <> len(@ResultTx) */

    begin

     select @ResultTx = substring(@ResultTx, 1, @CompressPosNb - 1) + substring(@ResultTx, @CompressPosNb + 1, len(@ResultTx))

     select @CompressPosNb = PATINDEX ('%-%', @ResultTx)

    end

    /* Compress "'" */

    select @CompressPosNb = PATINDEX ("%'%", @ResultTx)

    while @CompressPosNb <> 0 /* and @CompressPosNb <> len(@ResultTx) */

    begin

     select @ResultTx = substring(@ResultTx, 1, @CompressPosNb - 1) + substring(@ResultTx, @CompressPosNb + 1, len(@ResultTx))

     select @CompressPosNb = PATINDEX ("%'%", @ResultTx)

    end

    /* Compress '.' (period) */

    select @CompressPosNb = PATINDEX ('%.%', @ResultTx)

    while @CompressPosNb <> 0

    begin

     select @ResultTx = substring(@ResultTx, 1, @CompressPosNb - 1) + substring(@ResultTx, @CompressPosNb + 1, len(@ResultTx))

     select @CompressPosNb = PATINDEX ('%.%', @ResultTx)

    end

    select @ReturnVal = @ResultTx

    set nocount off

    GO

    And finally, here is the SQL script which calls the stored procedure within a cursor:

    /* SQL Script to process table using stored procedure */

    SET ANSI_NULLS OFF

    SET NOCOUNT ON

    DECLARE @UnqId int

    DECLARE @FullName varchar(60)

    DECLARE @CompressedFullName  varchar(60)

    DECLARE @RowCt int

    DECLARE @RowTotalCt int

    DECLARE GetUnqId CURSOR FOR

     SELECT distinct UnqId from test

    OPEN GetUnqId

    FETCH GetUnqId INTO @UnqId

    select @RowCt = 0

    select @RowTotalCt = @@CURSOR_ROWS

    WHILE @@fetch_status = 0

    BEGIN

     select @FullName = null

     select @CompressedFullName  = null

     select @FullName =

         (select BusNm from test

          where UnqId = @UnqId)

     EXECUTE sp_compress2 @FullName, @CompressedFullName OUTPUT

     IF COALESCE(@CompressedFullName,'Y') <> 'Y'

     begin

      UPDATE test

      SET BusNm = @CompressedFullName

      WHERE UnqId=@UnqId AND BusNm = @FullName

     end

     /* Next UnqId */

     select @FullName  = null

     select @CompressedFullName   = null

     CHECKPOINT

     

     FETCH GetUnqId INTO @UnqId

     select @RowCt = @RowCt + 1

     if (@RowCt % 10000 = 0  OR  @RowCt = @RowTotalCt)

      select convert(char(25), getdate()) + 'Processed ' + convert(varchar(6), @RowCt) + ' rows of ' + convert(varchar(6), @RowTotalCt)

    END   /* (WHILE) */

    DEALLOCATE GetUnqId

    SET NOCOUNT OFF

    GO

    Thanks for looking at it.


    larry.hilyard@bcbsfl.com

  • Well, if it's just removing certain characters, I would recommend using REPLACE - this will work without cursor. You can nest several REPLACEs, which makes it a lot easier.

    SELECT REPLACE(BusNm, '-','') FROM YourTable

    removes all occurences of the minus sign (-). Syntax is replace WHERE, WHAT, WITH WHAT. If WITH WHAT is empty string (two single quotes), the character is simply removed.

    SELECT REPLACE(REPLACE(BusNm, '-',''),' ','') FROM YourTable

    removes all occurrences of both minus sign and space at once - and so on. Very simple and efficient. Just make sure you use the correct syntax when removing single quote, like here:

    SELECT REPLACE(BusNm, '''', '') FROM YourTable

    ... because single quote is text delimiter, so you have to double it inside the single quotes (i.e. put 2 single quotes there, not a double quote!) to tell SQL Server that this time it is meant as a character, not as a delimiter. Good luck!

  • Just a quick comment on this - removing data, especially characters like '-' and ' ', can leads to incorrect results. For example, 3 different companies:

    'SuperTech'

    'Super-Tech'

    'Super Tech'

    All become:

    'SuperTech'

     

    Just a thought....

  • Wanderer,

    as I understood Larry, this is not a permanent removal - it is used just to create a search pattern or matchcode or whatever you want to call it. In such case it is always better to skip certain characters, because people tend to differ in writing the same company name or address - someone uses 356/15, someone 356-15 as a house number, someone adds one more empty space between two words, but it is still the same house or company... or at least it could be the same. Results of such search are rarely used directly, in most cases if several are found, they are displayed to the user and user selects the one that is correct.

    Well, at least that's how we use the matchcode

  • Vladan,

    True enough, but given 3 validly (is there even such a word ? ) different companies, they might incorrectly match...

    As I said - just a thought...

    A more important thougt is ... IT'S FRIDAY. enjoy the weekend..

  • Sorry for being late on this.

    Actually,  I think there is more than one way to solve this.

    Given your sample data:

    set nocount on

    use tempdb

    create table #cleanme

    (

     c1 varchar(30)

    )

    insert into #cleanme values('BRINK''S CO')

    insert into #cleanme values('SONNY''S REAL PIT BAR-B-Q ')

    insert into #cleanme values('R J GATOR''S')

    set nocount off

    As has been mentioned, you can use REPLACE. So, something like this maybe:

    select replace(replace(replace(c1,'''',''),' ',''),'-','') from #cleanme

                         

    ---------------------

    BRINKSCO

    SONNYSREALPITBARBQ

    RJGATORS

    (3 row(s) affected)

    And, another way, which I would perhaps prefer as it cleans up all noncharacters, is to use a UDF like this

    CREATE FUNCTION dbo.RemoveNonChars(@Input varchar(1000))

     RETURNS VARCHAR(1000)

     BEGIN

      DECLARE @pos INT

      SET @Pos = PATINDEX('%[^a-Z]%',@Input)

      WHILE @Pos > 0

       BEGIN

        SET @Input = STUFF(@Input,@pos,1,'')

        SET @Pos = PATINDEX('%[^a-Z]%',@Input)

       END

      RETURN @Input

    END

    GO

    SELECT dbo.RemoveNonChars(c1) from #cleanme

                         

    ---------------------

    BRINKSCO

    SONNYSREALPITBARBQ

    RJGATORS

    (3 row(s) affected)

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Just another thaught on this. If this search is occuring often and considering the pretty great number of rows, wouldn't it be better just to create a new column with all the replaced characters. It would only need to be maintaned through triggers when the data is inserted or modified so that you wouldn't have to rerun the replace code every time someones enters a search. The great thing about this method is that you could use full-text indexing on this column which would yield far greater performance than any version of the replace function you can write and would be far less costly on the cpu to maintain.

  • Frank has a great point above removing all characters except for letters.  However, if you have a predefined list you want to remove, the technique below works really well.

    The big problem with the techniques above is that they are not set based.  You'll never get the performance you need with these techniques.  The one below, however, is, and it shows in the performance.

    -------------------------------------------------------------------------------------------

    use pubs

    create table MatchCodeRemoveChar

    (Name char(1) Primary Key

    )

     

    Insert MatchCodeRemoveChar

    select '<'

    union all

    select '>'

    union all

    select '!'

    union all

    select '#'

    union all

    select '&'

    union all

    select ','

    union all

    select ''''

    go

     

    create function getMatchCode (@MatchValue varchar(255))

    returns varchar(255)

    as

    BEGIN

    select @MatchValue = replace(@MatchValue, Name, '')

    From MatchCodeRemoveChar

    return @MatchValue

    END

    go

    select dbo.getMatchCode('M<A>T!C##H&C,O''D<>!#&,E')

    Signature is NULL

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply