Remove non-unicode characters from a column

  • hi ,
    Here is the function i found in google.

    DROP Function [dbo].[RemoveNonAlphaCharacters]
    GO
    Create Function [dbo].[RemoveNonAlphaCharacters](@Temp VarChar(4000))
    Returns VarChar(1000)
    AS
    Begin

      Declare @KeepValues as varchar(100) = '%[^a-zA-Z0-9]%'
      While PatIndex(@KeepValues, @Temp) > 0
       Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')

      Return @Temp
    End

    Declare @string nvarchar(200)
    SET @string= N'資料來源聯博 No payment ashdk'

    SELECT [dbo].[RemoveNonAlphaCharacters](@string)

    Result:  Nopaymentashdk

    But wanted the result  : No payment ashdk

    From above function how can i get the expected result . The function is replacing the spaces too .

  • The problem is that you're only leaving letters and numbers instead  of all non-unicode characters.
    The following function uses several techniques such as a tally table,and using FOR XML to concatenate characters. It's also being called in a different way to improve performance.

    CREATE FUNCTION [dbo].[RemoveNonUnicodeChars]
    (
      @string nvarchar(4000)
    )
    RETURNS TABLE WITH SCHEMABINDING
    AS
    RETURN
    WITH
    E(n) AS(
      SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
    ),
    E2(n) AS(
      SELECT a.n FROM E a, E b
    ),
    E4(n) AS(
      SELECT a.n FROM E2 a, E2 b
    ),
    cteTally(n) AS(
      SELECT TOP(LEN(@string))ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
      FROM E4
    )
    SELECT STUFF(( SELECT SUBSTRING( @string, n, 1)
    FROM cteTally
    WHERE SUBSTRING( @string, n, 1) = CAST(SUBSTRING( @string, n, 1) AS char(1))
        FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') AS CleanString;
    GO
    Declare @sample table(
      string nvarchar(200))
    INSERT INTO @sample VALUES( N'?????? No payment ashdk');

    SELECT *
    FROM @sample s
    CROSS APPLY [dbo].[RemoveNonUnicodeChars](s.string) rnuc;

    GO
    DROP FUNCTION [dbo].[RemoveNonUnicodeChars]

    References:
    http://dwaincsql.com/2014/03/27/tally-tables-in-t-sql/
    http://qa.sqlservercentral.com/articles/comma+separated+list/71700/

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Try this...

    Declare @string nvarchar(200)
    SET @string= N'?????? No payment ashdk';

    SELECT REPLACE(CAST(@string AS VARCHAR(200)), CHAR(63), '');

  • Luis Cazares - Friday, June 2, 2017 10:50 AM

    The problem is that you're only leaving letters and numbers instead  of all non-unicode characters.
    The following function uses several techniques such as a tally table,and using FOR XML to concatenate characters. It's also being called in a different way to improve performance.

    CREATE FUNCTION [dbo].[RemoveNonUnicodeChars]
    (
      @string nvarchar(4000)
    )
    RETURNS TABLE WITH SCHEMABINDING
    AS
    RETURN
    WITH
    E(n) AS(
      SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
    ),
    E2(n) AS(
      SELECT a.n FROM E a, E b
    ),
    E4(n) AS(
      SELECT a.n FROM E2 a, E2 b
    ),
    cteTally(n) AS(
      SELECT TOP(LEN(@string))ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
      FROM E4
    )
    SELECT STUFF(( SELECT SUBSTRING( @string, n, 1)
    FROM cteTally
    WHERE SUBSTRING( @string, n, 1) = CAST(SUBSTRING( @string, n, 1) AS char(1))
        FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') AS CleanString;
    GO
    Declare @sample table(
      string nvarchar(200))
    INSERT INTO @sample VALUES( N'資料來æºè¯åš No payment ashdk');

    SELECT *
    FROM @sample s
    CROSS APPLY [dbo].[RemoveNonUnicodeChars](s.string) rnuc;

    GO
    DROP FUNCTION [dbo].[RemoveNonUnicodeChars]

    References:
    http://dwaincsql.com/2014/03/27/tally-tables-in-t-sql/
    http://qa.sqlservercentral.com/articles/comma+separated+list/71700/

    How can we modifythis function to retrieve the column value ?
    SELECT [dbo].[RemoveNonUnicodeChars](Contractplan) FROm [Stage].[Price] where stageid ='816299'

    It throws error saying :Cannot find either column "dbo" or the user-defined function or aggregate "dbo.RemoveNonUnicodeChars", or the name is ambiguous.

  • Jason A. Long - Friday, June 2, 2017 10:52 AM

    Try this...

    Declare @string nvarchar(200)
    SET @string= N'資料來æºè¯åš No payment ashdk';

    SELECT REPLACE(CAST(@string AS VARCHAR(200)), CHAR(63), '');

    I've always liked the idea of this approach. The trick, of course, is what to do with any question marks in the original string.

    For most cases, you could get away with first replacing the question marks in the original string with something very, very uncommon, then doing the convert, removing the introduced question marks, and then replacing your uncommon character with question marks to reintroduce the originals.

    If you want something to work in 100% of cases, though, you have to do something fancy to dynamically determine a character/set of characters that doesn't exist in the original string (as a safe placeholder for the original question marks).

    By the time that's all said and done, though, the approach isn't really all that simpler or better performing than using the tally table/STUFF/FOR XML approach.

    Still fun to work out, though 🙂

  • komal145 - Friday, June 2, 2017 11:11 AM

    How can we modifythis function to retrieve the column value ?
    SELECT [dbo].[RemoveNonUnicodeChars](Contractplan) FROm [Stage].[Price] where stageid ='816299'

    It throws error saying :Cannot find either column "dbo" or the user-defined function or aggregate "dbo.RemoveNonUnicodeChars", or the name is ambiguous.

    There's an example on how to use it in the code that I posted. Did you even tried to understand the code?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Jacob Wilkins - Friday, June 2, 2017 11:16 AM

    I've always liked the idea of this approach. The trick, of course, is what to do with any question marks in the original string.

    For most cases, you could get away with first replacing the question marks in the original string with something very, very uncommon, then doing the convert, removing the introduced question marks, and then replacing your uncommon character with question marks to reintroduce the originals.

    If you want something to work in 100% of cases, though, you have to do something fancy to dynamically determine a character/set of characters that doesn't exist in the original string (as a safe placeholder for the original question marks).

    By the time that's all said and done, though, the approach isn't really all that simpler or better performing than using the tally table/STUFF/FOR XML approach.

    Still fun to work out, though 🙂

    Total brain fart on my part. Didn't even take into account the possibility of actual ?'s in the original string. :blush:
    As far as it being more or less simple than the tally table/STUFF/FOR XML approach... It should still be significantly faster. While it's far faster than loops/cursors it's still has a significant expense when compared to a CAST and a few REPLACEs.

  • Luis Cazares - Friday, June 2, 2017 11:28 AM

    komal145 - Friday, June 2, 2017 11:11 AM

    How can we modifythis function to retrieve the column value ?
    SELECT [dbo].[RemoveNonUnicodeChars](Contractplan) FROm [Stage].[Price] where stageid ='816299'

    It throws error saying :Cannot find either column "dbo" or the user-defined function or aggregate "dbo.RemoveNonUnicodeChars", or the name is ambiguous.

    There's an example on how to use it in the code that I posted. Did you even tried to understand the code?

    The code looks simple but it hard to understand but still trying what is really doing to remove such unicodes

  • Updated to handle ? characters in the original string...

    Declare @string nvarchar(200)
    SET @string= N'?????? No payment ashdk ?&';

    SELECT REPLACE(REPLACE(CAST(REPLACE(@string, '?', '```') AS VARCHAR(200)), CHAR(63), ''), '```', '?');
    GO

  • Jason A. Long - Friday, June 2, 2017 11:45 AM

    Jacob Wilkins - Friday, June 2, 2017 11:16 AM

    I've always liked the idea of this approach. The trick, of course, is what to do with any question marks in the original string.

    For most cases, you could get away with first replacing the question marks in the original string with something very, very uncommon, then doing the convert, removing the introduced question marks, and then replacing your uncommon character with question marks to reintroduce the originals.

    If you want something to work in 100% of cases, though, you have to do something fancy to dynamically determine a character/set of characters that doesn't exist in the original string (as a safe placeholder for the original question marks).

    By the time that's all said and done, though, the approach isn't really all that simpler or better performing than using the tally table/STUFF/FOR XML approach.

    Still fun to work out, though 🙂

    Total brain fart on my part. Didn't even take into account the possibility of actual ?'s in the original string. :blush:
    As far as it being more or less simple than the tally table/STUFF/FOR XML approach... It should still be significantly faster. While it's far faster than loops/cursors it's still has a significant expense when compared to a CAST and a few REPLACEs.

    I agree that it has the potential to be faster. The trick is finding a suitably inexpensive way to find the safe placeholder for the original question marks.

    I've only messed with this once a while back, but I remember the only method I tried that was 100% safe ended up being slower than the tally/STUFF/FOR XML method.

    I might have (probably) missed some clever tricks then, though. I'll have to revisit to see if I can come up with something better this time around.

    Cheers!

    EDIT: A solution like your most recent post is mostly safe, but not 100% safe like the other approach; In theory, the original string might also contain '```', or any other fixed placeholder you choose (highly unlikely, of course, but still, 100% accuracy is nice :))

  • Jacob Wilkins - Friday, June 2, 2017 12:11 PM

    EDIT: A solution like your most recent post is mostly safe, but not 100% safe like the other approach; In theory, the original string might also contain '```', or any other fixed placeholder you choose (highly unlikely, of course, but still, 100% accuracy is nice :))

    I suppose there's a trade-off to be had... I don't think it's outrageous to assume that there are ASCII character combinations that will be so unlikely to exist as to assume that they would never exist in the existing text. You could use a GUID with the dashes replaced with tildes for example...
    Given the performance difference, I (personally) could live with 99.999999999% safety...

  • Jacob Wilkins - Friday, June 2, 2017 12:11 PM

    Jason A. Long - Friday, June 2, 2017 11:45 AM

    Jacob Wilkins - Friday, June 2, 2017 11:16 AM

    I've always liked the idea of this approach. The trick, of course, is what to do with any question marks in the original string.

    For most cases, you could get away with first replacing the question marks in the original string with something very, very uncommon, then doing the convert, removing the introduced question marks, and then replacing your uncommon character with question marks to reintroduce the originals.

    If you want something to work in 100% of cases, though, you have to do something fancy to dynamically determine a character/set of characters that doesn't exist in the original string (as a safe placeholder for the original question marks).

    By the time that's all said and done, though, the approach isn't really all that simpler or better performing than using the tally table/STUFF/FOR XML approach.

    Still fun to work out, though 🙂

    Total brain fart on my part. Didn't even take into account the possibility of actual ?'s in the original string. :blush:
    As far as it being more or less simple than the tally table/STUFF/FOR XML approach... It should still be significantly faster. While it's far faster than loops/cursors it's still has a significant expense when compared to a CAST and a few REPLACEs.

    I agree that it has the potential to be faster. The trick is finding a suitably inexpensive way to find the safe placeholder for the original question marks.

    I've only messed with this once a while back, but I remember the only method I tried that was 100% safe ended up being slower than the tally/STUFF/FOR XML method.

    I might have (probably) missed some clever tricks then, though. I'll have to revisit to see if I can come up with something better this time around.

    Cheers!

    EDIT: A solution like your most recent post is mostly safe, but not 100% safe like the other approach; In theory, the original string might also contain '```', or any other fixed placeholder you choose (highly unlikely, of course, but still, 100% accuracy is nice :))

    Try using CHAR(7). or CHAR(127).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • komal145 - Friday, June 2, 2017 11:52 AM

    Luis Cazares - Friday, June 2, 2017 11:28 AM

    komal145 - Friday, June 2, 2017 11:11 AM

    How can we modifythis function to retrieve the column value ?
    SELECT [dbo].[RemoveNonUnicodeChars](Contractplan) FROm [Stage].[Price] where stageid ='816299'

    It throws error saying :Cannot find either column "dbo" or the user-defined function or aggregate "dbo.RemoveNonUnicodeChars", or the name is ambiguous.

    There's an example on how to use it in the code that I posted. Did you even tried to understand the code?

    The code looks simple but it hard to understand but still trying what is really doing to remove such unicodes

    It's looking at each character individually.  It filters out the characters that don't match the Varchar equivalent of itself (Nvarchar p = Varchar p, Nvarchar åš = Varchar ?).  It reassembles them and returns the concatenated string.

    The  single biggest thing Luis did to help you is that he converted it into an inline table-valued function (ITVF) instead of a scalar function.  This is going to result in a huge difference in performance when you use it.

    It does require that you call it differently than your scalar function.  Luis included an example, so take a look at it and note the difference.  This is one of those moments when a real paradigm shift in thinking can occur if you want it to.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply