Remove non-unicode characters from a column

Question

Post reply

Remove non-unicode characters from a column

komal145

SSCrazy Eights

Points: 9892
More actions
June 2, 2017 at 10:38 am

#337449

hi ,
Here is the function i found in google.

DROP Function [dbo].[RemoveNonAlphaCharacters]
GO
Create Function [dbo].[RemoveNonAlphaCharacters](@Temp VarChar(4000))
Returns VarChar(1000)
AS
Begin

Declare @KeepValues as varchar(100) = '%[^a-zA-Z0-9]%'
While PatIndex(@KeepValues, @Temp) > 0
Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')

Return @Temp
End

Declare @string nvarchar(200)
SET @string= N'資料來源聯博 No payment ashdk'

SELECT [dbo].[RemoveNonAlphaCharacters](@string)

Result: Nopaymentashdk

But wanted the result : No payment ashdk

From above function how can i get the expected result . The function is replacing the spaces too .

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

Luis Cazares SSC Guru Points: 183686 More actions · Answer 1

The problem is that you're only leaving letters and numbers instead of all non-unicode characters.
The following function uses several techniques such as a tally table,and using FOR XML to concatenate characters. It's also being called in a different way to improve performance.
CREATE FUNCTION [dbo].[RemoveNonUnicodeChars] ( @string nvarchar(4000) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH E(n) AS( SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n) ), E2(n) AS( SELECT a.n FROM E a, E b ), E4(n) AS( SELECT a.n FROM E2 a, E2 b ), cteTally(n) AS( SELECT TOP(LEN(@string))ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n FROM E4 ) SELECT STUFF(( SELECT SUBSTRING( @string, n, 1) FROM cteTally WHERE SUBSTRING( @string, n, 1) = CAST(SUBSTRING( @string, n, 1) AS char(1)) FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') AS CleanString; GO Declare @sample table( string nvarchar(200)) INSERT INTO @sample VALUES( N'?????? No payment ashdk');



SELECT *
FROM @sample s
CROSS APPLY [dbo].[RemoveNonUnicodeChars](s.string) rnuc;

GO DROP FUNCTION [dbo].[RemoveNonUnicodeChars]
References:
http://dwaincsql.com/2014/03/27/tally-tables-in-t-sql/
http://qa.sqlservercentral.com/articles/comma+separated+list/71700/

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Jason A. Long SSC-Insane Points: 23731 More actions · Answer 2

Jason A. Long

SSC-Insane

Points: 23731

June 2, 2017 at 10:52 am

#1945165

Try this...
Declare @string nvarchar(200) SET @string= N'?????? No payment ashdk';

SELECT REPLACE(CAST(@string AS VARCHAR(200)), CHAR(63), '');

komal145 SSCrazy Eights Points: 9892 More actions · Answer 3

Luis Cazares - Friday, June 2, 2017 10:50 AM

The problem is that you're only leaving letters and numbers instead of all non-unicode characters.
The following function uses several techniques such as a tally table,and using FOR XML to concatenate characters. It's also being called in a different way to improve performance.
CREATE FUNCTION [dbo].[RemoveNonUnicodeChars] ( @string nvarchar(4000) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH E(n) AS( SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n) ), E2(n) AS( SELECT a.n FROM E a, E b ), E4(n) AS( SELECT a.n FROM E2 a, E2 b ), cteTally(n) AS( SELECT TOP(LEN(@string))ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n FROM E4 ) SELECT STUFF(( SELECT SUBSTRING( @string, n, 1) FROM cteTally WHERE SUBSTRING( @string, n, 1) = CAST(SUBSTRING( @string, n, 1) AS char(1)) FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') AS CleanString; GO Declare @sample table( string nvarchar(200)) INSERT INTO @sample VALUES( N'è³‡æ–™ä¾†æºè¯åš No payment ashdk'); SELECT * FROM @sample s CROSS APPLY [dbo].[RemoveNonUnicodeChars](s.string) rnuc;
GO DROP FUNCTION [dbo].[RemoveNonUnicodeChars]
References:
http://dwaincsql.com/2014/03/27/tally-tables-in-t-sql/
http://qa.sqlservercentral.com/articles/comma+separated+list/71700/

How can we modifythis function to retrieve the column value ?
SELECT [dbo].[RemoveNonUnicodeChars](Contractplan) FROm [Stage].[Price] where stageid ='816299'

It throws error saying :Cannot find either column "dbo" or the user-defined function or aggregate "dbo.RemoveNonUnicodeChars", or the name is ambiguous.

Jacob Wilkins One Orange Chip Points: 27972 More actions · Answer 4

Jason A. Long - Friday, June 2, 2017 10:52 AM

Try this...
Declare @string nvarchar(200) SET @string= N'è³‡æ–™ä¾†æºè¯åš No payment ashdk';
SELECT REPLACE(CAST(@string AS VARCHAR(200)), CHAR(63), '');

I've always liked the idea of this approach. The trick, of course, is what to do with any question marks in the original string.

For most cases, you could get away with first replacing the question marks in the original string with something very, very uncommon, then doing the convert, removing the introduced question marks, and then replacing your uncommon character with question marks to reintroduce the originals.

If you want something to work in 100% of cases, though, you have to do something fancy to dynamically determine a character/set of characters that doesn't exist in the original string (as a safe placeholder for the original question marks).

By the time that's all said and done, though, the approach isn't really all that simpler or better performing than using the tally table/STUFF/FOR XML approach.

Still fun to work out, though 🙂

Luis Cazares SSC Guru Points: 183686 More actions · Answer 5

komal145 - Friday, June 2, 2017 11:11 AM

How can we modifythis function to retrieve the column value ?
SELECT [dbo].[RemoveNonUnicodeChars](Contractplan) FROm [Stage].[Price] where stageid ='816299'

It throws error saying :Cannot find either column "dbo" or the user-defined function or aggregate "dbo.RemoveNonUnicodeChars", or the name is ambiguous.

There's an example on how to use it in the code that I posted. Did you even tried to understand the code?

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Jason A. Long SSC-Insane Points: 23731 More actions · Answer 6

Jacob Wilkins - Friday, June 2, 2017 11:16 AM

I've always liked the idea of this approach. The trick, of course, is what to do with any question marks in the original string.

For most cases, you could get away with first replacing the question marks in the original string with something very, very uncommon, then doing the convert, removing the introduced question marks, and then replacing your uncommon character with question marks to reintroduce the originals.

If you want something to work in 100% of cases, though, you have to do something fancy to dynamically determine a character/set of characters that doesn't exist in the original string (as a safe placeholder for the original question marks).

By the time that's all said and done, though, the approach isn't really all that simpler or better performing than using the tally table/STUFF/FOR XML approach.

Still fun to work out, though 🙂

Total brain fart on my part. Didn't even take into account the possibility of actual ?'s in the original string. :blush:
As far as it being more or less simple than the tally table/STUFF/FOR XML approach... It should still be significantly faster. While it's far faster than loops/cursors it's still has a significant expense when compared to a CAST and a few REPLACEs.

komal145 SSCrazy Eights Points: 9892 More actions · Answer 7

Luis Cazares - Friday, June 2, 2017 11:28 AM

komal145 - Friday, June 2, 2017 11:11 AM

How can we modifythis function to retrieve the column value ?
SELECT [dbo].[RemoveNonUnicodeChars](Contractplan) FROm [Stage].[Price] where stageid ='816299'

It throws error saying :Cannot find either column "dbo" or the user-defined function or aggregate "dbo.RemoveNonUnicodeChars", or the name is ambiguous.

There's an example on how to use it in the code that I posted. Did you even tried to understand the code?

The code looks simple but it hard to understand but still trying what is really doing to remove such unicodes

Jason A. Long SSC-Insane Points: 23731 More actions · Answer 8

Jason A. Long

SSC-Insane

Points: 23731

June 2, 2017 at 12:09 pm

#1945190

Updated to handle ? characters in the original string...
Declare @string nvarchar(200) SET @string= N'?????? No payment ashdk ?&';

SELECT REPLACE(REPLACE(CAST(REPLACE(@string, '?', '```') AS VARCHAR(200)), CHAR(63), ''), '```', '?'); GO

Jacob Wilkins One Orange Chip Points: 27972 More actions · Answer 9

Jason A. Long - Friday, June 2, 2017 11:45 AM

Jacob Wilkins - Friday, June 2, 2017 11:16 AM

I've always liked the idea of this approach. The trick, of course, is what to do with any question marks in the original string.

For most cases, you could get away with first replacing the question marks in the original string with something very, very uncommon, then doing the convert, removing the introduced question marks, and then replacing your uncommon character with question marks to reintroduce the originals.

If you want something to work in 100% of cases, though, you have to do something fancy to dynamically determine a character/set of characters that doesn't exist in the original string (as a safe placeholder for the original question marks).

By the time that's all said and done, though, the approach isn't really all that simpler or better performing than using the tally table/STUFF/FOR XML approach.

Still fun to work out, though 🙂

Total brain fart on my part. Didn't even take into account the possibility of actual ?'s in the original string. :blush:
As far as it being more or less simple than the tally table/STUFF/FOR XML approach... It should still be significantly faster. While it's far faster than loops/cursors it's still has a significant expense when compared to a CAST and a few REPLACEs.

I agree that it has the potential to be faster. The trick is finding a suitably inexpensive way to find the safe placeholder for the original question marks.

I've only messed with this once a while back, but I remember the only method I tried that was 100% safe ended up being slower than the tally/STUFF/FOR XML method.

I might have (probably) missed some clever tricks then, though. I'll have to revisit to see if I can come up with something better this time around.

Cheers!

EDIT: A solution like your most recent post is mostly safe, but not 100% safe like the other approach; In theory, the original string might also contain '```', or any other fixed placeholder you choose (highly unlikely, of course, but still, 100% accuracy is nice :))

Jason A. Long SSC-Insane Points: 23731 More actions · Answer 10

Jacob Wilkins - Friday, June 2, 2017 12:11 PM

EDIT: A solution like your most recent post is mostly safe, but not 100% safe like the other approach; In theory, the original string might also contain '```', or any other fixed placeholder you choose (highly unlikely, of course, but still, 100% accuracy is nice :))

I suppose there's a trade-off to be had... I don't think it's outrageous to assume that there are ASCII character combinations that will be so unlikely to exist as to assume that they would never exist in the existing text. You could use a GUID with the dashes replaced with tildes for example...
Given the performance difference, I (personally) could live with 99.999999999% safety...

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 11

Jacob Wilkins - Friday, June 2, 2017 12:11 PM

Jason A. Long - Friday, June 2, 2017 11:45 AM

Jacob Wilkins - Friday, June 2, 2017 11:16 AM

I've always liked the idea of this approach. The trick, of course, is what to do with any question marks in the original string.

For most cases, you could get away with first replacing the question marks in the original string with something very, very uncommon, then doing the convert, removing the introduced question marks, and then replacing your uncommon character with question marks to reintroduce the originals.

If you want something to work in 100% of cases, though, you have to do something fancy to dynamically determine a character/set of characters that doesn't exist in the original string (as a safe placeholder for the original question marks).

By the time that's all said and done, though, the approach isn't really all that simpler or better performing than using the tally table/STUFF/FOR XML approach.

Still fun to work out, though 🙂

Total brain fart on my part. Didn't even take into account the possibility of actual ?'s in the original string. :blush:
As far as it being more or less simple than the tally table/STUFF/FOR XML approach... It should still be significantly faster. While it's far faster than loops/cursors it's still has a significant expense when compared to a CAST and a few REPLACEs.

I agree that it has the potential to be faster. The trick is finding a suitably inexpensive way to find the safe placeholder for the original question marks.

I've only messed with this once a while back, but I remember the only method I tried that was 100% safe ended up being slower than the tally/STUFF/FOR XML method.

I might have (probably) missed some clever tricks then, though. I'll have to revisit to see if I can come up with something better this time around.

Cheers!

EDIT: A solution like your most recent post is mostly safe, but not 100% safe like the other approach; In theory, the original string might also contain '```', or any other fixed placeholder you choose (highly unlikely, of course, but still, 100% accuracy is nice :))

Try using CHAR(7). or CHAR(127).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Ed Wagner SSC Guru Points: 287015 More actions · Answer 12

komal145 - Friday, June 2, 2017 11:52 AM

Luis Cazares - Friday, June 2, 2017 11:28 AM

komal145 - Friday, June 2, 2017 11:11 AM

How can we modifythis function to retrieve the column value ?
SELECT [dbo].[RemoveNonUnicodeChars](Contractplan) FROm [Stage].[Price] where stageid ='816299'

It throws error saying :Cannot find either column "dbo" or the user-defined function or aggregate "dbo.RemoveNonUnicodeChars", or the name is ambiguous.

There's an example on how to use it in the code that I posted. Did you even tried to understand the code?

The code looks simple but it hard to understand but still trying what is really doing to remove such unicodes

It's looking at each character individually. It filters out the characters that don't match the Varchar equivalent of itself (Nvarchar p = Varchar p, Nvarchar åš = Varchar ?). It reassembles them and returns the concatenated string.

The single biggest thing Luis did to help you is that he converted it into an inline table-valued function (ITVF) instead of a scalar function. This is going to result in a huge difference in performance when you use it.

It does require that you call it differently than your scalar function. Luis included an example, so take a look at it and note the difference. This is one of those moments when a real paradigm shift in thinking can occur if you want it to.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions