General Questions - Globalization and Unicode

  • I'd like to ask some of my colleagues what critical areas should be considered when the question of "Unicode/Globalization" comes up.  For those of you who have successfully converted your systems to Unicode, I salute you!

    I have a few questions for those of you who have been through this.  Any help would, naturally, be appreciated since we DBAs are very busy. 

    Are there any "gotchas" when we convert our char, varchar, etc. columns to nchar, nvarchar, and so on?  What general methods have some of you used to convert these column types?  Via EM, by changing the column attribute?  Or will this come back and "bite me" in some way?  (For example, when I change from char to nchar, will SQL Server alter the contents of the field in any way that I should be aware of?)  Or should a different method be used?


  • There are no gotchas in that way.

    The only gotcha is if you have used the DATALENGTH function on any strings to test the length: it returns an integer count of bytes, not characters, so the value doubles once the data is nchar/nvarchar. Otherwise you should be okay; I haven't seen any weirdness.
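
    To illustrate the byte-count point outside of SQL Server, here is a minimal Python sketch. It assumes a varchar column uses a single-byte code page (cp1252 is picked only as an example) and that an nvarchar column stores UTF-16LE; `stored_bytes` is a hypothetical helper, not a SQL Server function.

```python
def stored_bytes(text: str, unicode_column: bool) -> int:
    """Approximate byte count for text in a varchar column (assumed to be a
    single-byte code page) versus an nvarchar column (UTF-16LE)."""
    encoding = "utf-16-le" if unicode_column else "cp1252"
    return len(text.encode(encoding))


value = "Hello"
print(len(value))                  # 5 characters
print(stored_bytes(value, False))  # 5 bytes as varchar
print(stored_bytes(value, True))   # 10 bytes as nvarchar
```

    Any code that compares a DATALENGTH result against a character count would see the second figure after the conversion.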

  • Thanks!  That helps clear up a few questions I had.  Thanks for your post!  I do appreciate you taking time from your schedule to jot down a couple of notes!  It has been a fast week, as I'm sure all will agree.

  • (Post-scriptum caveat: I'm far from an expert. I'm just reporting some simple things of which I myself was not aware some years ago, in case they are of any help to others.)

    Things I'd suggest.

    * Be aware of the limitations of UCS-2LE

    My guess is this only affects GB18030 Chinese.

    * Realize that string length is not the same as byte length.

    Look up "composing characters" in Unicode if you're not familiar with them.

    * Realize that display text width is much more complex than character length (due to non-spacing characters).

    * Realize that string equality requires considering how you want to handle "composing characters".

    * Read up on the subject of UTF-8 encoding and minimal encoding and the security implications.

    * Realize that there are a few corner cases preventing uppercasing and lowercasing from being reversible (e.g., Greek sigma, Turkish i, German ß/SS).

    * Review caret & cursor behavior for right-to-left languages, and for the case of mixing, say, Arabic and French (i.e., mixing RTL and LTR text).
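
    Several of the points above (length versus bytes, composing characters, equality, UCS-2 limits, and case mapping) can be seen in a few lines of Python. This is only a sketch: the helper names are hypothetical, and comparing after NFC normalization is one possible equality policy, not the only one.

```python
import unicodedata

# "e with acute accent" written two ways: one precomposed code point,
# or a plain "e" followed by a combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def utf16_units(text: str) -> int:
    """Count UTF-16 code units, the unit a UCS-2/UTF-16 column stores."""
    return len(text.encode("utf-16-le")) // 2


print(precomposed == decomposed)           # False: different code points
print(nfc_equal(precomposed, decomposed))  # True: same text after NFC
print(len(precomposed), len(decomposed))   # 1 2
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3

# A character outside the Basic Multilingual Plane (a CJK Extension B
# ideograph) needs a surrogate pair, which plain UCS-2 cannot represent.
rare_han = "\U00020000"
print(len(rare_han), utf16_units(rare_han))  # 1 2

# Case mapping does not always round-trip: German sharp s.
sharp_s = "\u00df"
print(sharp_s.upper())          # SS
print(sharp_s.upper().lower())  # ss, not the original character
```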

  • In .NET, for string equality, implement the IComparer interface in the application layer; unlike the IComparable interface, it lets different types be compared.  UCS-2LE uses less space, and UTF-8 is almost ASCII under the covers.  Try the thread below, where I explained Chinese-specific Unicode in SQL Server on another forum.  Hope this helps.

    http://forums.asp.net/1067798/ShowPost.aspx
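
    As a rough check on the space claim, the following Python sketch compares byte counts for the same text in UTF-8 and UTF-16LE (used here as a stand-in for UCS-2LE, which is an assumption; `encoded_sizes` is a hypothetical helper). Which encoding is smaller depends on the text: UTF-16 wins for CJK, UTF-8 wins for ASCII.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and UTF-16LE."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }


ascii_text = "hello"
chinese_text = "\u4e2d\u6587"  # the two characters of "Chinese language"

print(encoded_sizes(ascii_text))    # {'utf-8': 5, 'utf-16-le': 10}
print(encoded_sizes(chinese_text))  # {'utf-8': 6, 'utf-16-le': 4}

# Pure ASCII text has identical bytes in ASCII and UTF-8.
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))  # True
```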

     

    Kind regards,

    Gift Peddie


  • But, on that thread you said "Dictionary order" for Chinese.

    Chinese dictionaries use several different orders, in my experience:

    First by radical, and then by stroke count

    (I think this is the most common order?)

    But I suppose this must be subdivided into two variants, based on

    whether traditional or simplified stroke counts are used

    By stroke count

    Again, I guess there are two variants of this

    Phonetic by pinyin Latin alphabetization

    Phonetic by bopomofo order

    When you say "dictionary order", do you mean "by radical then by stroke, using simplified stroke counts" ?
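
    To show why "dictionary order" is ambiguous, here is a small Python sketch that sorts the same five characters three ways. The stroke counts and pinyin readings are hand-entered for illustration only; a real collation would use full locale data, and radical-based ordering is omitted.

```python
# (character, stroke count, pinyin with tone number), entered by hand.
ENTRIES = [
    ("\u4eba", 2, "ren2"),   # person
    ("\u5927", 3, "da4"),    # big
    ("\u4e09", 3, "san1"),   # three
    ("\u6c34", 4, "shui3"),  # water
    ("\u4e00", 1, "yi1"),    # one
]


def by_code_point(entries):
    """Binary-style order: sort on the raw code point."""
    return [ch for ch, _, _ in sorted(entries, key=lambda e: e[0])]


def by_stroke_count(entries):
    """Stroke order, with ties broken by code point."""
    return [ch for ch, _, _ in sorted(entries, key=lambda e: (e[1], e[0]))]


def by_pinyin(entries):
    """Phonetic order by Latin alphabetization of the pinyin reading."""
    return [ch for ch, _, _ in sorted(entries, key=lambda e: e[2])]


print(by_code_point(ENTRIES))
print(by_stroke_count(ENTRIES))
print(by_pinyin(ENTRIES))
```

    The three calls return three different sequences for the same input, which is why the exact meaning of "dictionary order" matters when choosing a collation.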

  • Perry,

    Those codes are straight out of SQL Server 2000 BOL. I was interacting with someone who is Chinese, and it helped resolve his problems. I cannot go into detail with you now because in my current project my RDBMS (relational database management system) is Oracle 10g 64-bit.  I am still in the design stage, but it will be deployed in at least 33 countries.

    Kind regards,

    Gift Peddie

     


  • If that is a subtle way of saying that you don't know, that's fine. I'm only an amateur myself, in either the Chinese language or g11n, and I'm just posting because no one else seems to (plus, I had a real question about corruption, and I like to try to answer some other people's questions when I ask one of my own, out of a sense of fairness).

  • Perry,

    No, that is not my way of telling you I don't know. I know that sorting requires applying the equality operator to types, and SQL Server 2000 BOL (Books Online) says the best option is Binary, but SQL Server must be case-sensitive.  I am not an academic; I only know what works. There are six different Chinese sorts in SQL Server. You are running SQL Server and I am not, so run some tests.  Richard, the person I was helping, is of Chinese descent, and he said it is based on pronunciation, so I would assume Dictionary is based on the 2000-plus Chinese alphabet.  The point is that he said thanks and did not come back, which means his problem was solved.

    Kind regards,

    Gift Peddie


  • Sorry, I get it now. You were just copying and pasting text from Books Online and don't necessarily understand what you pasted, so I needn't be asking you to explain what you posted.

    I don't see that text "Dictionary Order" you quoted in my Books Online, but I do see some collations labelled stroke order, so it might be a poor way (by Microsoft, not you) to say that somewhere. Stroke order is not phonetic, but you've already explained that you don't use SQL Server, so it won't matter to you, so this is a nice fun pointless explanation, isn't it?

    In case anyone ever does read this -- although I hope not -- I'll summarize:

    My Books Online says:

    202 Chinese_Taiwan_Stroke_CS_AS

    so apparently 202 is collated on stroke order.

  • I am both MCSE and MCDBA certified, and I am a SQL Server expert, so your comment about me not using SQL Server is not relevant.  Richard is of Chinese descent, and I will take what he tells me about Chinese in SQL Server over what you say.  For people who will read just this page: the Windows code page is different, and I have posted the SQL Server info in the link I provided.  Here is that info.

    ID   Collation name               Description
    196  Chinese_Taiwan_Stroke_BIN    Binary order, for use with the 950 (Traditional Chinese) character set.
    197  Chinese_Taiwan_Stroke_CI_AS  Dictionary order, case-insensitive, for use with the 950 (Traditional Chinese) character set.
    198  Chinese_PRC_BIN              Binary order, for use with the 936 (Simplified Chinese) character set.
    199  Chinese_PRC_CI_AS            Dictionary order, case-insensitive, for use with the 936 (Simplified Chinese) character set.
    200  Japanese_CS_AS               Dictionary order, case-sensitive, for use with the 932 (Japanese) character set.
    201  Korean_Wansung_CS_AS         Dictionary order, case-sensitive, for use with the 949 (Korean) character set.
    202  Chinese_Taiwan_Stroke_CS_AS  Dictionary order, case-sensitive, for use with the 950 (Traditional Chinese) character set.
    203  Chinese_PRC_CS_AS            Dictionary order, case-sensitive, for use with the 936 (Simplified Chinese) character set.
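
    For readers unsure how "binary order" differs from a case-insensitive dictionary order, this Python sketch shows the general idea on plain Latin text. It is only an approximation: `str.casefold` stands in for case-insensitivity, and it does not reproduce the locale rules of any SQL Server collation. Both function names are hypothetical.

```python
def binary_order(items):
    """Sort by raw code point values, as a binary collation would."""
    return sorted(items)


def case_insensitive_order(items):
    """Sort ignoring case; a rough stand-in for a CI dictionary collation."""
    return sorted(items, key=str.casefold)


words = ["banana", "Apple", "apple", "Banana"]

print(binary_order(words))            # ['Apple', 'Banana', 'apple', 'banana']
print(case_insensitive_order(words))  # ['Apple', 'apple', 'banana', 'Banana']
```

    In binary order every uppercase letter sorts before every lowercase one; in the case-insensitive order, words that differ only by case are treated as equal and keep their input order.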

    Kind regards,

    Gift Peddie

     

     

     

