Weird issue with unicode and pattern matching

  • Hiya,

    I've been trying to find and filter out data that can cause FOR XML to fail with the error:

    FOR XML could not serialize the data for node ?? because it contains a character (0x0006) which is not allowed in XML.

    I thought it would be simple, just identify the rows that don't match the XML spec http://www.w3.org/TR/REC-xml/#charsets and filter them out.

    However, the following doesn't work.

    select fieldname

    from tablename

    where fieldname

    like N'%[^'+nchar(0x9)+nchar(0xA)+nchar(0xD)+nchar(0x20)+N'-'+nchar(0xD7FF)+N']%'

    Not only does this fail to find the offending rows, when I checked by inverted it by taking out the "^" it still returns no rows.

    I'm at a loss as to why it's doing this.

    When I reduced 0xD7FF down to 0xFF it returns data, but I cannot add the other range back in without all rows not matching.

    I've experimented with values and found 0x02E9 was the highest I could go without all values vanishing. Even then, numbers lower than that caused a large variation in the number of rows returned.

    My Field is NVarchar(100) in SQL_Latin1_General_CP1_CI_AS.

    Thanks

    Ian.

  • I accidently posted in the wrong area, I've now re-posted over here:

    http://qa.sqlservercentral.com/Forums/Topic1672918-23-1.aspx

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply