XML Parsing - Unable to switch encodiing using utf-8 when casting from NVARCHAR

  • Hi there

    I have a piece of code which takes an input parameter of NVARCHAR(Max) and then attempts to cast it to XML as follows

    declare @DataSheetXML NVARCHAR(max)

    declare @DataSheetXML2 xml

    set @DataSheetXML =

    N'<?xml version="1.0" encoding="utf-8"?>

    <Datasheet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">

    </Datasheet>'

    set @DataSheetXML2= CONVERT(XML,CONVERT(NVARCHAR(MAX),@DataSheetXML))

    This fails because I get the following error message:

    Msg 9402, Level 16, State 1, Line 14

    XML parsing: line 1, character 38, unable to switch the encoding

    Is there a way that I can cast @DataSheetXML to @DataSheetXML2, ie from NVARCHAR(MAX) to XML?

  • Was doing a little bit of random testing on this and I think the problem is that utf-8 is going to be VARCHAR.  utf-16 would be NVARCHAR.

    This is easy to test by taking your SET statement and changing the second convert from NVARCHAR(MAX) to VARCHAR(MAX).  Alternately, you can change your encoding to utf-16 in the XML.

    Now, to handle this in coded format, I think you would need to have an IF statement.  Something along the lines of this:

    declare @DataSheetXML NVARCHAR(max)

    declare @DataSheetXML2 xml

    set @DataSheetXML =

    N'<?xml version="1.0" encoding="utf-8"?>

    <Datasheet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">

    </Datasheet>'

    IF (SELECT CHARINDEX('utf-8',@DataSheetXML)) = 0
    BEGIN
    SET @DataSheetXML2= CONVERT(XML,@DataSheetXML)
    END
    ELSE
    BEGIN
    SET @DataSheetXML2= CONVERT(XML,CONVERT(VARCHAR(MAX),@DataSheetXML))
    END

    if the string "utf-8" exists in your XML, then convert to VARCHAR(MAX), otherwise leave it as an NVARCHAR(MAX).  No need to CONVERT @DataSheetXML from NVARCHAR(MAX) to NVARCHAR(MAX) as that isn't doing anything.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thanks Brian...that worked perfectly for me

  • Not a problem.  I imagine there are other solutions as well.  If you KNOW you are wanting NVARCHAR for sure, you could also do a REPLACE on the utf-8 to be utf-16.  This is MAY break the XML though (shouldn't but never say never).  Or if you are ALWAYS going to be working with utf-8 XML, then having the first parameter as a VARCHAR would save some conversions.

    Something to note, there are a LOT of different encoding types supported by XML.  I found this list: https://www.iana.org/assignments/character-sets/character-sets.xhtml

    You can see there are a lot of encoding types.  My approach above ONLY handles utf-8 and casts that to VARCHAR first.  I am not certain if any of the other types (such as US-ASCII) would need to be converted to VARCHAR or if NVARCHAR can handle it.  I expect if you don't know what encoding(s) will be used, you will need to do trial and error, with my above method, on all different encoding types and there are a lot.

     

    As it is a large list to parse through if you needed to check all of them, I would take a different approach.  A safer approach would be something like:

    declare @DataSheetXML NVARCHAR(max)
    declare @DataSheetXML2 xml
    set @DataSheetXML =
    N'<?xml version="1.0" encoding="utf-8"?>

    <Datasheet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">

    </Datasheet>'
    BEGIN try
    SET @DataSheetXML2= CONVERT(XML,CONVERT(NVARCHAR(MAX),@DataSheetXML))
    END try
    BEGIN CATCH
    BEGIN TRY
    SET @DataSheetXML2= CONVERT(XML,CONVERT(VARCHAR(MAX),@DataSheetXML))
    END TRY
    BEGIN CATCH
    SET @DatasheetXML2=N'<?xml version="1.0" encoding="utf-16"?>
    <ERROR>ERROR ENCOUNTERED</ERROR>'
    END CATCH
    END CATCH
    IF CONVERT(NVARCHAR(MAX),@DatasheetXML2) LIKE '%<ERROR>%')
    BEGIN
    --HANDLE ERROR and exit
    END

    The above approach ends up using 2 try catch blocks which I generally try to avoid in SQL if I can, but I think this is going to be easier than trying to test and capture all possible encoding types.  Plus it handles the case where it can't convert to VARCHAR or NVARCHAR.  I left the explicit NVARCHAR conversion in even though we have the NVARCHAR at the start because to me it makes it clearer to see what we are doing.  That is, it increases readability for future developers.  It is not required.

    Plus, that last CONVERT for handling the error could be handled by parsing the XML rather than converting it.  Parsing will likely give better performance, but I expect it to be very minimal performance gain.  In this case, the XML is short, so converting it and doing a LIKE comparison on it should be quick and allows you to handle the error in the event some encoding type can't be converted to VARCHAR or NVARCHAR.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • SQL Server 2019 now supports UTF-8 you can check if you have UTF-8 collations by running the following query :

    SELECT Name, Description FROM fn_helpcollations() WHERE Name LIKE '%UTF8'

    Since UTF-8 must be stored in VARCHAR (not NVARCHAR) you will need two conversions :

    • Convert your NVARCHAR variable to VARCHAR
    • Convert resulting VARCHAR to one of the UTF-8 collations

    Here is an example where I've added a special character what would be considered an illegal character if you don't convert to UTF-8 first.

    DECLARE 
    @DataSheetXML nvarchar(max),
    @DataSheetXML2 xml

    SET @DataSheetXML = N'<?xml version="1.0" encoding="utf-8"?>
    <Datasheet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <Description>Service 5 étoiles</Description>
    </Datasheet>'

    DECLARE @hdoc int

    SET @DataSheetXML2 = cast(@DataSheetXML as varchar(max)) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    --SET @DataSheetXML2 = cast(@DataSheetXML as varchar(max)) --This would throw 'XML parsing: line 3, character 25, illegal xml character'

    EXEC sp_xml_prepareDocument @hdoc OUTPUT, @DataSheetXML2

    SELECT *
    FROM OPENXML(@hdoc, '/Datasheet', 2)
    WITH ([Description] varchar(max))

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply