Loading XML Data into SQL Server (SQL Spackle)

  • XML is not going to work with HTML as content unless you enclose it with CDATA tags, like so...

    <Content>

    <![CDATA[

    <div id="contentcenter">

    <p>Perform a <b>hardware</b> check with the utility to assure the quality of the system driver, etc., etc.</p>

    </div>

    ]]>

    </Content>

    Steve

  • Your sample mainly contains display information. Get the data in XML format (without the display information) and you're all set with Waynes article.

    Lutz,

    Thank you for that link on the difference btw xml and html. Exactly what I needed.

    There was actual data included in the xml file I was sent, but it was mixed in with all that other stuff. The person who sent me the file seemed pretty perplexed that I didn't know how to deal with it and load it into our database. With the comments I have received, I'm still not exactly sure.

  • Sorry for the delay guys:

    DECLARE @x XML

    SELECT @x= CONVERT(xml, BulkColumn, 2) FROM

    OPENROWSET(BULK 'C:\data.xml', SINGLE_BLOB) AS x

  • SQL-DBA (1/11/2011)


    XML is not going to work with HTML as content unless you enclose it with CDATA tags, like so...

    <Content>

    <![CDATA[

    <div id="contentcenter">

    <p>Perform a <b>hardware</b> check with the utility to assure the quality of the system driver, etc., etc.</p>

    </div>

    ]]>

    </Content>

    Steve

    Steve,

    Do you mean I need to go into the file and type in [CDATA[... where all HTML appears?

    Amy

  • weharrelson (1/11/2011)


    Sorry for the delay guys:

    DECLARE @x XML

    SELECT @x= CONVERT(xml, BulkColumn, 2) FROM

    OPENROWSET(BULK 'C:\data.xml', SINGLE_BLOB) AS x

    Ahh, I see. You loading the file in as I showed in the article, except that you're taking it straight to a variable. Then you're using that variable in sp_xml_preparedocument.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Amy.G (1/11/2011)


    Steve,

    Do you mean I need to go into the file and type in [CDATA[... where all HTML appears?

    Amy

    Well, run this little test:

    declare @xml1 XML, @xml2 XML;

    set @xml1 = '<Content>

    <![CDATA[

    <div id="contentcenter">

    <p>Perform a <b>hardware</b> check with the utility to assure the quality of the system driver, etc., etc.</p>

    </div>

    ]]>

    </Content>';

    set @xml2 = '<Content>

    <div id="contentcenter">

    <p>Perform a <b>hardware</b> check with the utility to assure the quality of the system driver, etc., etc.</p>

    </div>

    </Content>';

    select @xml1, @xml2;

    @xml1 has the <![CDATA[ ... ]]>, and @xml2 doesn't.

    Notice that in @xml2, that all of the HTML has become tokenized. From what I understand about HTML (very little), there are some tokens where the closing token is optional. If you have any of those in the XML file, SQL is probably going to throw an error.

    As is stands, since the HTML is tokenized, it might not get in the way of extracting the data. You'd have to test it to see. Basically, if it can be put into an XML datatype (either variable or table column), then you will probably be able to work with it without having to add the CDATA stuff.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Here's a short example to query data like you've posted (please note that I added a few end tags to change it into a valid xml structure as per SQL requirements):

    DECLARE @x XML

    SET @x='<p>

    <div id="content">

    <a name="ada-content" />

    <div id="browseBox">

    <div class="roundedBox_top-left" />

    <div class="roundedBox_top-right" />

    <div class="licenseTitle">

    <h1>

    <span id="_SE_FLD" _SE_FLD="tcm:Content/custom:Content/custom:Title[1]">Aboveground Storage Tank (AST) Permits (>1 Million Gallons Total)</span>

    </h1>

    <div class="licenseDescription">

    <span id="_SE_FLD" _SE_FLD="tcm:Content/custom:Content/custom:Description[1]">Minnesota Pollution Control Agency (MPCA)</span>

    </div>

    </div>

    <div class="roundedBox_bottom-buff" />

    </div>

    <div class="roundedBox_bottom-left" />

    <div class="roundedBox_bottom-right" />

    <div id="licenseDetail">

    <div id="licenseRightColumn">

    <h3 style="margin:0;line-height:100%;">Agency contact information:</h3>

    </div>

    </div>

    </div>

    </p>'

    ;WITH cte AS

    (

    SELECT @x.query('//span') col

    )

    SELECT c.value('.','varchar(200)') span_result

    FROM cte

    CROSS APPLY col.nodes('span') T(c)

    /* result set

    Aboveground Storage Tank (AST) Permits (>1 Million Gallons Total)

    Minnesota Pollution Control Agency (MPCA)

    */



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • LutzM (1/11/2011)


    Here's a short example to query data like you've posted

    I'm really impressed that out of that whole mess you were able to identify the only actual piece of data needed. I have been practicing and got the license title like you did, but then just as I felt I had it licked, a snag:

    DECLARE @y XML

    SET @y = '

    <h1>

    <p xmlns:tcm="http://www.tridion.com/ContentManager/5.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/1999/xhtml">

    Facilities that have more than one million gallons capacity must obtain an individual permit from the MPCA according to Minnesota Rules Chapter <a href="http://www.revisor.leg.state.mn.us/arule/7001/">7001</a>.4200-4250.

    </br>

    </br>

    Owners of Aboveground Storage Tanks larger than 1,100 gallons must notify the Minnesota Pollution Control Agency (MPCA) of the existence of these tanks. In addition, tank owners are required to notify the MPCA of change of product or change of status of ASTs.

    </h1>

    '

    SELECT

    y.data.value('p[1]', 'varchar(500)') as Fees

    from @y.nodes('h1') y(data);

    I think the "xmlns"tcm..." is the problem, b/c when I delete that part, it works fine. But going through the whole file deleting would go against the whole point of this. Do you know a way to ignore it?

    Looking back, I guess I was hoping someone would've said "whoever sent you that file is an idiot, it's impossible to read!". Now I'm getting obsessed with it.

    Amy

  • Ok, then we'd need to add the namespace declaration (result as above).

    Regarding your comment about the person sending you the file, here's what I'd say:

    Whoever sent you that file should remember that SQL Server is still a RDBMS and not a HTML garbage cleaner. But it can do that, too, if there's absolutely nothing else to do.

    But, if all you're asking SQL Server to do is to deal with this kind of stuff, you should consider getting a different tool... 😉

    The following code is like using a hammer to drive in a screw. :pinch: So, it's for demonstration purposes only, NOT a recommended solution! The recommended solution would be to get clean XML data.

    DECLARE @x XML

    SET @x=' <p xmlns:tcm="http://www.tridion.com/ContentManager/5.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/1999/xhtml">

    <div id="content">

    <a name="ada-content" />

    <div id="browseBox">

    <div class="roundedBox_top-left" />

    <div class="roundedBox_top-right" />

    <div class="licenseTitle">

    <h1>

    <span id="_SE_FLD" _SE_FLD="tcm:Content/custom:Content/custom:Title[1]">Aboveground Storage Tank (AST) Permits (>1 Million Gallons Total)</span>

    </h1>

    <div class="licenseDescription">

    <span id="_SE_FLD" _SE_FLD="tcm:Content/custom:Content/custom:Description[1]">Minnesota Pollution Control Agency (MPCA)</span>

    </div>

    </div>

    <div class="roundedBox_bottom-buff" />

    </div>

    <div class="roundedBox_bottom-left" />

    <div class="roundedBox_bottom-right" />

    <div id="licenseDetail">

    <div id="licenseRightColumn">

    <h3 style="margin:0;line-height:100%;">Agency contact information:</h3>

    </div>

    </div>

    </div>

    </p>'

    ;WITH cte AS

    (

    SELECT @x.query('

    declare default element namespace "http://www.w3.org/1999/xhtml";

    //span') col

    )

    SELECT c.value('.','varchar(200)') span_result

    FROM cte

    CROSS APPLY col.nodes('declare default element namespace "http://www.w3.org/1999/xhtml"; span') T(c)



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • The following code is like using a hammer to drive in a screw. :pinch: So, it's for demonstration purposes only, NOT a recommended solution! The recommended solution would be to get clean XML data.

    I'm going to consider this permission to let it go and do other work.

    And thanks for that article. It has proved very helpful.

    Amy

  • Amy.G (1/11/2011)


    ...

    I'm going to consider this permission to let it go and do other work.

    And thanks for that article. It has proved very helpful.

    Amy

    I'm not sure if a forum post would qualify as permission to do other work. ;-):-D

    But I'm glad I've been of some kind of help though. 🙂



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Hello all,

    Well finally what we concluded then cannot we import the complex XML data file or what ?.

  • Anil Maharjan (1/11/2011)


    Hello all,

    Well finally what we concluded then cannot we import the complex XML data file or what ?.

    I think that the only conclusions we can draw are:

    1. If the input is well-formed XML then we can import it using the given method.

    2. HTML is NOT, in general, well-formed XML

    3. (Contentious.) Well-formed XML elements can only contain HTML data if they are within a CDATA block.

    Kelsey Thornton
    MBCS CITP

  • Supose you had to load 1 or 2 Gb of Xml data from one file into 2 tables. What you'd consider the best choice, OpenRowSet or BCP?

  • brito.santos (2/7/2011)


    Supose you had to load 1 or 2 Gb of Xml data from one file into 2 tables. What you'd consider the best choice, OpenRowSet or BCP?

    It depends. But details should be discussed in a separate thread... 😉



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Viewing 15 posts - 31 through 45 (of 60 total)

You must be logged in to reply to this topic. Login to reply