Free form text parsing suggestions

  • Hi Everyone,

    I am extracting data from a SharePoint list object and one of the columns that I am pulling from SharePoint has free form text inside various <div> and/or <p> tags. I have the data extracted from SharePoint and stored in a SQL Server 2012 table and I've gotten the basic string parsing working but I can see that my design is a bit problematic and I am trying to see if there is a way to re-write this so that the SQL statement can handle changing amounts of data in the future. I will not know ahead of time if the data will have 1 or 100 <p> tags so I figured I'd reach out for some guidance.

    Right now, in the data that I have to test against I have the following 4 patterns:

    Text data between 2 div tags:

    <div class="ExternalClassFB2223956DB545BC820BD55A20925FD7">Estimated measurements of all braid &  sleeving. Cut to length. Applied 2 layers of braid as required per drawing and sleeving. </div>

    Text data inside a single paragraph opening/closing tag between the div tags:

    <div class="ExternalClassF7C7EB4C8ED147B9AC7EDEEE90B9E6EE"><p>​sn 150286-01-07, -08, -09, -13, -14, -15</p></div>

    Text data inside 2 different paragraph opening/closing tags between the div tags:

    <div class="ExternalClass6C6D6B8FA7CB43218E63A65B555A1A91"><p>​sn changed to150078-02-1</p><p>was325-14-299-001</p></div>

    Text data inside 3 different paragraph opening/closing tags between the div tags:

    <div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>​wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p></div>

    The case statement that I wrote to parse this out is here (I'm also using Cade Bryant's dbo.charindex2 function http://qa.sqlservercentral.com/scripts/Miscellaneous/30497/):

    select CASE

    WHEN len(notes) - len(replace(notes, '<p>', '')) = 0 --No <p> tags in the Notes column

    THEN substring(notes, (dbo.charindex2('">', notes, 1) + 2), (dbo.charindex2('</div>', notes, 1) - (dbo.charindex2('">', Notes, 1) + 2)))

    WHEN len(notes) - len(replace(notes, '<p>', '')) = 3 --1 <p> tag in the Notes column

    THEN substring(notes, (dbo.charindex2('<p>', notes, 1) + 3), (dbo.charindex2('</p>', notes, 1) - (dbo.charindex2('<p>', notes, 1) + 3)))

    WHEN len(notes) - len(replace(notes, '<p>', '')) = 6 --2 <p> tags in the Notes column

    THEN substring(notes, (dbo.charindex2('<p>', notes, 1) + 3), (dbo.charindex2('</p>', notes, 1) - (dbo.charindex2('<p>', notes, 1) + 3))) + ' ' + substring(notes, (dbo.charindex2('<p>', notes, 2) + 3), (dbo.charindex2('</p>', notes, 2) - (dbo.charindex2('<p>', notes, 2) + 3)))

    WHEN len(notes) - len(replace(notes, '<p>', '')) = 9 --3 <p> tags in the Notes column

    THEN substring(notes, (dbo.charindex2('<p>', notes, 1) + 3), (dbo.charindex2('</p>', notes, 1) - (dbo.charindex2('<p>', notes, 1) + 3))) + ' ' + substring(notes, (dbo.charindex2('<p>', notes, 2) + 3), (dbo.charindex2('</p>', notes, 2) - (dbo.charindex2('<p>', notes, 2) + 3))) + ' ' + substring(notes, (dbo.charindex2('<p>', notes, 3) + 3), (dbo.charindex2('</p>', notes, 3) - (dbo.charindex2('<p>', notes, 3) + 3)))

    ELSE NULL --unhandled

    END parsed_notes

    from stg.SP_ProdControl_Time

    where notes is not null

    The above SQL works and successfully parses out the free form entry text that is placed into the SharePoint text box but the maintenance of this script is going to be a pain in the neck. I'm certain there is a better/more efficient way of accomplishing this task and I wanted to see if anyone had any suggestions.

    Thanks for taking the time to read this.

    Josh

  • HTML is a subset of XML, it's probably more efficient to treat it as XML and use xquery.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Yikes. Here's an alternative that should perform better for you:

    select STUFF(

    cast((select ' ' + SUBSTRING(spt.notes, t.tally + 3, CHARINDEX('</p', spt.notes, t.tally + 3) - (t.tally + 3))

    from (

    select CHARINDEX('<p>', spt.notes) as first_p_tag

    ) as ca1

    inner join dbo.tally t on first_p_tag > 0 and t.tally >= first_p_tag and SUBSTRING(spt.notes, t.tally, 3) = '<p>'

    for xml path('')

    ) as varchar(max))

    , 1, 1, '')

    from ( --stg.SP_ProdControl_Time

    select '<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>?wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p></div>' as notes union all

    select '<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>?wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p><p>tag4blah</p><p>tag5blahblah</p><p>tag6blahblahblah</p></div>'

    ) spt

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Thank you both for the replies.

    Scott, can you explain dbo.tally and t.tally. I'm not sure I understand what they are or what I need to do to create that table. Sorry if this is a stupid question.

  • jguffey (9/30/2015)


    Thank you both for the replies.

    Scott, can you explain dbo.tally and t.tally. I'm not sure I understand what they are or what I need to do to create that table. Sorry if this is a stupid question.

    No, sorry, my bad. I meant to add something about it but didn't.

    A "standard tally" table is just a table of sequential numbers, from 0 to however high you want to go; 1 million rows is typical. It's really remarkable how useful such a simple table can be for many SQL tasks.

    I was going to post code to create one but a filter at work thinks it matches a "sql injection" pattern.

    At any rate, do a Google search for creating a permanent tally table using CROSS JOINs. I believe Itzik Ben-Gan first demo'd the technique, but anyone can help you create one.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • ScottPletcher (9/30/2015)


    Yikes. Here's an alternative that should perform better for you:

    Part of the reason that XML string splitters are inefficient is converting character separated strings to a format that can be interpreted as XML. Since this string is already in XML format, that overhead is not present.

    I haven't had a chance to do a comparison of the two methods.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (9/30/2015)


    ScottPletcher (9/30/2015)


    Yikes. Here's an alternative that should perform better for you:

    Part of the reason that XML string splitters are inefficient is converting character separated strings to a format that can be interpreted as XML. Since this string is already in XML format, that overhead is not present.

    I haven't had a chance to do a comparison of the two methods.

    Drew

    Sorry, I meant better than the original code, not necessarily better than the XML method. I don't really know how the XML method might perform.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • So I created the tally table and executed your code (thanks for sharing/educating).

    I'm a little confused as to why the leading ? character is present in the results but I'll keep digging around with the script and see if I can figure it out. Just wanted to post some feedback.

    Query:

    SELECT STUFF(cast((

    SELECT ' ' + SUBSTRING(spt.notes, t.N + 3, CHARINDEX('</p', spt.notes, t.n + 3) - (t.n + 3))

    FROM (

    SELECT CHARINDEX('<p>', spt.notes) AS first_p_tag

    ) AS ca1

    INNER JOIN dbo.Numbers t ON first_p_tag > 0

    AND t.n >= first_p_tag

    AND SUBSTRING(spt.notes, t.n, 3) = '<p>'

    FOR XML path('')

    ) AS VARCHAR(max)), 1, 1, '')

    FROM (

    --stg.SP_ProdControl_Time

    SELECT '<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>?wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p></div>' AS notes

    UNION ALL

    SELECT '<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>?wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p><p>tag4blah</p><p>tag5blahblah</p><p>tag6blahblahblah</p></div>'

    ) spt

    Query Results:

    (No column name)

    ?wrap 4 harnesses with nylon tape 1.75 hrs wrap 4 harnesses with foil 4.25 hrs apply 2 layers of braid 2 hrs.

    ?wrap 4 harnesses with nylon tape 1.75 hrs wrap 4 harnesses with foil 4.25 hrs apply 2 layers of braid 2 hrs. tag4blah tag5blahblah tag6blahblahblah

  • I wondered about that too, but I noticed there's an odd character before "wrap" that looks like a blank, but really isn't. When you copy it in the SSMS editor, it looks like a (red) dash, but it returns to a blank if you try to put quotes around it :crazy:

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • jguffey (9/30/2015)


    So I created the tally table and executed your code (thanks for sharing/educating).

    [/code]

    Make sure that the only column in the Tally Table is NOT NULL and is the PK of the table so that a unique clustered index is formed on it. Otherwise, it'll be quite slow with table scans all the time.

    Here's an introductory article on what Tally Tables are and how they can be used to replace certain types of loops.

    http://qa.sqlservercentral.com/articles/T-SQL/62867/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply