Replacing String Between Two HTML Tags

  • Hi all,

    I could use some assistance with this one as I can't get anything working correctly for it at the moment.

    I'm having to move some HTML formatted data between two applications and they have a few differing quirks regarding how they handle HTML formatting.

    The specific quirk that is causing me issues is how the applications handle the bold html tags <b> and </b> differently.

    The source application allows a line break (<br/>) between <b> and </b> tags and will treat everything until  to closing </b> tag as being bold

    The new application requires that the bold tag is closed off before the line break and opened again afterwards

    So the existing data might look like this:

    <b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>

    Where the new application will require it to look like this to display correctly:

    <b>This text is bold</b> This text is not <b> This text is bold</b><br/><b>as is this text</b>

    I've attempted to solve this by finding out the position of the first opening <b> tag, then getting the position of the next closing </b> tag and using these variables to run a replace on the text contained within them:

    SET @HTMLText = REPLACE(@HTMLText, SUBSTRING(@HTMLText, @Start, @Chars), REPLACE(SUBSTRING(@HTMLText, @Start, @Chars), '<br/>', '</b><br/><b>'))

    I then update the start and end variables to find the next opening <b> tag and loop through the rest of the string, which can be a fairly large chunk of text. However, what I've come up with isn't working properly.

    Has anyone got an idea for a good solution to this one?

    Thanks for taking the time to read this.

    Paul

     

  • Could do this with recursion, from right to left, but for now I'll just do a simple cursor loop.  Hopefully that will perform well enough for what you need.  Naturally you'll need to additional data samples to fully test out its logic.

    IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
    DROP TABLE #data;
    CREATE TABLE #data (
    id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    string nvarchar(max) NULL
    );
    INSERT INTO #data VALUES
    ('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>')


    DECLARE @byte_b_start int
    DECLARE @byte_b_end int
    DECLARE @byte_br_start int
    DECLARE @id int
    DECLARE @string nvarchar(max)
    DECLARE @string_was_changed bit
    DECLARE @stuff_length int

    DECLARE cursor_strings CURSOR LOCAL FAST_FORWARD FOR
    SELECT id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    SELECT 'before', id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    OPEN cursor_strings
    WHILE 1 = 1
    BEGIN
    FETCH NEXT FROM cursor_strings INTO @id, @string
    IF @@FETCH_STATUS <> 0
    IF @@FETCH_STATUS = -2
    CONTINUE
    ELSE
    BREAK;
    SET @byte_b_start = 1
    SET @string_was_changed = 0
    WHILE 2 = 2
    BEGIN
    --SELECT 'before', @string
    SET @byte_b_start = CHARINDEX('<b>', @string, @byte_b_start)
    IF @byte_b_start = 0
    BREAK;
    SET @byte_b_end = CHARINDEX('</b>', @string, @byte_b_start + 4)
    SET @byte_br_start = CHARINDEX('<br', @string, @byte_b_start + 4)
    IF @byte_br_start = 0
    BREAK;
    IF @byte_b_start > 0 AND @byte_br_start < @byte_b_end
    BEGIN
    IF SUBSTRING(@string, @byte_br_start + 3, 1) = '/'
    SET @stuff_length = 5
    ELSE
    SET @stuff_length = 4
    SET @string = STUFF(@string, @byte_br_start, @stuff_length,
    '</b>' + SUBSTRING(@string, @byte_br_start, @stuff_length) + '<b>')
    SET @string_was_changed = 1
    SET @byte_b_start = @byte_b_end + 12
    END /*IF*/
    ELSE
    BEGIN
    SET @byte_b_start = @byte_b_end + 3
    END /*ELSE*/
    --SELECT 'after_', @string, @byte_b_start, @byte_b_end, @byte_br_start
    END /*WHILE 2*/
    IF @string_was_changed > 0
    UPDATE #data
    SET string = @string
    WHERE id = @id

    END /*WHILE 1*/
    DEALLOCATE cursor_strings

    SELECT 'after_', id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • Hi Scott,

    Thanks for the quick reply - I'm just finishing for the day and I'll pick this up in the morning, at first glance it looks promising.

    Paul

  • ScottPletcher wrote:

    Could do this with recursion, from right to left, but for now I'll just do a simple cursor loop.  Hopefully that will perform well enough for what you need.  Naturally you'll need to additional data samples to fully test out its logic.

    IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
    DROP TABLE #data;
    CREATE TABLE #data (
    id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    string nvarchar(max) NULL
    );
    INSERT INTO #data VALUES
    ('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>')


    DECLARE @byte_b_start int
    DECLARE @byte_b_end int
    DECLARE @byte_br_start int
    DECLARE @id int
    DECLARE @string nvarchar(max)
    DECLARE @string_was_changed bit
    DECLARE @stuff_length int

    DECLARE cursor_strings CURSOR LOCAL FAST_FORWARD FOR
    SELECT id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    SELECT 'before', id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    OPEN cursor_strings
    WHILE 1 = 1
    BEGIN
    FETCH NEXT FROM cursor_strings INTO @id, @string
    IF @@FETCH_STATUS <> 0
    IF @@FETCH_STATUS = -2
    CONTINUE
    ELSE
    BREAK;
    SET @byte_b_start = 1
    SET @string_was_changed = 0
    WHILE 2 = 2
    BEGIN
    --SELECT 'before', @string
    SET @byte_b_start = CHARINDEX('<b>', @string, @byte_b_start)
    IF @byte_b_start = 0
    BREAK;
    SET @byte_b_end = CHARINDEX('</b>', @string, @byte_b_start + 4)
    SET @byte_br_start = CHARINDEX('<br', @string, @byte_b_start + 4)
    IF @byte_br_start = 0
    BREAK;
    IF @byte_b_start > 0 AND @byte_br_start < @byte_b_end
    BEGIN
    IF SUBSTRING(@string, @byte_br_start + 3, 1) = '/'
    SET @stuff_length = 5
    ELSE
    SET @stuff_length = 4
    SET @string = STUFF(@string, @byte_br_start, @stuff_length,
    '</b>' + SUBSTRING(@string, @byte_br_start, @stuff_length) + '<b>')
    SET @string_was_changed = 1
    SET @byte_b_start = @byte_b_end + 12
    END /*IF*/
    ELSE
    BEGIN
    SET @byte_b_start = @byte_b_end + 3
    END /*ELSE*/
    --SELECT 'after_', @string, @byte_b_start, @byte_b_end, @byte_br_start
    END /*WHILE 2*/
    IF @string_was_changed > 0
    UPDATE #data
    SET string = @string
    WHERE id = @id

    END /*WHILE 1*/
    DEALLOCATE cursor_strings

    SELECT 'after_', id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    Hi Scott,

    I couldn't resist trying it - The only thing it doesn't look to cater for is if there are more than a single <br/> tag in-between a single set of <b> </b> tags

    So for row 2:

    INSERT INTO #data VALUES
    ('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>'),
    ('<b>This text is bold<br/>this text is also bold as is this text<br/>and this text is still bold as well</b>This text is not bold.<b>This text is bold<br/> as is this text</b>')

    you end up with the following:

    <b>This text is bold</b><br/><b> this text is also bold as is this text<br/>and this text is still bold as well</b>This text is not bold.<b>This text is bold</b><br/><b>as is this text</b>

    Obviously you can just give the cursor another pass over the data, but I'm sure there is a more elegant solution.

    I'm really signing off now.

    Thanks again

    Paul

  • I think this is right.  Naturally you'll need to confirm for yourself, as I don't have any additional test data.

    IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
    DROP TABLE #data;
    CREATE TABLE #data (
    id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    string nvarchar(max) NULL
    );
    INSERT INTO #data VALUES
    ('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>'),
    ('<b>This text is bold<br/>this text is also bold as is this text<br/>and this text is still bold as well</b>This text is not bold.<b>This text is bold<br/> as is this text</b>')

    DECLARE @byte_b_start int
    DECLARE @byte_b_end int
    DECLARE @byte_b_offset int
    DECLARE @byte_br_start int
    DECLARE @id int
    DECLARE @string nvarchar(max)
    DECLARE @string_was_changed bit
    DECLARE @stuff_length int

    DECLARE cursor_strings CURSOR LOCAL FAST_FORWARD FOR
    SELECT id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    SELECT 'before', id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    OPEN cursor_strings
    WHILE 1 = 1
    BEGIN
    FETCH NEXT FROM cursor_strings INTO @id, @string
    IF @@FETCH_STATUS <> 0
    IF @@FETCH_STATUS = -2
    CONTINUE
    ELSE
    BREAK;
    SET @byte_b_start = 1
    SET @byte_b_offset = 4
    SET @string_was_changed = 0
    WHILE 2 = 2
    BEGIN
    --SELECT 'before', @string
    SET @byte_b_start = CHARINDEX('<b>', @string, @byte_b_start)
    IF @byte_b_start = 0
    BREAK;
    SET @byte_b_end = CHARINDEX('</b>', @string, @byte_b_start + @byte_b_offset)
    WHILE 3 = 3
    BEGIN
    SET @byte_br_start = CHARINDEX('<br', @string, @byte_b_start + @byte_b_offset)
    IF @byte_br_start = 0
    BREAK;
    IF @byte_br_start < @byte_b_end
    BEGIN
    IF SUBSTRING(@string, @byte_br_start + 3, 1) = '/'
    SET @stuff_length = 5
    ELSE
    SET @stuff_length = 4
    SET @string = STUFF(@string, @byte_br_start, @stuff_length,
    '</b>' + SUBSTRING(@string, @byte_br_start, @stuff_length) + '<b>')
    SET @byte_b_offset = @byte_b_offset + 7
    SET @byte_b_start = @byte_br_start + 12
    SET @string_was_changed = 1
    END /*IF*/
    ELSE
    BREAK;
    END /*WHILE 3*/
    SET @byte_b_start = @byte_b_end + @byte_b_offset
    --SELECT 'after_', @string, @byte_b_start, @byte_b_end, @byte_br_start
    END /*WHILE 2*/
    IF @string_was_changed > 0
    UPDATE #data
    SET string = @string
    WHERE id = @id

    END /*WHILE 1*/
    DEALLOCATE cursor_strings

    SELECT 'after_', id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • ScottPletcher wrote:

    I think this is right.  Naturally you'll need to confirm for yourself, as I don't have any additional test data.

    IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
    DROP TABLE #data;
    CREATE TABLE #data (
    id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    string nvarchar(max) NULL
    );
    INSERT INTO #data VALUES
    ('<b>This text is bold</b> This text is not <b> This text is bold<br/>as is this text</b>'),
    ('<b>This text is bold<br/>this text is also bold as is this text<br/>and this text is still bold as well</b>This text is not bold.<b>This text is bold<br/> as is this text</b>')

    DECLARE @byte_b_start int
    DECLARE @byte_b_end int
    DECLARE @byte_b_offset int
    DECLARE @byte_br_start int
    DECLARE @id int
    DECLARE @string nvarchar(max)
    DECLARE @string_was_changed bit
    DECLARE @stuff_length int

    DECLARE cursor_strings CURSOR LOCAL FAST_FORWARD FOR
    SELECT id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    SELECT 'before', id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    OPEN cursor_strings
    WHILE 1 = 1
    BEGIN
    FETCH NEXT FROM cursor_strings INTO @id, @string
    IF @@FETCH_STATUS <> 0
    IF @@FETCH_STATUS = -2
    CONTINUE
    ELSE
    BREAK;
    SET @byte_b_start = 1
    SET @byte_b_offset = 4
    SET @string_was_changed = 0
    WHILE 2 = 2
    BEGIN
    --SELECT 'before', @string
    SET @byte_b_start = CHARINDEX('<b>', @string, @byte_b_start)
    IF @byte_b_start = 0
    BREAK;
    SET @byte_b_end = CHARINDEX('</b>', @string, @byte_b_start + @byte_b_offset)
    WHILE 3 = 3
    BEGIN
    SET @byte_br_start = CHARINDEX('<br', @string, @byte_b_start + @byte_b_offset)
    IF @byte_br_start = 0
    BREAK;
    IF @byte_br_start < @byte_b_end
    BEGIN
    IF SUBSTRING(@string, @byte_br_start + 3, 1) = '/'
    SET @stuff_length = 5
    ELSE
    SET @stuff_length = 4
    SET @string = STUFF(@string, @byte_br_start, @stuff_length,
    '</b>' + SUBSTRING(@string, @byte_br_start, @stuff_length) + '<b>')
    SET @byte_b_offset = @byte_b_offset + 7
    SET @byte_b_start = @byte_br_start + 12
    SET @string_was_changed = 1
    END /*IF*/
    ELSE
    BREAK;
    END /*WHILE 3*/
    SET @byte_b_start = @byte_b_end + @byte_b_offset
    --SELECT 'after_', @string, @byte_b_start, @byte_b_end, @byte_br_start
    END /*WHILE 2*/
    IF @string_was_changed > 0
    UPDATE #data
    SET string = @string
    WHERE id = @id

    END /*WHILE 1*/
    DEALLOCATE cursor_strings

    SELECT 'after_', id, string
    FROM #data
    WHERE string LIKE '%<b>%<br%</b%'

    Hi Scott,

    The updated cursor does work as expected with the very limited test data I gave you, however when running it against some real data it ends up falling into an endless loop on certain records (the original version does not suffer from this).

    I've tracked down one of the offending records and anonymised the data, so the following will never return a result from the second version of the cursor.

    IF OBJECT_ID('tempdb.dbo.#data') IS NOT NULL
    DROP TABLE #data;
    CREATE TABLE #data (
    id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    string nvarchar(max) NULL
    );
    INSERT INTO #data VALUES
    ('<b>Test text Test text &nbsp;Test text Test text Test text Test text Test text Test text Test text Test text &nbsp; Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/>&nbsp;<br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text''s<br/>&nbsp;<br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>--------------------------------------------------------------<br/>Test text Test text <br/>&nbsp;<br/>Test text Test text -&nbsp; Test text Test text </b>- Test text Test text <br/><br/><b>Test text Test text </b>Test text Test text Test text Test text Test text Test text Test text Test text .<br/><br/><b>Test text Test text </b> Test text Test text <br/><br/><b>Test text Test text </b>.&nbsp;<br/><br/><b>Test text Test text </b><br/><br/><b>Test text Test text </b>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/><b>Test text Test text </b>) -&nbsp;Test text Test text <br/><br/><b>10/01/14 Test text Test text </b>&nbsp;- Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>18/12Test text Test text &nbsp;Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text &nbsp;Test text Test text <br/>&nbsp;<br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/>&nbsp;<br/>Test text Test text <br/><br/>12/02Test text Test text <br/>Test text Test text <br/>Test text Test text &nbsp; Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>16/03/2016 Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/>&nbsp;<br/><br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>&nbsp;<br/>Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text Test text <br/>&nbsp;<br/><br/>&nbsp;<br/><b><u>Test text Test text </u></b><br/>Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text Test text Test text Test text Test text <br/>Test text Test text Test text Test text &nbsp;&nbsp;<br/>Test text Test text Test text Test text Test text Test text <br/>Test text Test text &nbsp;Test text Test text <br/><br/>')

    I've not worked out what part is causing the issue yet...

    Thanks

    Paul

     

  • ...hope you can work it out.

    You really need to provide truly representative sample data if you want a full solution.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • I did this exercise a few years ago. There were many, many combinations that caused problems, plus we had cases where the html markups were quoted in the text and had to be excluded. It did seem easy up front, but applying solutions to real data just kept throwing issues.

    Good luck!

  • ScottPletcher wrote:

    ...hope you can work it out.

    You really need to provide truly representative sample data if you want a full solution.

    With the sample data thing, with this one it's incredibly difficult to do - We are talking about nearly half a million user input rows, with an average of 25,500 characters each and there is no consistent rhyme or reason for how they have decided to apply formatting.

    Thanks again for your help, it's put me on the path to getting this last issue sorted.

    Paul

  • jschmidt 17654 wrote:

    I did this exercise a few years ago. There were many, many combinations that caused problems, plus we had cases where the html markups were quoted in the text and had to be excluded. It did seem easy up front, but applying solutions to real data just kept throwing issues.

    Good luck!

    I feel your pain 🙂

    It's been an interesting one for sure but I'm just about there now I think...

  • Paul McCaffrey wrote:

    ScottPletcher wrote:

    ...hope you can work it out.

    You really need to provide truly representative sample data if you want a full solution.

    With the sample data thing, with this one it's incredibly difficult to do - We are talking about nearly half a million user input rows, with an average of 25,500 characters each and there is no consistent rhyme or reason for how they have decided to apply formatting.

    Thanks again for your help, it's put me on the path to getting this last issue sorted.

    Paul

    Given that it's hard for you to provide sample data, when you already have half a million rows to choose from, imagine how difficult it is for me, when I have no sample data rows available.

    I would think finding the rows with the most <br/> tags in them would have had a good chance of finding the more difficult rows to do, viz.: SELECT TOP (5) * FROM .... ORDER BY LEN(string) - LEN(REPLACE(string, '<br/>', '')) DESC

    Again, good luck with this, I hope you work it all out successfully.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply