Text Parsing TSQL code

  • I need help to create a script that would perform text parsing.

    I have a text (words separated by spaces, probably by more than one space and with any different symbols). I need to extract all words from the text and insert each individual word in another table. 

    eg.

    TableA

    SentenceID   Sentence

    000001         How are you today?

    000002         What is your name?

    TableB

    SentenceID   Word

    000001         How

    000001         are

    000001         you

    000001         today

    000002         What

    000002         is

    000002         your

    000003         name

    Any help appreciated!

     

     

     

     

     

     

  • Best to do that type of work using a computer language that's good for parsing text. TSQL is not one of those languages.

    Vyas talks about many methods to parse comma demited parameter. That's similar to a space delimited sentence.

    http://vyaskn.tripod.com/passing_arrays_to_stored_procedures.htm

  • This is part of a SPROC I wrote for multi-word text matching procedure.  I use a temp table to hold the parsed words, which then I run through clean up filters.  Runs fairly quick.  Hope it helps....Jeff

    DECLARE @Position        INT              --Used for looping during word parsing

    DECLARE @Counter         INT             --Loop counter for parsing the word from string

    DECLARE @String         VARCHAR(100)       --Company name to be parsed into words

    DECLARE @WordEnd     INT             --String position of the end character of the word

    DECLARE @Weight        INT             --Word weighting factor

    DECLARE @WordCtr      INT             --Count of how many words were in the string

    --Create the word list table

    DECLARE @WordTable TABLE

     ( 

      Iden  INT       NULL,

      word  VARCHAR(30) NULL

    &nbsp

    SET @String = '<your string here' 

      --Clear the Word table of any entries

      --Reset all working variables to start parsing the new string

      SET @WordCtr = 1

      SET @Position = DATALENGTH(@Company)

      SET @wordend = DATALENGTH(@Company)

      WHILE @position >= 0 

       BEGIN

        IF (ASCII(SUBSTRING(@Company, @position, 1)) = 32) --Looks for the space character

         BEGIN

          INSERT @WordTable  (Iden, word) VALUES (@WordCtr, SUBSTRING(@Company, @position +1, (@wordEnd -@position) ) )

          SET @WordEnd = @Position - 1

          SET @Position = @Position - 1

          SET @WordCtr = @WordCtr +1

         END  --If character is a space

        ELSE  --Character is not a speace

         BEGIN

          IF @Position = 0  --We have parsed to the beginning of the string, so add the first word in string

           BEGIN

            INSERT @WordTable  (Iden, word) Values (@WordCtr, SUBSTRING(@Company, 0, @wordEnd+1) )

            SET @position = @position - 1

           END

          ELSE --Not to the beginning of the string yet

           BEGIN   

            SET @position = @position - 1

           END

         END  --Character is not a space

       END  --While Position >=0

    SELECT word FROM @WordTable

     

     

  • Jeff, this is awesome!  I have created a sproc from this to share with my team - really good stuff!

    [font="Courier New"]ZenDada[/font]

  • Cool, glad you found it useful.  I actually had to write it because I got tasked with comparing non-standard company names from two different CRM systems to get 'close matches' using a free-text search routine, what a pain...anyway have a great day....jd

  • Hi All,

    Thanks for all comments and to Jeff for enclosed SQL code. It is very good! Unfortunatly I could not used because of dead line at Friday (Sydney time zone) so I have been forced to create something similar to Jeff stuff!

    I wrote function:

     CREATE FUNCTION fnWord ( @SentenceID CHAR( 8 ),

                              @Sentence VARCHAR ( 100 ) ) 

        RETURNS @TableA TABLE ( SentenceID TEXT, Word TEXT )

     AS 

        BEGIN

           SET @Sentence =  @Sentence + ' '

           DECLARE @WordStart INT

           DECLARE @WordEnd INT

           SET @WordStart = 1

           SET @WordEnd = 1

           WHILE @WordStart < LEN( @Sentence )

              BEGIN

                 SET @WordStart = CHARINDEX( ' ', @Sentence, @WordStart )

                 INSERT @TableA

                    SELECT @SentenceID, CAST( SUBSTRING( @Sentence, @WordEnd, @WordStart - @WordEnd ) AS VARCHAR )

                 SET @WordEnd = @WordStart + 1

                 SET @WordStart = @WordStart + 1

              END

           RETURN

        END

    GO

    I tested function with:

    SELECT *

       FROM fnWord('00123483','What is your given name?')

    Result is:

    SentenceID Word

    00123483 What

    00123483 is

    00123483 your

    00123483 given

    00123483 name?

    (5 row(s) affected)

    To be able to apply function on data set (15,000,000 rows/ 150 words per sentence), I used CURSOR:

    SET NOCOUNT ON

    DECLARE @part varchar(8)

    DECLARE @ID VARCHAR(8000)

    DECLARE CTest CURSOR FOR

       SELECT part_no, words

          FROM #word

    OPEN CTest

    FETCH NEXT FROM CTest

       INTO @PART, @ID

    WHILE @@FETCH_STATUS = 0

       BEGIN

          INSERT INTO #words

             SELECT * FROM fnWord( @part, @ID )

          FETCH NEXT FROM CTest

             INTO @PART, @id

       END

    CLOSE CTest

    DEALLOCATE CTest

    It took 50 mins. For me it is better than I expected!!

    Another chelenge (hate cursors)!

    Can somebody change cursor (row level processing) to dataset level using 'INSERT'

    Thanks,

    Milovan

     

     

     

     

     

  • hey milovan

    iam not sure whether u have solved ur pblm.

    i did not understand ur question. cursor are for row level processing. if u could elaborate ur doubt i might try to clear it

    any way here is a proc that could parse a sentence and enter the parsed words into another table

     

    alter proc parsing

    as

    declare @sentencelength int

    declare @sentenceid int

    declare @sentence varchar(100)

    declare @count int

    declare @temp_str varchar(100)

    begin

    declare c1_parse cursor for select * from sentence

    open c1_parse

    fetch next from c1_parse into @sentenceid,@sentence

    while @@fetch_status = 0

    begin/*first while loop begins*/

    set @count=0;

    set @sentencelength=len(@sentence)

    set @temp_str=''

     while @count <= @sentencelength

      begin/*2nd while loop begins*/

       if (substring(@sentence,@count+1,1) = ' ' or @count=@sentencelength )

        begin/* if starts*/

          insert into words values(@sentenceid,@temp_str)

          set @temp_str=''      

        end/* if ends*/

       else

         begin/*else starts*/

          set @temp_str=@temp_str+substring(@sentence,@count+1,1)

         end /*else ends*/

       set @count=@count + 1; 

     end/*2nd while loop ends*/

     fetch next from c1_parse into @sentenceid,@sentence

    end/*end the first while loop*/

    close c1_parse

    deallocate c1_parse

    end/*end proc*/

     

    Rajiv.

  • Hi Rajiv

    Thanks for your code. I think that you misunderstand me!

    The problem is not how to insert row. As you can see I already created function fnWord that extract words from sentence and return table with results. I like this part of solution because it is very simple and stright forward.

    I do not like another part of solution that insert rows into final table #words by calling function for each sentence (each row of input table). I used cursor. You didn't tell nothing new in your example because you use CURSOR too.

    I'll be more clear with my question:

    Can you apply function on data set?? (using INSERT/SELECT structure instead of CURSOR)?

    If you can do it you will make my code more efficient!!

    Regards.

    Milovan

     

     

  • No,  what you want to do can't be done with SQL2000.  You would need to join your 'inpt table' to the output from the function. 

    Apparently the next release of SQL Server will support this however to we will just have to wait...

  • Hi Douglas,

    Thanks for your stright answer. It looks like I tried to create perpetual motion. Anyhow thanks for giving me peace of mind!

    Regards,

    Milovan

     

  • Hi folks,

    looking for a UDF that will parse a string, and address field in particular.

    say you have '123 any street'. I need each component separated.

    Any feedback or thoughts will be appreciated

  • Milovan,

    I noticed that all of the other solutions, although well written and functional, contain WHILE loops.  Inherently, WHILE loops are slower than setbased code especially if you have a lot of records and a lot of words in each "sentence".  So, I propose an alternate.

    You need to make a Tally or Numbers table which consists of nothing but a column of numbers and it has many, many uses.  Before I show you how to use it to solve your problem in a very high speed fashion, let me show you how to make a Tally table.  You could, of course, make it as a temp table but it will serve you much better as a permanent table...

    --===== Create and populate the Tally table

         -- The cross-join is intentional

     SELECT TOP 9999 IDENTITY(SMALLINT,1,1) AS N

       INTO dbo.Tally

       FROM Master.dbo.SYSCOLUMNS sc1,

            Master.dbo.SYSCOLUMNS sc2

    --===== Add a primary key to the Tally table for speed

      ALTER TABLE dbo.Tally

            ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)

    ...Ok... that wasn't so painful... heck, I didn't even use a loop to make the Tally table.  It creates a table that contains a column of numbers from 1 to 9999.  I use 9999 because I also use the Tally table as a very high speed 4 digit random number generator.  For example, the following code generates, say, 100 4 digit random numbers in a heartbeat...

     SELECT TOP 100 N

       FROM dbo.Tally

      WHERE N >= 1000

      ORDER BY NEWID()

    You can also use it to generate a set of dates... for example, if you wanted a list of the next 12 Fridays, you could do this...

     SELECT TOP 12 CONVERT(VARCHAR(10),GETDATE()+N,101)

       FROM dbo.Tally

      WHERE DATENAME(dw,GETDATE()+N) = 'Friday'

      ORDER BY GETDATE()+N

    ...again, notice that there are no loops in this and my fingers never left my hand. 

    Ok, now for your problem... I included some test code to make a (temp) table to hold your sample data and parse it as you requested... as someone suggested, you could easily turn it into a table UDF and use it in a FROM clause, but this will probably suffice...

    --===== If it exists, drop the test table

         IF OBJECT_ID('TempDB..#YourTable') IS NOT NULL

            DROP TABLE #YourTable

    --===== Create the a test table to contain the original posted data

         -- plus a couple of more

     CREATE TABLE #YourTable

            (

            SentenceID  INT NOT NULL,

            Sentence VARCHAR(8000)

            )

     INSERT INTO #YourTable (SentenceID,Sentence)

     SELECT 000001,'How are you today?' UNION ALL

     SELECT 000002,'What is your name?' UNION ALL

     SELECT 000003,'Now is the time for all good men to come to the aid of their country.'

    --===== Add a primary key to the test table just because it's the

         -- right thing to do.

      ALTER TABLE #YourTable

            ADD PRIMARY KEY CLUSTERED (SentenceID)

    --===== Now, split the "Sentence" column at the " " and display 1 per

         -- row along with it's SentenceID number...

     SELECT SentenceID,

            SUBSTRING(' '+y.Sentence+' ',N+1,CHARINDEX(' ',' '+y.Sentence+' ',N+1)-N-1) AS ParsedData

       FROM #YourTable y,

            dbo.Tally t

      WHERE t.N < LEN(' '+y.Sentence+' ')

        AND SUBSTRING(' '+y.Sentence+' ',N,1) =' '

      ORDER BY SentenceID

    You may run the above with impunity and repeatedly for testing purposes.  Here's the output as the code currently stands...

    SentenceID  ParsedData 

    ----------- ------------

    1           How

    1           are

    1           you

    1           today?

    2           What

    2           is

    2           your

    2           name?

    3           Now

    3           is

    3           the

    3           time

    3           for

    3           all

    3           good

    3           men

    3           to

    3           come

    3           to

    3           the

    3           aid

    3           of

    3           their

    3           country.

    Compared to the use of WHILE loops, Cursors, and other RBAR (my pet name for "Row By Agonzing Row"), the Tally table solution runs lightning quick.

    The Tally table can also be used for removing unwanted characters, converting things to proper case, doing all sorts of date tricks, and a lot more... all without the use of relatively slow RBAR methods.

    Any questions as to why you might need a Tally table?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I'm starting to get the hang of using these tally tables, but I'm a little foggy on the sequence of events (my experience up 'till now is in purely procedural languages). Could you, for instance, create a column that indicated the word number in addition to the sentance number, so that you would end up with:

    SentenceID  word position  ParsedData 

    ----------- -------------  ------------

    1           1              How

    1           2              are

    1           3              you

    1           4              today?

    2           1              What

    2           2              is

    2           3              your

    2           4              name?

    3           1              Now

    3           2              is

    3           3              the

    3           4              time

    3           5              for

    3           6              all

    3           7              good

    3           8              men

    3           9              to

    3           10             come

    3           11             to

    3           12             the

    3           13             aid

    3           14             of

    3           15             their

    3           16             country.

    I've been trying to do this for several days now, and just can't figure it out. Everything I try comes up with either nonsense or the position of the first letter of each word, or... a bunch of other things, but never what I need. My first thought was to use a separate tally table, but soon got confused by how the selects would be set up. Do you have any suggestions?

     


    Best Regards,

    Carl E. Campbell
    nyprehabmed.org

  • For some things, you don't need the tally table

    SELECT SentenceID, 

      datalength(Left(' '+y.Sentence+' ', N)) - datalength(replace(Left(' '+y.Sentence+' ', N), ' ', ''))InSentenceNo,

            SUBSTRING(' '+y.Sentence+' ',N+1,CHARINDEX(' ',' '+y.Sentence+' ',N+1)-N-1) AS ParsedData

       FROM #YourTable y,

            dbo.Tally t

      WHERE t.N < LEN(' '+y.Sentence+' ')

        AND SUBSTRING(' '+y.Sentence+' ',N,1) =' '

      ORDER BY SentenceID

    Cheers,

     


    * Noel

  • FROM #YourTable y,

            dbo.Tally t

    Ok... now I'm confused That is a Tally table in your code, isn't it?

    Althought the Jury is still out on the subject and for a different reason, I've started some mega row testing on Tally table functions and it turns out that a WHILE loop may actually perform better than a Tally table on function like these ... I have more testing to do before I publish the results...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 15 posts - 1 through 15 (of 29 total)

You must be logged in to reply to this topic. Login to reply