Parsing out a string with multiple entries per record idea

  • All,

    here's the problem.

    I have a text/blob field that has multiple entries that are pseudo date/time stamped.

    I have been trying to use this as an delimiter and in line with a split parser, but am not having much luck.

    I need to break out each text entry per record id and migrate to a new table.

    Any ideas on how to attack?

  • If you could post a (smallish, representative) sample of your text data and roughly what you might want from it, you'd probably be inundated with tips.  It's a bit difficult working blind.

  • The datafield is similar to this:

     

    14:57:00 12/01/06 (Pacific) Person making note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed 12:57:00 11/01/06 (Pacific) Person making note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed

     

    Part of the problem is that the application is doing rtf formating and then storing that formating with the text in the field.  The other rub is that the person doing the notation is not completely standardized. So the general theme is near the same, but not exact, so there's lots of extra characters, carriage returns, escape characters, etc.

    As a  result simple parsing using the time date stamp as aa delimiter produces varied results.

    After looking at the data, its going to have to be scrubbed and then standardized before its parsed out.

    The request for this was made without awareness of the state of the data and the expectation that it was a very simple operation.

    However, if you have any examples of string parsing with options to parse out any non standard characters that would be awesome.

     

     

     

     

  • Here is some sample code that I could think of ... this is just to get you an idea. you need to put this code in WHILE loop inorder to process the complete string. I hope "(pacific)" is ths string that would appear after each date.

    DECLARE

    @val varchar(4000)

    DECLARE

    @pos1 int

    DECLARE

    @pos2 int

    SET

    @val = '14:57:00 12/01/06 (Pacific) Person making note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed 12:57:00 11/01/06 (Pacific) Person making note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed '

    SET

    @pos1 = CHARINDEX('(Pacific)', @val) + 19

    SET

    @pos2 = CHARINDEX('(Pacific)', @val, @pos1)

    PRINT

    @pos1

    PRINT

    @pos2

    SELECT

    substring(@val, @pos1, (@pos2 - @pos1))

    in your WHILE loop @pos2 becomes @pos1 this way you can parse it easily.

     

  • so u want to split the record:

    14:57:00 12/01/06 (Pacific) Person making note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed 12:57:00 11/01/06 (Pacific) Person making note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed

    into two records under the same ID:

    14:57:00 12/01/06 (Pacific) Person making note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed

    12:57:00 11/01/06 (Pacific) Person making note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed

    is that the idea? the reason I'm asking is that u never clearly specified what to include in the splits.

    i'm assuming u use SQL 2005.

  • yes, thats the idea

     

  • Yes, all that my code do is it splits the data based on keyword. I just gave an idea on to achieve the splitting of data.

    I'm using SQL2005, but I presume the same code will work in SQL2000 since the string functions are almost similar in both versions.

     

  • yes, but the problem is Max does not know what keyword will be there up front: his data could vary. What he needs is to split on a pattern

    [date] [some_text]

    , which justifies using regular expressions in this case.

    Max did u ever use regular expressions in SQL 2005? That's the way to handle the task.

  • kkprasad,

    just tried your code, it returns

    'king note Subject: blah, blah, blah Complaint: duh duh duh Object: dee dee deed 12:57:00 11/01/06'

    is that what u expected, or i'm doing smth wrong here? 

  • You don't need to use regular expressions, PATINDEX would do the trick (and also allow for other time zones' dates in the data).  Starting position can be determined by :

     

    select @pos1 = patindex('%[0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9]/[0-9][0-9]/[0-9][0-9] (%', @val)

  • The pity is, of course, that PATINDEX only takes 2 arguments (so no "starting from" displacement) so that to get both @pos1 and @pos2 you have to do something ugly like:

    select @pos1 = patindex('%[0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9]/[0-9][0-9]/[0-9][0-9] (%', @val)

     ,@pos2 = patindex('%[0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9]/[0-9][0-9]/[0-9][0-9] (%', substring(@val, 2, len(@val)-1))

    And i'm sure there are neater ways of doing this, but yuou get the idea.

  • a hidden issue that complicates things, is that the application is using rtf so there might be hidden characters, formatting that impacts the delimiter

     

    The data field needs to be scrubbed down and reformatted then parsed

     

     

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply