Import "structured" data

  • I have to import data from a "structured" text file. The file consists of records where a line type code identifies how each record is formatted. I have no issues pulling the data into SQL Server and separating it out. My issue is with one particular line type. Let's say that this line type contains data for reports on a person.

    Here is a representation of what it would look like

    SELECT *
    FROM ( VALUES ( 1, 1, 'Alfa', 'Bravo'),
                  ( 1, 2, '', 'Charlie'),
                  ( 1, 3, '', 'Delta'),
                  ( 2, 6, 'Echo', 'Foxtrot'),
                  ( 2, 7, '', 'Golf'),
                  ( 2, 8, '', 'Hotel'),
                  ( 2, 15, 'India', 'Juliett'),
                  ( 2, 16, '', 'Kilo'),
                  ( 2, 17, '', 'Lima'),
                  ( 2, 18, '', 'Mike') ) AS Temp ( PersonID, Seq, Item1, Item2 );

    The PersonID represents a particular person and there will be a sequence number specifying the order of the original records (which resets for a new person). There can be multiple reports per ID but there is no identifier for them. The only way that I can identify separate reports for a given person is a break in the sequence number. The first record for each report contains different information than the following records for that report.

    Here is a representation of what I would like to get as a result

    SELECT *
    FROM ( VALUES ( 1, 3, 'Alfa', 'Bravo', 'Charlie,Delta'),
                  ( 2, 8, 'Echo', 'Foxtrot', 'Golf,Hotel'),
                  ( 2, 18, 'India', 'Juliett', 'Kilo,Lima,Mike') ) AS Temp2 ( PersonID, LastSeq, Return1, Return2, Return3 );

    In essence, I need to identify each report for each person; the first line of a report goes into certain fields and the remaining records for that report are concatenated into another field. I also need the last sequence number so I can match it up to another record (with a different line type) whose sequence number is equal to LastSeq + 1. (A small sketch of flagging report starts set-based appears at the end of this post.)

    I can get this done by using a (dreaded) cursor but I was wondering if there was a way to avoid that.

    I hope this makes sense. I will do my best to clarify items you have questions about. I appreciate any help you can give.

    *Edit to correct the output
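    To make the "break in the sequence number" rule concrete, here is a minimal sketch of how a report start could be flagged set-based. It assumes SQL Server 2012 or later (for LAG) and that the sample rows above have been loaded into a table (called #Temp here, matching the replies below); it only illustrates the grouping rule, not the full result.

    WITH Flagged AS (
        SELECT PersonID, Seq, Item1, Item2,
               -- a row starts a new report when Item1 is populated,
               -- when it is the first row for the person,
               -- or when the sequence number breaks
               CASE WHEN Item1 <> ''
                      OR LAG(Seq) OVER (PARTITION BY PersonID ORDER BY Seq) IS NULL
                      OR Seq <> LAG(Seq) OVER (PARTITION BY PersonID ORDER BY Seq) + 1
                    THEN 1 ELSE 0 END AS IsReportStart
        FROM   #Temp
    )
    SELECT PersonID, Seq, Item1, Item2,
           -- running total of the flags numbers the reports within each person
           SUM(IsReportStart) OVER (PARTITION BY PersonID ORDER BY Seq
                                    ROWS UNBOUNDED PRECEDING) AS ReportNo
    FROM Flagged;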

  • Interesting question, looking forward to the replies!

    I personally would probably use a loop for this. The difference is that I'd loop over the number of concatenations needed, not the number of records. But the last time I posted one of those it was dismissed as being a loop, so I'll just watch the other answers as they'll probably be better anyway! (A rough sketch of that looping idea appears after the reposted setup below.)

    Reformatted a bit, as it took me a few extra steps to copy this into an SSMS window because my browser sucks, and maybe others do too!

    -- Here is a representation of what it would look like
    SELECT *
    FROM ( VALUES ( 1, 1, 'Alfa', 'Bravo'),
                  ( 1, 2, '', 'Charlie'),
                  ( 1, 3, '', 'Delta'),
                  ( 2, 6, 'Echo', 'Foxtrot'),
                  ( 2, 7, '', 'Golf'),
                  ( 2, 8, '', 'Hotel'),
                  ( 2, 15, 'India', 'Juliett'),
                  ( 2, 16, '', 'Kilo'),
                  ( 2, 17, '', 'Lima'),
                  ( 2, 18, '', 'Mike') ) AS Temp ( PersonID, Seq, Item1, Item2 );

    -- Here is a representation of what the poster would like to get as a result
    SELECT *
    FROM ( VALUES ( 1, 3, 'Alfa', 'Bravo', 'Charlie,Delta'),
                  ( 2, 8, 'Echo', 'Foxtrot', 'Golf,Hotel'),
                  ( 2, 18, 'India', 'Juliett', 'Kilo,Lima,Mike') ) AS Temp2 ( PersonID, LastSeq, Return1, Return2, Return3 );
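    A rough sketch of the loop-over-concatenations idea mentioned above (not the poster's actual code). It assumes the sample rows have been loaded into #Temp as in the replies below, and the #Work table name is made up here. Each pass of the loop absorbs one more continuation row into every report, so the loop runs only as many times as the longest report has continuation rows.

    -- one row per report: the "first line" rows carry a value in Item1
    SELECT PersonID,
           Seq   AS LastSeq,                    -- advances as continuations are absorbed
           Item1 AS Return1,
           Item2 AS Return2,
           CAST('' AS varchar(max)) AS Return3  -- accumulator for the continuation text
    INTO   #Work
    FROM   #Temp
    WHERE  Item1 <> '';

    DECLARE @rc int = 1;
    WHILE @rc > 0
    BEGIN
        -- absorb the next continuation row (same person, next Seq, blank Item1)
        UPDATE w
        SET    Return3 = w.Return3 + CASE WHEN w.Return3 = '' THEN '' ELSE ',' END + t.Item2,
               LastSeq = t.Seq
        FROM   #Work w
        JOIN   #Temp t
          ON   t.PersonID = w.PersonID
         AND   t.Seq      = w.LastSeq + 1
         AND   t.Item1    = '';
        SET @rc = @@ROWCOUNT;
    END;

    SELECT PersonID, LastSeq, Return1, Return2, Return3
    FROM   #Work
    ORDER BY PersonID, LastSeq;

    DROP TABLE #Work;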

  • There's probably a better way to get this working without having to read the table 3 times.

    I'd be interested in seeing the loop, as it could be faster than some options that are supposed to be set-based.

    I refrained from suggesting the Quirky Update, but that could be another possibility.

    The results of this query go against normalization and shouldn't be stored that way. Actually, the design of both versions of the data is flawed.

    That being said, here's what I got.

    SELECT *
    INTO #Temp
    FROM ( VALUES ( 1, 1, 'Alfa', 'Bravo'),
                  ( 1, 2, '', 'Charlie'),
                  ( 1, 3, '', 'Delta'),
                  ( 2, 6, 'Echo', 'Foxtrot'),
                  ( 2, 7, '', 'Golf'),
                  ( 2, 8, '', 'Hotel'),
                  ( 2, 15, 'India', 'Juliett'),
                  ( 2, 16, '', 'Kilo'),
                  ( 2, 17, '', 'Lima'),
                  ( 2, 18, '', 'Mike') ) AS Temp ( PersonID, Seq, Item1, Item2 );
    GO

    WITH CTE AS (
        SELECT PersonID,
               Seq,
               Item1 AS Return1,
               Item2 AS Return2,
               LEAD(Seq) OVER(ORDER BY Seq) AS NextSeq
        FROM #Temp
        WHERE Item1 <> ''
    )
    SELECT CTE.PersonID,
           ls.LastSeq,
           CTE.Return1,
           CTE.Return2,
           STUFF(( SELECT ',' + Item2
                   FROM #Temp i
                   WHERE i.PersonID = CTE.PersonID
                     AND i.Item1 = ''
                     AND i.Seq >= CTE.Seq
                     AND ( i.Seq < CTE.NextSeq OR CTE.NextSeq IS NULL)
                   FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '')
    FROM CTE
    CROSS APPLY( SELECT MAX(i.Seq) AS LastSeq
                 FROM #Temp i
                 WHERE i.Seq >= CTE.Seq
                   AND ( i.Seq < CTE.NextSeq OR CTE.NextSeq IS NULL)) ls
    ORDER BY Seq;
    GO

    DROP TABLE #Temp;

    Reference for the concatenation: http://qa.sqlservercentral.com/articles/comma+separated+list/71700/
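    For readers on SQL Server 2017 or later (newer than this thread), the same concatenation can be written with STRING_AGG instead of the FOR XML PATH/STUFF trick. A sketch of just the final SELECT, with the CTE and #Temp setup above unchanged:

    SELECT CTE.PersonID,
           ls.LastSeq,
           CTE.Return1,
           CTE.Return2,
           -- cast Item2 to varchar(max) inside STRING_AGG if the combined text can be long
           ( SELECT STRING_AGG(i.Item2, ',') WITHIN GROUP (ORDER BY i.Seq)
             FROM   #Temp i
             WHERE  i.PersonID = CTE.PersonID
               AND  i.Item1 = ''
               AND  i.Seq >= CTE.Seq
               AND  ( i.Seq < CTE.NextSeq OR CTE.NextSeq IS NULL) ) AS Return3
    FROM CTE
    CROSS APPLY( SELECT MAX(i.Seq) AS LastSeq
                 FROM #Temp i
                 WHERE i.Seq >= CTE.Seq
                   AND ( i.Seq < CTE.NextSeq OR CTE.NextSeq IS NULL)) ls
    ORDER BY Seq;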

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis,

    Your code works. I shall have to see about extrapolating it to my actual data. I will agree that the original data structure is flawed but it's from an outside vendor and nothing that I can change. As for the concatenation which I believe is the reason why you feel my reformatted output is flawed, I'm not really making a comma-delimited list. The actual data is more of a separated "paragraph" for want of a better word. It's a data field which spans over multiple records because it exceeds the length of one record. I'm just trying to put it back into one field.

    Many thanks,

    JT

  • JTSash (8/5/2016)


    Luis,

    Your code works. I shall have to see about extrapolating it to my actual data. I will agree that the original data structure is flawed but it's from an outside vendor and nothing that I can change. As for the concatenation which I believe is the reason why you feel my reformatted output is flawed, I'm not really making a comma-delimited list. The actual data is more of a separated "paragraph" for want of a better word. It's a data field which spans over multiple records because it exceeds the length of one record. I'm just trying to put it back into one field.

    Many thanks,

    JT

    I understand that you can't change the input. And now that you mention the separated paragraph, the concatenated output makes sense.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
    Here is another variation that might perform better, depending on the production volume and the existence of a clustered index on the [Seq] column. (If the index is on [PersonID], [Seq], then a modification is needed to the WHERE clause of the subquery at the end; see the note after the code.) The size of the sample data is so small that an index actually requires more logical reads than just leaving the table as a heap, but an index should exist on any production table.

    This approach borrows from the technique used to identify gaps/islands in numeric sequences in order to create the [Grp] column; a worked illustration of that arithmetic follows below. A summary query produces all of the columns except for [Return3]. The final step applies a subquery using the FOR XML technique to build the concatenated string for each summary row.

    Originally, this code sequenced the rows within [Grp] and used that number in a CASE expression to identify the first row. However, this added an additional sort. Since the first row of a group is defined as having a value in [Item1], that test was substituted and the sequence within [Grp] was dropped.

    Again, this should be tested at production volumes. If Luis' solution is running fast enough for your needs, don't bother. I just took his comment about making fewer passes as a challenge, and besides, it was a thought-provoking problem. 😉
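    Worked by hand against the sample rows for PersonID 2 (Seq 6 through 18), the SimpleSeq - X arithmetic from the first two CTEs resolves like this:

    Seq  Item1     SimpleSeq   X   Grp (= SimpleSeq - X)
      6  'Echo'            4   2   2
      7  ''                5   3   2
      8  ''                6   4   2
     15  'India'           7   4   3
     16  ''                8   5   3
     17  ''                9   6   3
     18  ''               10   7   3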

    SELECT *
    INTO #temp
    FROM ( VALUES ( 1, 1, 'Alfa', 'Bravo')
                 ,( 1, 2, '', 'Charlie')
                 ,( 1, 3, '', 'Delta')
                 ,( 2, 6, 'Echo', 'Foxtrot')
                 ,( 2, 7, '', 'Golf')
                 ,( 2, 8, '', 'Hotel')
                 ,( 2, 15, 'India', 'Juliett')
                 ,( 2, 16, '', 'Kilo')
                 ,( 2, 17, '', 'Lima')
                 ,( 2, 18, '', 'Mike')
                 ,( 2, 19, 'Do', 'Re')
                 ,( 2, 20, '', 'Mi')
                 ,( 2, 23, '', 'Fa')
         ) AS Temp ( PersonID, Seq, Item1, Item2 );

    --create clustered index PK_#TEMP on #temp (Seq); -- Optional, but a good idea

    set statistics time, io on;

    -- a representation of the desired result (from the original post), for reference
    SELECT *
    FROM ( VALUES ( 1, 3, 'Alfa', 'Bravo', 'Charlie,Delta'),
                  ( 2, 8, 'Echo', 'Foxtrot', 'Golf,Hotel'),
                  ( 2, 18, 'India', 'Juliett', 'Kilo,Lima,Mike') ) AS Temp2 ( PersonID, LastSeq, Return1, Return2, Return3 );

    -- first cte sets up treating this as a gaps/islands problem by numbering the rows (SimpleSeq)
    -- and then creating a staggered series of numbers based on the existence of a value in Item1
    ;with cte as (
        select *, ROW_NUMBER() over(order by Seq) as SimpleSeq
              , SUM(case when item1 > '' then 0 else 1 end) over(order by Seq ROWS UNBOUNDED PRECEDING) as X
        from #temp
    )
    -- subtracting the staggered number from the simple row numbers (SimpleSeq - X) produces an obvious grouping column (Grp)
    ,cte2 as (
        select *, SimpleSeq - X as Grp
        from cte
    )
    -- next we do a summary query by Grp to get necessary min/max values
    ,cte3 as (
        select Grp, max(PersonID) as PersonID, min(Seq) as SmallestSeq, max(Seq) as LargestSeq
              , max(item1) as Return1
              , max(case when item1 > '' then Item2 else null end) as Return2
        from cte2
        group by Grp
    )
    -- and finally, for each summary row we cross apply the FOR XML query that concatenates Item2 values
    -- for all but the first row in every Grp
    -- (If #temp is indexed on Seq, this produces a total of three index seeks.)
    select cte3.*, isnull(Item3, '') as Return3
    from cte3
    cross apply ( select STUFF(( SELECT ',' + Item2
                                 FROM #Temp i
                                 WHERE Seq BETWEEN SmallestSeq + 1 AND LargestSeq
                                 ORDER BY Seq
                                 FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '')) subqry (Item3);

    set statistics time, io off;
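    The modification mentioned at the top for an index on [PersonID], [Seq] would presumably be adding the person to the final subquery's predicate so a seek can use the leading key; roughly (an assumption about the intended change, not tested here):

    WHERE i.PersonID = cte3.PersonID
      AND i.Seq BETWEEN SmallestSeq + 1 AND LargestSeq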

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Dixie,

    Thank you for fine-tuning Luis' code. I haven't had an opportunity yet to test his version, but I will work on testing both versions. The number of records is relatively small, so performance shouldn't be an issue. However, I usually try to code for efficiency because it tends to cause fewer issues down the road.

    JT
