Optimization: Turning a series of entries into columns

  • Here is my situation

    I have a table like this:

    DocID  FieldName  FieldValue
    1      Number     1000
    1      DocName    Murphy.doc
    1      City       Greenville
    2      Number     50
    2      DocName    Biggs.doc
    2      City       NULL

    I want to change it into a table like this:

    DocID  Number  DocName     City
    1      1000    Murphy.doc  Greenville
    2      50      Biggs.doc   NULL

    Currently I have done this with a series of left outer joins on the table for each field I'm using:

    Example:

    FROM Table1 AS Doc
    LEFT OUTER JOIN Table1 AS Number
        ON Doc.DocID = Number.DocID
        AND Number.FieldName = 'NUMBER'
    LEFT OUTER JOIN Table1 AS DocName
        ON Doc.DocID = DocName.DocID
        AND DocName.FieldName = 'DOCNAME'
    LEFT OUTER JOIN Table1 AS City
        ON Doc.DocID = City.DocID
        AND City.FieldName = 'CITY'

    This works just fine, but it is very, very slow when the number of fields is 100+ and the number of records is 100k. I made a clustered index on the DocID and FieldName fields. My execution plan shows that the clustered index is used for all the joins and is a very small part of the process. The DISTINCT clause in the SELECT statement is what's killing me; it's taking 89% of the overhead. However, without the DISTINCT clause I get a huge number of duplicate records.

    Is there a better way to do this? I have heard perhaps pivot tables could do something similar to what I need.

  • You might want to have a look at the CrossTab and DynamicCrossTab articles referenced in my signature.

    The former will show you an example of how to build pivoted data with a static number of final columns, whereas the latter will show you a way to do it dynamically (unknown number of final columns). The second approach actually cannot be achieved using PIVOT (afaik).
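As a rough illustration of the dynamic idea (this is my own sketch, not the article's actual code; it assumes the Table1 layout from the question, with DocID, FieldName, and FieldValue columns): the list of MAX(CASE ...) expressions is generated from the distinct FieldName values and then executed, so the final column list does not have to be known in advance.

```sql
-- Hypothetical sketch: build one MAX(CASE ...) per distinct FieldName,
-- then execute the generated statement with sp_executesql.
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

SELECT @cols = STUFF((
        SELECT ', MAX(CASE WHEN FieldName = ''' + FieldName
             + ''' THEN FieldValue END) AS [' + FieldName + ']'
        FROM (SELECT DISTINCT FieldName FROM Table1) AS f
        ORDER BY FieldName
        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '');

SET @sql = N'SELECT DocID, ' + @cols
         + N' FROM Table1 GROUP BY DocID ORDER BY DocID;';

EXEC sp_executesql @sql;
```

The GROUP BY collapses the rows per DocID in one pass, which also avoids the DISTINCT that was eating 89% of the original plan.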



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

    Thanks Jeff, that definitely puts me in the right direction.

    However, I do have one complication that may invalidate that direction.

    One thing I did not mention in my example is that sometimes a field can have two values for the same record.

    Example

    ID  NAME  VALUE
    1   CITY  GREENVILLE
    1   CITY  NASHVILLE

    In my original method, I can account for these, but if I understand the crosstab code, I would have to use a MAX aggregate on my value field in order to make it work, and that would exclude one of the results.

    Is there a way to get around this? Right now I'm pondering a hybrid of our two methods: use yours for all values that aren't duplicated, then use mine for the ones that are. I'll have to see if I still get a good speed increase from that.

    I'm not Jeff ( 😀 ), I'm just promoting his great articles because they are simply the best ones I've found on this subject. I'd use ROW_NUMBER() to differentiate those rows and use it as a "combined column" in the dynamic cross tab.

    Something like:

    DECLARE @tbl TABLE
    (
        ID INT, NAME VARCHAR(30), VALUE VARCHAR(30)
    );

    INSERT INTO @tbl
    SELECT 1, 'CITY', 'GREENVILLE' UNION ALL
    SELECT 1, 'CITY', 'NASHVILLE';

    SELECT *,
           name + CAST(ROW_NUMBER() OVER(PARTITION BY name ORDER BY VALUE) AS VARCHAR(3)) AS newCol
    FROM @tbl;



    Lutz

    That would work, with the one caveat that it tremendously increases the number of CASE expressions I need.

    Because every entry is now unique to every record, the number of CASE expressions goes through the roof, if I'm not mistaken.

    Can SQL actually support that many statements?

  • huston.dunlap (4/15/2010)


    That would work, with the one caveat that it tremendously increases the number of CASE expressions I need.

    Because every entry is now unique to every record, the number of CASE expressions goes through the roof, if I'm not mistaken.

    Can SQL actually support that many statements?

    In order to understand what you mean, please provide sample data and expected results that actually show what you're struggling with.

    Currently, it doesn't make that much sense (to me, at least)...



    Lutz

  • lmu92 (4/16/2010)

    In order to understand what you mean please provide sample data and expected results that actually show what you're struggling with.

    Currently, it doesn't make that much sense (to me, at least)...

    Sure, it's the least I can do. For security reasons I can't show you the actual data, so let me try another example.

    This is an area of the table where the category field might be duplicated.

    ID  CATEGORY  NAME
    1   CITY      Greenville
    1   CITY      Nashville
    1   CITY      Bigville
    2   CITY      Turnerville
    2   CITY      Southville

    Using your row_number method, I would get a result like this:

    1  CITY1  Greenville
    1  CITY2  Nashville
    1  CITY3  Bigville
    2  CITY4  Turnerville
    2  CITY5  Southville

    So for my CASE statement, I would have to make a statement for every city possibility, as opposed to just the word CITY. So the number of statements begins to equal the number of records, which might have scaling issues.

    Thank you for working with me on this btw!

    Based on your sample data, how about this:

    --DocID FieldName FieldValue
    --1     Number    1000
    --1     DocName   Murphy.doc
    --1     City      Greenville
    --2     Number    50
    --2     DocName   Biggs.doc
    --2     City      Null

    create table #testtab (
        DocID int,
        FieldName varchar(50),
        FieldValue varchar(50)
    );

    insert into #testtab
    select 1, 'Number', '1000' union all
    select 1, 'DocName', 'Murphy.doc' union all
    select 1, 'City', 'Greenville' union all
    select 2, 'Number', '50' union all
    select 2, 'DocName', 'Biggs.doc' union all
    select 2, 'City', null;

    select
        DocID,
        max(case when FieldName = 'Number' then FieldValue else null end) as Number,
        max(case when FieldName = 'DocName' then FieldValue else null end) as DocName,
        max(case when FieldName = 'City' then FieldValue else null end) as City
    from #testtab
    group by DocID
    order by DocID;

    Change PARTITION BY name to PARTITION BY id, name to get the cities numbered for each id.

    Should help a little... 😉
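A quick sketch of what that change does with the sample data above (my own illustration, using the column names from the earlier example, not code from the articles):

```sql
-- Numbering restarts per ID once the partition includes ID.
DECLARE @tbl TABLE (ID INT, NAME VARCHAR(30), VALUE VARCHAR(30));

INSERT INTO @tbl
SELECT 1, 'CITY', 'Greenville'  UNION ALL
SELECT 1, 'CITY', 'Nashville'   UNION ALL
SELECT 1, 'CITY', 'Bigville'    UNION ALL
SELECT 2, 'CITY', 'Turnerville' UNION ALL
SELECT 2, 'CITY', 'Southville';

SELECT *,
       NAME + CAST(ROW_NUMBER() OVER(PARTITION BY ID, NAME
                                     ORDER BY VALUE) AS VARCHAR(3)) AS newCol
FROM @tbl;
-- ID 1 yields CITY1..CITY3 and ID 2 yields CITY1..CITY2, so the number of
-- generated columns is bounded by the maximum repeats per ID, not the row count.
```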



    Lutz

  • huston.dunlap (4/16/2010)


    lmu92 (4/16/2010)

    In order to understand what you mean please provide sample data and expected results that actually show what you're struggling with.

    Currently, it doesn't make that much sense (to me, at least)...

    Sure, least I can do. Due to security reasons I can't show you the actual data, so let me try another example.

    This is an area of the table where the category field might be duplicated.

    ID  CATEGORY  NAME
    1   CITY      Greenville
    1   CITY      Nashville
    1   CITY      Bigville
    2   CITY      Turnerville
    2   CITY      Southville

    Using your row_number method, I would get a result like this:

    1  CITY1  Greenville
    1  CITY2  Nashville
    1  CITY3  Bigville
    2  CITY4  Turnerville
    2  CITY5  Southville

    So for my CASE statement, I would have to make a statement for every city possibility, as opposed to just the word CITY. So the number of statements begins to equal the number of records, which might have scaling issues.

    Thank you for working with me on this btw!

    I guess my first question would be... what is the purpose of pivoting this data into multiple columns? Is it because someone needs to see it in a spreadsheet, or something else? In theory, this could easily overwhelm even a spreadsheet, because you said in your original post that there are 100+ types of entries/columns.

    Without seeing the actual dataset in this table, I believe my recommendation would be to first normalize the data you currently have in your EAV table, and then things will get a whole lot easier.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

    Thinking about it more, I think what I will do is turn the fields into a semicolon-delimited list, getting rid of the duplicate fields.

    Then I can implement the solution lmu92 originally helped me with.

    Thank you all for the help, and wish me luck!
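For what it's worth, one common T-SQL way to build such a list (a sketch against the hypothetical Table1 layout from the question; the FOR XML PATH concatenation trick, with the delimiter hard-coded here):

```sql
-- Collapse duplicate (DocID, FieldName) values into one delimited string
-- so each combination survives as a single row before pivoting.
SELECT DocID,
       FieldName,
       STUFF((SELECT '; ' + t2.FieldValue
              FROM Table1 AS t2
              WHERE t2.DocID = t1.DocID
                AND t2.FieldName = t1.FieldName
              ORDER BY t2.FieldValue
              FOR XML PATH('')), 1, 2, '') AS FieldValues
FROM Table1 AS t1
GROUP BY DocID, FieldName;
```

The STUFF(..., 1, 2, '') simply strips the leading delimiter; swapping '; ' for another separator (or CHAR(9)) is a one-character change.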

  • huston.dunlap (4/20/2010)


    Thinking about it more, I think what I will do is turn the fields into a semicolon-delimited list, getting rid of the duplicate fields.

    Then I can implement the solution lmu92 originally helped me with.

    Thank you all for the help, and wish me luck!

    If you're going to do that, why not use the TAB character (CHAR(9) in T-SQL and a couple of other languages)? THAT way, if you ever do need to bring it into a spreadsheet, you won't have to mess with identifying delimiters and the like... it'll just happen.

    --Jeff Moden



  • Jeff Moden (4/20/2010)


    huston.dunlap (4/20/2010)


    Thinking about it more, I think what I will do is turn the fields into a semicolon-delimited list, getting rid of the duplicate fields.

    Then I can implement the solution lmu92 originally helped me with.

    Thank you all for the help, and wish me luck!

    If you're going to do that, why not use the TAB character (CHAR(9) in T-SQL and a couple of other languages)? THAT way, if you ever do need to bring it into a spreadsheet, you won't have to mess with identifying delimiters and the like... it'll just happen.

    Oh, the semicolon list was just an example; ultimately I'll let the user determine their delimiter.

    Again, thanks everyone!
