Optimization: Turning a series of entries into columns

  • Here is my situation

    I have a table like this:

    DocID  FieldName  FieldValue
    1      Number     1000
    1      DocName    Murphy.doc
    1      City       Greenville
    2      Number     50
    2      DocName    Biggs.doc
    2      City       NULL

    I want to change it into a table like this:

    DocID  Number  DocName     City
    1      1000    Murphy.doc  Greenville
    2      50      Biggs.doc   NULL

    Currently I have done this with a series of left outer joins on the table for each field I'm using:

    Example:

    FROM Table1 AS Doc
    LEFT OUTER JOIN Table1 AS Number
        ON Doc.DocID = Number.DocID
        AND Number.FieldName = 'NUMBER'
    LEFT OUTER JOIN Table1 AS DocName
        ON Doc.DocID = DocName.DocID
        AND DocName.FieldName = 'DOCNAME'
    LEFT OUTER JOIN Table1 AS City
        ON Doc.DocID = City.DocID
        AND City.FieldName = 'CITY'

    This works just fine, but it is very, very slow when the number of fields is 100+ and the number of records is 100k. I made a clustered index on the DocID and FieldName fields. My execution plan shows that the clustered index is used for all the joins and is a very small part of the process. The DISTINCT clause in the SELECT statement is what's killing me; it's taking 89% of the overhead. However, without the DISTINCT clause I get a huge number of duplicate records.

    Is there a better way to do this? I have heard perhaps pivot tables could do something similar to what I need.

  • You might want to have a look at the CrossTab and DynamicCrossTab articles referenced in my signature.

    The former will show you an example of how to build pivoted data with a static number of final columns, whereas the latter will show you a way to do it dynamically (unknown number of final columns). The second approach actually cannot be achieved using PIVOT (afaik).
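As a rough illustration of the dynamic idea (this is my own sketch, not the article's actual code; it assumes the Table1 layout from the question, with DocID, FieldName, and FieldValue columns): the list of MAX(CASE ...) expressions is generated from the distinct FieldName values and then executed, so the final column list does not have to be known in advance.

```sql
-- Hypothetical sketch: build one MAX(CASE ...) per distinct FieldName,
-- then execute the generated statement with sp_executesql.
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

SELECT @cols = STUFF((
        SELECT ', MAX(CASE WHEN FieldName = ''' + FieldName
             + ''' THEN FieldValue END) AS [' + FieldName + ']'
        FROM (SELECT DISTINCT FieldName FROM Table1) AS f
        ORDER BY FieldName
        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '');

SET @sql = N'SELECT DocID, ' + @cols
         + N' FROM Table1 GROUP BY DocID ORDER BY DocID;';

EXEC sp_executesql @sql;
```

The GROUP BY collapses the rows per DocID in one pass, which also avoids the DISTINCT that was eating 89% of the original plan.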



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

    Thanks Jeff, that definitely puts me in the right direction.

    However, I do have one complication that may invalidate that direction.

    One thing I did not mention in my example is that sometimes a field can have two values for the same record.

    Example

    ID  NAME  VALUE
    1   CITY  GREENVILLE
    1   CITY  NASHVILLE

    In my original method, I can account for these, but if I understand the crosstab code, I would have to use a MAX aggregate on my value field in order to make it work, and that would exclude one of the results.

    Is there a way to get around this? Right now I'm pondering a hybrid of our two methods: use yours for all values that aren't duplicated, then use mine for the ones that are. I'll have to see if I still get a good speed increase from that.

    I'm not Jeff ( 😀 ), I'm just promoting his great articles because they are simply the best ones I've found on this subject. I'd use ROW_NUMBER() to differentiate those rows and use it as a "combined column" in the dynamic cross tab.

    Something like:

    DECLARE @tbl TABLE
    (
        ID INT, NAME VARCHAR(30), VALUE VARCHAR(30)
    );

    INSERT INTO @tbl
    SELECT 1, 'CITY', 'GREENVILLE' UNION ALL
    SELECT 1, 'CITY', 'NASHVILLE';

    SELECT *,
           name + CAST(ROW_NUMBER() OVER(PARTITION BY name ORDER BY VALUE) AS VARCHAR(3)) AS newCol
    FROM @tbl;



    Lutz

    That would work, with the one caveat that it tremendously increases the number of CASE expressions I need.

    Because every entry is now unique to every record, the number of CASE expressions goes through the roof, if I'm not mistaken.

    Can SQL actually support that many statements?

  • huston.dunlap (4/15/2010)


    That would work, with the one caveat that it tremendously increases the number of CASE expressions I need.

    Because every entry is now unique to every record, the number of CASE expressions goes through the roof, if I'm not mistaken.

    Can SQL actually support that many statements?

    In order to understand what you mean, please provide sample data and expected results that actually show what you're struggling with.

    Currently, it doesn't make that much sense (to me, at least)...



    Lutz

  • lmu92 (4/16/2010)

    In order to understand what you mean please provide sample data and expected results that actually show what you're struggling with.

    Currently, it doesn't make that much sense (to me, at least)...

    Sure, it's the least I can do. For security reasons I can't show you the actual data, so let me try another example.

    This is an area of the table where the category field might be duplicated.

    ID  CATEGORY  NAME
    1   CITY      Greenville
    1   CITY      Nashville
    1   CITY      Bigville
    2   CITY      Turnerville
    2   CITY      Southville

    Using your row_number method, I would get a result like this:

    1  CITY1  Greenville
    1  CITY2  Nashville
    1  CITY3  Bigville
    2  CITY4  Turnerville
    2  CITY5  Southville

    So for my CASE statement, I would have to make a statement for every city possibility, as opposed to just the word CITY. So the number of statements begins to equal the number of records, which might have scaling issues.

    Thank you for working with me on this btw!

    Based on your sample data, how about this:

    --DocID FieldName FieldValue
    --1     Number    1000
    --1     DocName   Murphy.doc
    --1     City      Greenville
    --2     Number    50
    --2     DocName   Biggs.doc
    --2     City      Null

    create table #testtab (
        DocID int,
        FieldName varchar(50),
        FieldValue varchar(50)
    );

    insert into #testtab
    select 1, 'Number', '1000' union all
    select 1, 'DocName', 'Murphy.doc' union all
    select 1, 'City', 'Greenville' union all
    select 2, 'Number', '50' union all
    select 2, 'DocName', 'Biggs.doc' union all
    select 2, 'City', null;

    select
        DocID,
        max(case when FieldName = 'Number' then FieldValue else null end) as Number,
        max(case when FieldName = 'DocName' then FieldValue else null end) as DocName,
        max(case when FieldName = 'City' then FieldValue else null end) as City
    from #testtab
    group by DocID
    order by DocID;

    Change PARTITION BY name to PARTITION BY id, name to get the cities numbered for each id.

    Should help a little... 😉
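A quick sketch of what that change does with the sample data above (my own illustration, using the column names from the earlier example, not code from the articles):

```sql
-- Numbering restarts per ID once the partition includes ID.
DECLARE @tbl TABLE (ID INT, NAME VARCHAR(30), VALUE VARCHAR(30));

INSERT INTO @tbl
SELECT 1, 'CITY', 'Greenville'  UNION ALL
SELECT 1, 'CITY', 'Nashville'   UNION ALL
SELECT 1, 'CITY', 'Bigville'    UNION ALL
SELECT 2, 'CITY', 'Turnerville' UNION ALL
SELECT 2, 'CITY', 'Southville';

SELECT *,
       NAME + CAST(ROW_NUMBER() OVER(PARTITION BY ID, NAME
                                     ORDER BY VALUE) AS VARCHAR(3)) AS newCol
FROM @tbl;
-- ID 1 yields CITY1..CITY3 and ID 2 yields CITY1..CITY2, so the number of
-- generated columns is bounded by the maximum repeats per ID, not the row count.
```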



    Lutz

  • huston.dunlap (4/16/2010)


    lmu92 (4/16/2010)

    In order to understand what you mean please provide sample data and expected results that actually show what you're struggling with.

    Currently, it doesn't make that much sense (to me, at least)...

    Sure, least I can do. Due to security reasons I can't show you the actual data, so let me try another example.

    This is an area of the table where the category field might be duplicated.

    ID  CATEGORY  NAME
    1   CITY      Greenville
    1   CITY      Nashville
    1   CITY      Bigville
    2   CITY      Turnerville
    2   CITY      Southville

    Using your row_number method, I would get a result like this:

    1  CITY1  Greenville
    1  CITY2  Nashville
    1  CITY3  Bigville
    2  CITY4  Turnerville
    2  CITY5  Southville

    So for my CASE statement, I would have to make a statement for every city possibility, as opposed to just the word CITY. So the number of statements begins to equal the number of records, which might have scaling issues.

    Thank you for working with me on this btw!

    I guess my first question would be... what is the purpose of pivoting this data into multiple columns? Is it because someone needs to see it in a spreadsheet, or something else? In theory, this could easily overwhelm even a spreadsheet, because you said in your original post that there are 100+ types of entries/columns.

    Without seeing the actual dataset in this table, I believe my recommendation would be to first normalize the data you currently have in your EAV table, and then things will get a whole lot easier.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

    Thinking about it more, I think what I will do is turn the fields into a semicolon-delimited list, getting rid of the duplicate fields.

    Then I can implement the solution lmu92 originally helped me with.

    Thank you all for the help, and wish me luck!
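For what it's worth, one common T-SQL way to build such a list (a sketch against the hypothetical Table1 layout from the question; the FOR XML PATH concatenation trick, with the delimiter hard-coded here):

```sql
-- Collapse duplicate (DocID, FieldName) values into one delimited string
-- so each combination survives as a single row before pivoting.
SELECT DocID,
       FieldName,
       STUFF((SELECT '; ' + t2.FieldValue
              FROM Table1 AS t2
              WHERE t2.DocID = t1.DocID
                AND t2.FieldName = t1.FieldName
              ORDER BY t2.FieldValue
              FOR XML PATH('')), 1, 2, '') AS FieldValues
FROM Table1 AS t1
GROUP BY DocID, FieldName;
```

The STUFF(..., 1, 2, '') simply strips the leading delimiter; swapping '; ' for another separator (or CHAR(9)) is a one-character change.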

  • huston.dunlap (4/20/2010)


    Thinking about it more, I think what I will do is turn the fields into a semicolon-delimited list, getting rid of the duplicate fields.

    Then I can implement the solution lmu92 originally helped me with.

    Thank you all for the help, and wish me luck!

    If you're going to do that, why not use the TAB character (CHAR(9) in T-SQL and a couple of other languages)? THAT way, if you ever do need to bring it into a spreadsheet, you won't have to mess with identifying delimiters and the like... it'll just happen.

    --Jeff Moden



  • Jeff Moden (4/20/2010)


    huston.dunlap (4/20/2010)


    Thinking about it more, I think what I will do is turn the fields into a semicolon-delimited list, getting rid of the duplicate fields.

    Then I can implement the solution lmu92 originally helped me with.

    Thank you all for the help, and wish me luck!

    If you're going to do that, why not use the TAB character (CHAR(9) in T-SQL and a couple of other languages)? THAT way, if you ever do need to bring it into a spreadsheet, you won't have to mess with identifying delimiters and the like... it'll just happen.

    Oh, the semicolon list was just an example; ultimately I'll let the user determine their delimiter.

    Again, thanks everyone!
