Normalize data using Pivot WITHOUT using Aggregate function

  • mister.magoo (12/30/2009)


    For the sake of it, a solution which copes with new "ColumnName" values by using dynamic sql (go ahead - pork chop me or whatever you do :-D)

    Heh... nah... no pork chops for you. I don't believe that dynamic SQL is all that bad a thing. Take a look at the following... I even published that it's a good thing when done correctly.

    http://qa.sqlservercentral.com/articles/Crosstab/65048/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • lobbymuncher (12/30/2009)


    Another overcomplicated effort from the peanut gallery.

    I like it. Nicely done.

    --Jeff Moden



  • Jeff Moden (12/30/2009)


    lobbymuncher (12/30/2009)


    Another overcomplicated effort from the peanut gallery.

    I like it. Nicely done.

    Ack... I may have to take that back. The code has a triangular join in it that causes 60 internal rows to be read from one of the 4 appearances of the table in the execution plan. It's not quite as bad as a full-blown triangular join in that it reads each row once for each group, whereas a real triangular join at the row level would be devastating here.
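    To make the "once per group" distinction concrete, here is a small Python sketch (not from the thread; the row and group counts are illustrative assumptions) comparing the pair counts a full triangular self-join examines against a version bounded to the current group:

```python
# Sketch: pair counts touched by a full triangular self-join versus one
# bounded to the current group, for n rows split into g equal groups.

def full_triangular(n):
    # every pair (i, j) with j > i is examined
    return n * (n - 1) // 2

def group_bounded(n, g):
    # each row is only matched against the other rows in its own group
    rows_per_group = n // g
    return g * (rows_per_group * (rows_per_group - 1) // 2)

print(full_triangular(900_000))         # hundreds of billions of pairs
print(group_bounded(900_000, 150_000))  # millions of pairs (6 rows/group)
```

    The bounded version grows linearly in the number of groups rather than quadratically in the row count, which is why the WHERE clause matters so much here.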

    --Jeff Moden



  • Justin,

    Please post the table creation statement and be sure to include any indexes you may have on the table.

    --Jeff Moden



  • Jeff Moden (12/30/2009)


    Jeff Moden (12/30/2009)


    lobbymuncher (12/30/2009)


    Another overcomplicated effort from the peanut gallery.

    I like it. Nicely done.

    Ack... I may have to take that back. The code has a triangular join in it that causes 60 internal rows to be read from one of the 4 appearances of the table in the execution plan. It's not quite as bad as a full-blown triangular join in that it reads each row once for each group, whereas a real triangular join at the row level would be devastating here.

    Nah... it's good. The WHERE clause keeps it from being a full "Triangular Join" but only if you have an index on ID.

    --Jeff Moden



  • Man, in the presence of a clustered index on the ID, that's some mighty fast code there, Chris.

    --Jeff Moden



  • Jeff Moden (12/30/2009)


    Man, in the presence of a clustered index on the ID, that's some mighty fast code there, Chris.

    :blush: thanks Jeff. I read your triangular join article!


    Low-hanging fruit picker and defender of the moggies

    For better assistance in answering your questions, please read this.

    Understanding and using APPLY, (I) and (II) - Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop - Jeff Moden

  • mister.magoo (12/30/2009)


    For the sake of it, a solution which copes with new "ColumnName" values by using dynamic sql (go ahead - pork chop me or whatever you do :-D)

    If you don't like dynamic SQL, then you could just use this to generate the query as and when the data requires a change... just change the "exec sp_executesql @sql" line to "print @sql"...

    --= Dump the data into a temp table as this uses dynamic sql and needs a "proper" table to read.
    IF OBJECT_ID('tempdb..#school') IS NULL
        SELECT ID, ColumnName, [Text]
        INTO #school
        FROM @school

    DECLARE @sql NVARCHAR(4000)
    SET @sql = ''

    ;WITH base(ID,ColName,Text,ColNum)
    AS
    (
        --= Get the data into some kind of order with the ID from the 'DestinationTableName' row for each set of data as the key
        --= And a column number to ascertain the correct sequence for the columns later
        SELECT s1.ID, Details.ColName, Details.Text, ROW_NUMBER() OVER(PARTITION BY s1.ID ORDER BY Details.ID) AS ColNum
        FROM #school s1
        OUTER APPLY (
            --= Get the associated rows for the current group/ID
            SELECT ID, ColumnName, [Text]
            FROM #school s2
            WHERE s2.ID > s1.ID
            AND s2.ID < ISNULL(
                (
                    --= Get the ID of the next group of data
                    SELECT TOP 1 ID
                    FROM #school s3
                    WHERE s3.ID > s1.ID AND s3.ColumnName = 'DestinationTableName'
                    ORDER BY s3.ID
                ),
                (
                    --= Get the last row of data when there are no more groups
                    SELECT MAX(ID) FROM #school
                )
            )
        ) AS Details(ID,ColName,TEXT)
        WHERE s1.ColumnName = 'DestinationTableName'
    ), cols(Name,POSITION)
    AS
    (
        --= Get the distinct ColumnName values and positions (assuming the data will always place them in order in the first place)
        SELECT ColName, MAX(ColNum)
        FROM base
        GROUP BY ColName
    ), colList(list)
    AS
    (
        --= Use FOR XML PATH('') trick to get a list of columns
        SELECT STUFF((
            SELECT ', ' + QUOTENAME(Name,'[')
            FROM cols
            ORDER BY POSITION
            FOR XML PATH('')),1,1,'')
    )
    SELECT @sql = list
    FROM colList

    --= Now build a dynamic SQL to PIVOT the data
    SET @sql = N'
    ;WITH base(BASEID,ColName,TEXT)
    AS
    (
        SELECT s1.ID, Details.ColName, Details.Text
        FROM #school s1
        OUTER APPLY (
            SELECT ID, ColumnName, [Text]
            FROM #school s2
            WHERE s2.ID > s1.ID
            AND s2.ID < ISNULL(
                (
                    SELECT TOP 1 ID
                    FROM #school s3
                    WHERE s3.ID > s1.ID AND s3.ColumnName = ''DestinationTableName''
                    ORDER BY s3.ID
                ),
                (
                    SELECT MAX(ID) FROM #school
                )
            )
        ) AS Details(ID,ColName,TEXT)
        WHERE s1.ColumnName = ''DestinationTableName''
    )
    SELECT BASEID, ' + @sql + '
    FROM base
    PIVOT (MAX([Text]) FOR ColName IN (' + @sql + ')) AS PVT
    ORDER BY BASEID'

    EXEC sp_executesql @sql

    IF NOT (OBJECT_ID('tempdb..#school') IS NULL)
        DROP TABLE #school

    I make no claims for performance as I have not tried to optimise it, but it demonstrates a technique that could be used...

    MM

    This query did exactly what I needed! However, the execution time is HORRIFIC -- 1 day, 5 hours, and 5 minutes to execute. For the record, the table this query operates on has approx. 900,000 rows, which are being pivoted out into 46 possible columns. The indexes that I have on the @school table are below:

    create nonclustered index idx1 on @school (ColumnName)
    create unique clustered index idx2 on @school (ID)

    Are there any improvements you can think of that I could incorporate into this query? I'm not that familiar with indexes, but have at least a basic understanding of them.
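    [Editor's note: to clarify the reshaping being discussed, here is a small Python sketch, not from the thread, of what the T-SQL does. Rows are (id, column_name, text) triples; a row whose column_name is 'DestinationTableName' starts a new group, and each group becomes one output record. The sample values are hypothetical.]

```python
# Model the key/value rows: a 'DestinationTableName' row delimits each group.
rows = [
    (1, 'DestinationTableName', 'DimClass'),
    (2, 'SchoolName',           'Northside'),
    (3, 'City',                 'Springfield'),
    (4, 'DestinationTableName', 'DimPupil'),
    (5, 'SchoolName',           'Southside'),
]

records = []
for rid, col, text in rows:
    if col == 'DestinationTableName':
        records.append({'BASEID': rid})   # marker row opens a new record
    else:
        records[-1][col] = text           # attach value to the current record

print(records)
# [{'BASEID': 1, 'SchoolName': 'Northside', 'City': 'Springfield'},
#  {'BASEID': 4, 'SchoolName': 'Southside'}]
```

    A single ordered pass over the rows does this in linear time, which is the shape of result the PIVOT solutions in the thread are all working towards.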

  • Hi Justin

    This solution has a triangular join, notorious for scaling poorly - but over a day! Blimey. The solution I posted earlier doesn't, so it may be worth a try. If I were in your shoes, I'd hard-code those 46 potential columns, because it will take less time than dicking around with dynamic SQL.

    Cheers

    ChrisM

    "Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms." - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Justin James (1/4/2010)


    mister.magoo (12/30/2009)


    For the sake of it, a solution which copes with new "ColumnName" values by using dynamic sql (go ahead - pork chop me or whatever you do :-D)

    If you don't like dynamic SQL, then you could just use this to generate the query as and when the data requires a change... just change the "exec sp_executesql @sql" line to "print @sql"...

    ...

    I make no claims for performance as I have not tried to optimise it, but it demonstrates a technique that could be used...

    MM

    This query did exactly what I needed! However, the execution time is HORRIFIC -- 1 day, 5 hours, and 5 minutes to execute. For the record, the table this query operates on has approx. 900,000 rows, which are being pivoted out into 46 possible columns. The indexes that I have on the @school table are below:

    create nonclustered index idx1 on @school (ColumnName)
    create unique clustered index idx2 on @school (ID)

    Are there any improvements you can think of that I could incorporate into this query? I'm not that familiar with indexes, but have at least a basic understanding of them.

    Is there any specific reason why you didn't use lobbymuncher's solution or mine?

    Both are supposed to perform pretty well - even better with a clustered index on the ID, as Jeff mentioned in an earlier post.

    It makes me wonder why you went for a solution that had a performance disclaimer attached in the first place...

    As a rough duration estimate (depending on the system and, of course, on the index applied, as mentioned before), we should be talking about seconds, not hours, or even DAYS!!!



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

    I agree - that is HORRENDOUS!

    If you can get away with a fixed query then do so (one of the other examples).

    If you have to cater for dynamic columns then let us know and someone will come up with a quick scalable method.

    (I am busy right now but have some ideas)

    MM




    Forum Etiquette: How to post Reporting Services problems
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
    How to Post Performance Problems - by Gail Shaw

  • After working with MM's query, I was able to get results to come back in 36 seconds. Below is the final SQL that I was able to get working:

    --= Dump the data into a temp table as this uses dynamic sql and needs a "proper" table to read.
    IF OBJECT_ID('tempdb..#school') IS NULL
        SELECT UniqueID AS [ID], ColumnName, [Text]
        INTO #school
        FROM @school
        ORDER BY UniqueID

    CREATE NONCLUSTERED INDEX idx1 ON #school (ColumnName)
    CREATE UNIQUE CLUSTERED INDEX idx2 ON #school (ID)
    UPDATE STATISTICS #school

    DECLARE @sql NVARCHAR(4000)
    SET @sql = ''

    ;WITH base(ID,ColName,Text,ColNum)
    AS
    (
        --= Get the data into some kind of order with the ID from the 'DestinationTableName' row for each set of data as the key
        --= And a column number to ascertain the correct sequence for the columns later
        SELECT s1.ID, Details.ColName, Details.Text, ROW_NUMBER() OVER(PARTITION BY s1.ID ORDER BY Details.ID) AS ColNum
        FROM #school s1
        OUTER APPLY (
            --= Get the associated rows for the current group/ID
            SELECT ID, ColumnName, [Text]
            FROM #school s2
            WHERE s2.ID > s1.ID
            AND s2.ID < (
                (
                    --= Get the ID of the next group of data
                    SELECT TOP 1 ISNULL(ID, (SELECT MAX(ID) FROM #school))
                    FROM #school s3
                    WHERE s3.ID > s1.ID AND s3.ColumnName = 'DestinationTableName'
                    ORDER BY s3.ID
                )
            )
        ) AS Details(ID,ColName,TEXT)
        WHERE s1.ColumnName = 'DestinationTableName'
    ), cols(Name,POSITION)
    AS
    (
        --= Get the distinct ColumnName values and positions (assuming the data will always place them in order in the first place)
        SELECT ColName, MAX(ColNum)
        FROM base
        GROUP BY ColName
    ), colList(list)
    AS
    (
        --= Use FOR XML PATH('') trick to get a list of columns
        SELECT STUFF((
            SELECT ', ' + QUOTENAME(Name,'[')
            FROM cols
            ORDER BY POSITION
            FOR XML PATH('')),1,1,'')
    )
    SELECT @sql = list
    FROM colList

    --= Now build a dynamic SQL to PIVOT the data
    SET @sql = N'
    ;WITH base(BASEID,ColName,TEXT)
    AS
    (
        SELECT s1.ID, Details.ColName, Details.Text
        FROM #school s1
        OUTER APPLY (
            SELECT ID, ColumnName, [Text]
            FROM #school s2
            WHERE s2.ID > s1.ID
            AND s2.ID < (
                (
                    SELECT TOP 1 ISNULL(ID, (SELECT MAX(ID) FROM #school))
                    FROM #school s3
                    WHERE s3.ID > s1.ID AND s3.ColumnName = ''DestinationTableName''
                    ORDER BY s3.ID
                )
            )
        ) AS Details(ID,ColName,TEXT)
        WHERE s1.ColumnName = ''DestinationTableName''
    )
    SELECT BASEID, ' + @sql + '
    FROM base
    PIVOT (MAX([Text]) FOR ColName IN (' + @sql + ')) AS PVT
    ORDER BY BASEID'

    --print @sql
    EXEC sp_executesql @sql

    IF NOT (OBJECT_ID('tempdb..#school') IS NULL)
        DROP TABLE #school

    The only changes that I made were:

    1. Added indexes to the #school table to increase performance:

       create nonclustered index idx1 on #school (ColumnName)
       create unique clustered index idx2 on #school (ID)

    2. Modified MM's original query, removing the ISNULL function and incorporating it into the subquery.

    MANY MANY thanks for all of the help that you all have provided!! These forums are an indispensable source of information! πŸ˜€

  • The reason why it took so long is because, in the presence of a triangular join, it must spawn over 405,000,450,000 rows. To put it more succinctly, it spawned .4 TERA rows!!!!!!! I'm surprised it didn't take a week. πŸ˜‰
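    [Editor's note: Jeff's figure is exactly the triangular-join formula n*(n+1)/2 applied to the 900,000-row table. A one-line check:]

```python
# Triangular-join row count for Justin's 900,000-row table:
# n*(n+1)/2 pairs, matching the 405,000,450,000 figure quoted above.
n = 900_000
print(n * (n + 1) // 2)  # 405000450000
```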

    --Jeff Moden



  • Jeff Moden (1/4/2010)


    The reason why it took so long is because, in the presence of a triangular join, it must spawn over 405,000,450,000 rows. To put it more succinctly, it spawned .4 TERA rows!!!!!!! I'm surprised it didn't take a week. πŸ˜‰

    Some people are so reluctant to learn. Here Jeff, take this large bag of frozen pork chops.


  • Justin,

    I think you may still have some work to do, as the change you made will exclude the final grouping - I think...

    Moving the ISNULL the way you have will not solve the problem of finding the records belonging to the final group.

    (
        (
            --= Get the ID of the next group of data
            SELECT TOP 1 ISNULL(ID, (SELECT MAX(ID) FROM #school))
            FROM #school s3
            WHERE s3.ID > s1.ID AND s3.ColumnName = 'DestinationTableName'
            ORDER BY s3.ID
        )
    )

    DOES NOT return the same information as

    ISNULL(
        (
            --= Get the ID of the next group of data
            SELECT TOP 1 ID
            FROM #school s3
            WHERE s3.ID > s1.ID AND s3.ColumnName = 'DestinationTableName'
            ORDER BY s3.ID
        ),
        (
            --= Get the last row of data when there are no more groups
            SELECT MAX(ID) FROM #school
        )
    )

    But, if you just declare a variable and select the MAX(ID) into that outside the main query, you should get the same saving...

    -- outside the CTE:
    DECLARE @MaxID BIGINT
    SELECT @MaxID = MAX(ID) FROM #school

    -- inside - where you made your change
    ISNULL(
        (
            --= Get the ID of the next group of data
            SELECT TOP 1 ID
            FROM #school s3
            WHERE s3.ID > s1.ID AND s3.ColumnName = 'DestinationTableName'
            ORDER BY s3.ID
        ),
        (
            --= Get the last row of data when there are no more groups
            @MaxID
        )
    )

    There should be further savings possible later....
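    [Editor's note: the distinction MM is drawing can be modeled outside SQL. A scalar subquery that matches no rows evaluates to NULL as a whole, so an ISNULL inside the SELECT list never gets a chance to run; only an ISNULL wrapped around the subquery supplies the fallback. A Python sketch of the two shapes, with hypothetical IDs:]

```python
# SELECT TOP 1 ISNULL(ID, fallback) FROM ... WHERE <predicate>
def top1_isnull_inside(matching_ids, fallback):
    if not matching_ids:
        return None          # no row => subquery is NULL; fallback unused
    return matching_ids[0]   # ID itself is never NULL, so ISNULL is a no-op

# ISNULL((SELECT TOP 1 ID FROM ... WHERE <predicate>), fallback)
def isnull_around_top1(matching_ids, fallback):
    return matching_ids[0] if matching_ids else fallback

max_id = 99  # stand-in for SELECT MAX(ID) FROM #school
print(top1_isnull_inside([], max_id))   # None -> rows of the final group lost
print(isnull_around_top1([], max_id))   # 99   -> final group kept
```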

    MM




    Viewing 15 posts - 31 through 45 (of 50 total)
