Metrics Table or Performance Tuning

  • Hi,

    Fairly new to managing large datasets - we have a large table - 16m rows.

    We load around 800k rows each month and store data on a monthly basis.

We want to display high-level, aggregated metrics to users, but find that our queries are quite sluggish.

Would you look at performance tuning, or at creating a metrics table, for this kind of project?

    Nick

  • nick.latocha (8/14/2015)


    Hi,

    Fairly new to managing large datasets - we have a large table - 16m rows.

    We load around 800k rows each month and store data on a monthly basis.

We want to display high-level, aggregated metrics to users, but find that our queries are quite sluggish.

Would you look at performance tuning, or at creating a metrics table, for this kind of project?

    Nick

    If the data is only changing once a month as you suggest, then it's no contest: precalculate the aggregated metrics into a permanent table, and recalculate each time the source data is refreshed.

    As an aside, 16 million rows is trivial and with adequate indexing etc you should get results back in a couple of seconds or less (finger in the air).
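A minimal sketch of that approach, assuming a hypothetical rpt.MonthlyMetrics table (the source table and column names are taken from later in this thread; adjust to your actual schema):

    -- one-time setup: a permanent table for the precalculated aggregates
    CREATE TABLE rpt.MonthlyMetrics (
        reportYear int NOT NULL,
        reportMonth tinyint NOT NULL,
        totalVolume bigint NOT NULL,
        totalCost decimal(18,2) NOT NULL,
        CONSTRAINT PK_MonthlyMetrics PRIMARY KEY (reportYear, reportMonth)
    );

    -- after each monthly load: rebuild the aggregates from the source table
    TRUNCATE TABLE rpt.MonthlyMetrics;
    INSERT INTO rpt.MonthlyMetrics (reportYear, reportMonth, totalVolume, totalCost)
    SELECT reportYear, reportMonth, SUM(volume), SUM(cost)
    FROM cts.exception_Main
    GROUP BY reportYear, reportMonth;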

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Probably both.

If the data changes once a month and is queried often, then you can add a job after it's loaded to pre-aggregate into tables for later querying, and then tune the queries as well. 16 million rows isn't particularly large, so I'd definitely suggest tuning as well as pre-aggregating, for when the data volumes grow into something genuinely large.
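One hedged way to implement the "job after load" step is to wrap the refresh in a stored procedure and call it as the final step of the monthly load job (rpt.RefreshMonthlyMetrics and the table it fills are hypothetical names from the sketch in the earlier reply):

    CREATE PROCEDURE rpt.RefreshMonthlyMetrics
    AS
    BEGIN
        SET NOCOUNT ON;
        -- same TRUNCATE/INSERT refresh as sketched earlier in the thread
        TRUNCATE TABLE rpt.MonthlyMetrics;
        INSERT INTO rpt.MonthlyMetrics (reportYear, reportMonth, totalVolume, totalCost)
        SELECT reportYear, reportMonth, SUM(volume), SUM(cost)
        FROM cts.exception_Main
        GROUP BY reportYear, reportMonth;
    END;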

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
If you (almost) always process data by month (or by other date ranges), then cluster the table on the related date column. You should see far fewer performance issues then.

    Edit: Agree that 16M rows isn't a priori a lot, but if you are constantly scanning the entire 16M rows, it could add up very quickly.
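To illustrate - a sketch, assuming the table is reclustered on exceptionDateTime as suggested (column names taken from later in this thread) - a typical month-range query then becomes a narrow clustered index seek instead of a scan of all 16M rows:

    -- reads only one month's rows when the table is clustered on exceptionDateTime
    SELECT SUM(volume) AS Vol, SUM(cost) AS effort
    FROM cts.exception_Main
    WHERE exceptionDateTime >= '20150701'
      AND exceptionDateTime <  '20150801';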

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

This could be where we are going wrong. We have clustered on the unique ID column. Let me try this out in our development environment.

  • nick.latocha (8/16/2015)


This could be where we are going wrong. We have clustered on the unique ID column. Let me try this out in our development environment.

    You really need to post the DDL for the table and the related constraints and indexes because having a clustered index on a unique ID column is likely not a problem. In fact, it may be essential to keeping the performance up during the inserts. But, "It Depends" on what else is happening. Is there anything else in the data that would uniquely identify each row?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Attached.

    Main table, queue table, indexes and execution plan.

    Takes around 16 seconds to execute.

    Thanks in advance

  • This is so typical of these types of situations. As usual, there is a vastly better clustering key available than identity.

    The (first) clustering key for this table should be exceptionDateTime, period. In this case, that's probably the only key you need, although ( exceptionDateTime, productArea ) could perform better depending on your specific query requirements. If you prefer, you can add identity to the end to make it unique (which will of course be required if you want to classify it as a "primary key" rather than just a clustering index).

    You should then review the nonclustered index to see if it's still needed -- my guess is it's not worth it to maintain that as well.
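A sketch of that change, assuming the identity column is named ID and the current clustered primary key constraint is named PK_exception_Main (both names are hypothetical; adjust to the actual DDL):

    -- drop the existing clustered PK, then recluster on the date plus identity for uniqueness
    ALTER TABLE cts.exception_Main DROP CONSTRAINT PK_exception_Main;
    CREATE UNIQUE CLUSTERED INDEX CIX_exception_Main
        ON cts.exception_Main (exceptionDateTime, ID);
    -- optionally re-add the PK as nonclustered if other tables reference it
    ALTER TABLE cts.exception_Main ADD CONSTRAINT PK_exception_Main PRIMARY KEY NONCLUSTERED (ID);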

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

Thanks - this definitely improved things. The query alone is now much faster - however, the join with the exception table is very slow.

SELECT m.reportMonth
    ,m.reportYear
    ,ex.service
    ,SUM(m.volume) AS Vol
    ,SUM(m.cost) AS effort
FROM [cts].[exception_Main] m WITH (NOLOCK)
LEFT JOIN cts.map_Exception ex ON m.queueID = ex.queueID
WHERE m.exceptionDateTime >= GETDATE() - 365
    AND m.productArea = ('FXMM')
    AND m.reportable = 'Y'
GROUP BY m.reportMonth
    ,m.reportYear
    ,ex.service

Even with a PK/FK relationship, the join slows things down massively. My solution was to aggregate the "main" data first into a temp table - then join the exception table afterwards. Is this the best way - or should it be possible to do it using the original query?
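For reference, a sketch of that workaround (#agg is a hypothetical temp table name; everything else follows the query above):

    -- aggregate the large table first...
    SELECT m.reportMonth, m.reportYear, m.queueID,
           SUM(m.volume) AS Vol, SUM(m.cost) AS effort
    INTO #agg
    FROM [cts].[exception_Main] m WITH (NOLOCK)
    WHERE m.exceptionDateTime >= GETDATE() - 365
      AND m.productArea = 'FXMM'
      AND m.reportable = 'Y'
    GROUP BY m.reportMonth, m.reportYear, m.queueID;

    -- ...then join the small mapping table afterwards
    SELECT a.reportMonth, a.reportYear, ex.service,
           SUM(a.Vol) AS Vol, SUM(a.effort) AS effort
    FROM #agg a
    LEFT JOIN cts.map_Exception ex ON a.queueID = ex.queueID
    GROUP BY a.reportMonth, a.reportYear, ex.service;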

-- create this index
CREATE INDEX ix_Helper ON [cts].[exception_Main]
    (reportYear, reportMonth, exceptionDateTime, productArea)
    INCLUDE (volume, cost, queueID, reportable);

-- run this query, capture the actual execution plan and post back
SELECT m.reportMonth
    ,m.reportYear
    ,ex.service
    ,SUM(m.volume) AS Vol
    ,SUM(m.cost) AS effort
FROM [cts].[exception_Main] m WITH (NOLOCK)
LEFT LOOP JOIN cts.map_Exception ex -- ##
    ON m.queueID = ex.queueID
WHERE m.exceptionDateTime >= GETDATE() - 365
    AND m.productArea = ('FXMM')
    AND m.reportable = 'Y'
GROUP BY m.reportYear -- ##
    ,m.reportMonth -- ##
    ,ex.service

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

See attached. It looks like it's not using the new index, and is instead using the previous clustered index.

  • Interesting. Edit: The only reason that particular join should be slow is the large number of lookups, since the map_exception table only has a few hundred rows. End-Edit.

At any rate, first create the index below. It should have only a minor effect on its own, but this may be one of those cases where a nonclustered index on the same column as the clustered index is worthwhile.

Then test the query below (I went back to the original query; naturally, make any adjustments you need, and the WITH (NOLOCK) is entirely at your discretion). No guarantees, but I think it should perform better by grouping before doing the lookup.

[Btw, you should increase the FILLFACTOR on the map_Exception table as well, and very likely on the exception_Main table too - a sketch follows the query below. As with the clustering index keys, there's no universally good "default" value for fill factor; it should be chosen index by index.]

CREATE UNIQUE NONCLUSTERED INDEX [IX_map_Exception]
    ON cts.map_Exception ( queueID, service )
    WITH ( FILLFACTOR = 99 );

SELECT
    main.reportMonth
    ,main.reportYear
    ,ex.service
    ,main.zzzCustomer
    ,SUM(main.Vol) AS Vol
    ,SUM(main.effort) AS effort
FROM (
    SELECT
        em.reportMonth
        ,em.reportYear
        ,em.queueID
        ,em.zzzCustomer
        ,SUM(em.volume) AS Vol
        ,SUM(em.cost) AS effort
    FROM [cts].[exception_Main] em WITH (NOLOCK)
    WHERE em.exceptionDateTime >= GETDATE() - 365
        AND em.productArea = ('FXMM')
        AND em.reportable = 'Y'
    GROUP BY
        em.reportMonth
        ,em.reportYear
        ,em.queueID
        ,em.zzzCustomer
) AS main
LEFT JOIN cts.map_Exception ex WITH (NOLOCK) ON main.queueID = ex.queueID
GROUP BY
    main.reportMonth
    ,main.reportYear
    ,ex.service
    ,main.zzzCustomer
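A sketch of the fill factor change mentioned above (the values here are purely illustrative; choose them index by index):

    -- rebuild all indexes on each table with an explicit fill factor
    ALTER INDEX ALL ON cts.map_Exception REBUILD WITH (FILLFACTOR = 99);
    ALTER INDEX ALL ON cts.exception_Main REBUILD WITH (FILLFACTOR = 90);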

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

Good catch, Scott. I'd write it like this, which is more or less the same:

SELECT m.reportMonth,
    m.reportYear,
    ex.[service],
    SUM(m.Vol) AS Vol,
    SUM(m.Effort) AS effort
FROM (
    SELECT reportMonth,
        reportYear,
        queueID,
        SUM(volume) AS Vol,
        SUM(cost) AS Effort
    FROM [cts].[exception_Main] -- 14.5M rows
    WHERE exceptionDateTime >= GETDATE() - 365
        AND productArea = ('FXMM')
        AND reportable = 'Y'
    GROUP BY
        reportYear,
        reportMonth,
        queueID
) m
LEFT JOIN cts.map_Exception ex -- 420 rows
    ON m.queueID = ex.queueID
GROUP BY
    m.reportYear,
    m.reportMonth,
    ex.[service]

    The inner select can be accelerated with this index:

CREATE INDEX ix_Helper ON [cts].[exception_Main]
    (productArea, reportable, reportYear, reportMonth, queueID)
    INCLUDE (volume, cost, exceptionDateTime)

    The rationale for the index goes like this:

    Once the start point has been found in the index using a seek, scan until the end, filtering exceptionDateTime with a residual predicate. Because productArea and reportable are constants, the data comes off in the correct order to support a streaming aggregate.

    A simple test harness shows that the theory works and it's mighty quick too:

-- set up sample data
SELECT
    ID,
    reportYear = YEAR(ReportDate),
    reportMonth = MONTH(ReportDate),
    reportable = CASE WHEN ABS(CHECKSUM(NEWID()))%2 = 1 THEN 'Y' ELSE 'N' END,
    queueID = ABS(CHECKSUM(NEWID()))%420,
    exceptionDateTime = DATEADD(DAY, ABS(CHECKSUM(NEWID()))%3, ReportDate),
    productArea = CASE WHEN ABS(CHECKSUM(NEWID()))%2 = 1 THEN 'FXMM' ELSE 'N' END,
    volume = ABS(CHECKSUM(NEWID()))%20,
    cost = ABS(CHECKSUM(NEWID()))%30
INTO #Sample
FROM (
    SELECT ID, ReportDate = DATEADD(MINUTE, 0-ID/20, GETDATE())
    FROM (
        SELECT TOP(14500000)
            ID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
        FROM sys.columns a, sys.columns b, sys.columns c, sys.columns d, sys.columns e
    ) d
) e

CREATE UNIQUE CLUSTERED INDEX ucx_Sample ON #Sample (ID)

CREATE INDEX ix_Helper ON #Sample
    (productArea, reportable, reportYear, reportMonth, queueID)
    INCLUDE (volume, cost, exceptionDateTime)

-- run the (inner) query
SELECT
    reportMonth,
    reportYear,
    queueID,
    SUM(volume) AS Vol,
    SUM(cost) AS Effort
FROM #Sample
WHERE exceptionDateTime >= GETDATE() - 365
    AND productArea = ('FXMM')
    AND reportable = 'Y'
GROUP BY
    reportYear,
    reportMonth,
    queueID

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Wow!

    That's a dramatic increase! Thanks so much

  • ChrisM@Work (8/21/2015)

    The inner select can be accelerated with this index:

CREATE INDEX ix_Helper ON [cts].[exception_Main]
    (productArea, reportable, reportYear, reportMonth, queueID)
    INCLUDE (volume, cost, exceptionDateTime)

But with the best clustered index now on the table, you don't need this extra, fairly large index. You should get as good a response time -- possibly even better, depending on the details -- from a clustered index on exceptionDateTime. That's the huge advantage -- edit: of the best clustered index -- that you don't have to create gazillions of covering indexes for so many queries.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!
