Metrics Table or Performance Tuning

  • Hi,

    Fairly new to managing large datasets - we have a large table - 16m rows.

    We load around 800k rows each month and store data on a monthly basis.

We want to display high-level, aggregated metrics to users, but find that our queries are quite sluggish.

Would you look at performance tuning, or at creating a metrics table, for this kind of project?

    Nick

  • nick.latocha (8/14/2015)


    Hi,

    Fairly new to managing large datasets - we have a large table - 16m rows.

    We load around 800k rows each month and store data on a monthly basis.

We want to display high-level, aggregated metrics to users, but find that our queries are quite sluggish.

Would you look at performance tuning, or at creating a metrics table, for this kind of project?

    Nick

    If the data is only changing once a month as you suggest, then it's no contest: precalculate the aggregated metrics into a permanent table, and recalculate each time the source data is refreshed.

    As an aside, 16 million rows is trivial and with adequate indexing etc you should get results back in a couple of seconds or less (finger in the air).
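A minimal sketch of that approach, assuming a hypothetical rpt.MonthlyMetrics table (the source table and column names are taken from later in this thread; adjust to your actual schema):

    -- one-time setup: a permanent table for the precalculated aggregates
    CREATE TABLE rpt.MonthlyMetrics (
        reportYear int NOT NULL,
        reportMonth tinyint NOT NULL,
        totalVolume bigint NOT NULL,
        totalCost decimal(18,2) NOT NULL,
        CONSTRAINT PK_MonthlyMetrics PRIMARY KEY (reportYear, reportMonth)
    );

    -- after each monthly load: rebuild the aggregates from the source table
    TRUNCATE TABLE rpt.MonthlyMetrics;
    INSERT INTO rpt.MonthlyMetrics (reportYear, reportMonth, totalVolume, totalCost)
    SELECT reportYear, reportMonth, SUM(volume), SUM(cost)
    FROM cts.exception_Main
    GROUP BY reportYear, reportMonth;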

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Probably both.

If the data changes once a month and is queried often, then you can add a job after it's loaded to pre-aggregate into tables for later querying, and then tune the queries as well. 16 million rows isn't particularly large, so I'd definitely suggest tuning as well as pre-aggregating, for when the data volumes grow into something genuinely large.
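One hedged way to implement the "job after load" step is to wrap the refresh in a stored procedure and call it as the final step of the monthly load job (rpt.RefreshMonthlyMetrics and the table it fills are hypothetical names from the sketch in the earlier reply):

    CREATE PROCEDURE rpt.RefreshMonthlyMetrics
    AS
    BEGIN
        SET NOCOUNT ON;
        -- same TRUNCATE/INSERT refresh as sketched earlier in the thread
        TRUNCATE TABLE rpt.MonthlyMetrics;
        INSERT INTO rpt.MonthlyMetrics (reportYear, reportMonth, totalVolume, totalCost)
        SELECT reportYear, reportMonth, SUM(volume), SUM(cost)
        FROM cts.exception_Main
        GROUP BY reportYear, reportMonth;
    END;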

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
If you (almost) always process data by month (or by other date ranges), then cluster the table on the related date column. You should see far fewer performance issues then.

    Edit: Agree that 16M rows isn't a priori a lot, but if you are constantly scanning the entire 16M rows, it could add up very quickly.
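To illustrate - a sketch, assuming the table is reclustered on exceptionDateTime as suggested (column names taken from later in this thread) - a typical month-range query then becomes a narrow clustered index seek instead of a scan of all 16M rows:

    -- reads only one month's rows when the table is clustered on exceptionDateTime
    SELECT SUM(volume) AS Vol, SUM(cost) AS effort
    FROM cts.exception_Main
    WHERE exceptionDateTime >= '20150701'
      AND exceptionDateTime <  '20150801';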

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

This could be where we are going wrong. We have clustered on the unique ID column. Let me try this out in our development environment.

  • nick.latocha (8/16/2015)


This could be where we are going wrong. We have clustered on the unique ID column. Let me try this out in our development environment.

    You really need to post the DDL for the table and the related constraints and indexes because having a clustered index on a unique ID column is likely not a problem. In fact, it may be essential to keeping the performance up during the inserts. But, "It Depends" on what else is happening. Is there anything else in the data that would uniquely identify each row?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Attached.

    Main table, queue table, indexes and execution plan.

    Takes around 16 seconds to execute.

    Thanks in advance

  • This is so typical of these types of situations. As usual, there is a vastly better clustering key available than identity.

    The (first) clustering key for this table should be exceptionDateTime, period. In this case, that's probably the only key you need, although ( exceptionDateTime, productArea ) could perform better depending on your specific query requirements. If you prefer, you can add identity to the end to make it unique (which will of course be required if you want to classify it as a "primary key" rather than just a clustering index).

    You should then review the nonclustered index to see if it's still needed -- my guess is it's not worth it to maintain that as well.
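A sketch of that change, assuming the identity column is named ID and the current clustered primary key constraint is named PK_exception_Main (both names are hypothetical; adjust to the actual DDL):

    -- drop the existing clustered PK, then recluster on the date plus identity for uniqueness
    ALTER TABLE cts.exception_Main DROP CONSTRAINT PK_exception_Main;
    CREATE UNIQUE CLUSTERED INDEX CIX_exception_Main
        ON cts.exception_Main (exceptionDateTime, ID);
    -- optionally re-add the PK as nonclustered if other tables reference it
    ALTER TABLE cts.exception_Main ADD CONSTRAINT PK_exception_Main PRIMARY KEY NONCLUSTERED (ID);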

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

Thanks - this definitely improved things. The query alone is now much faster - however, the join with the exception table is very slow.

SELECT m.reportMonth
    ,m.reportYear
    ,ex.service
    ,SUM(m.volume) AS Vol
    ,SUM(m.cost) AS effort
FROM [cts].[exception_Main] m WITH (NOLOCK)
LEFT JOIN cts.map_Exception ex ON m.queueID = ex.queueID
WHERE m.exceptionDateTime >= GETDATE() - 365
    AND m.productArea = ('FXMM')
    AND m.reportable = 'Y'
GROUP BY m.reportMonth
    ,m.reportYear
    ,ex.service

Even with a PK/FK relationship, the join slows things down massively. My solution was to aggregate the "main" data first into a temp table - then join the exception table afterwards. Is this the best way - or should it be possible to do it using the original query?
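For reference, a sketch of that workaround (#agg is a hypothetical temp table name; everything else follows the query above):

    -- aggregate the large table first...
    SELECT m.reportMonth, m.reportYear, m.queueID,
           SUM(m.volume) AS Vol, SUM(m.cost) AS effort
    INTO #agg
    FROM [cts].[exception_Main] m WITH (NOLOCK)
    WHERE m.exceptionDateTime >= GETDATE() - 365
      AND m.productArea = 'FXMM'
      AND m.reportable = 'Y'
    GROUP BY m.reportMonth, m.reportYear, m.queueID;

    -- ...then join the small mapping table afterwards
    SELECT a.reportMonth, a.reportYear, ex.service,
           SUM(a.Vol) AS Vol, SUM(a.effort) AS effort
    FROM #agg a
    LEFT JOIN cts.map_Exception ex ON a.queueID = ex.queueID
    GROUP BY a.reportMonth, a.reportYear, ex.service;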

-- create this index
CREATE INDEX ix_Helper ON [cts].[exception_Main]
    (reportYear, reportMonth, exceptionDateTime, productArea)
    INCLUDE (volume, cost, queueID, reportable);

-- run this query, capture the actual execution plan and post back
SELECT m.reportMonth
    ,m.reportYear
    ,ex.service
    ,SUM(m.volume) AS Vol
    ,SUM(m.cost) AS effort
FROM [cts].[exception_Main] m WITH (NOLOCK)
LEFT LOOP JOIN cts.map_Exception ex -- ##
    ON m.queueID = ex.queueID
WHERE m.exceptionDateTime >= GETDATE() - 365
    AND m.productArea = ('FXMM')
    AND m.reportable = 'Y'
GROUP BY m.reportYear -- ##
    ,m.reportMonth -- ##
    ,ex.service

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

See attached. It looks like it's not using the new index, and is instead using the previous clustered index.

  • Interesting. Edit: The only reason that particular join should be slow is the large number of lookups, since the map_exception table only has a few hundred rows. End-Edit.

At any rate, first create the index below. It should have only a minor effect on its own, but this may be one of those cases where a nonclustered index on the same column as the clustered index is worthwhile.

Then test the query below (I went back to the original query; naturally, make any adjustments you need, and the WITH (NOLOCK) is entirely at your discretion). No guarantees, but I think it should perform better by grouping before doing the lookup.

[Btw, you should increase the FILLFACTOR on the map_Exception table as well, and very likely on the exception_Main table too - a sketch follows the query below. As with the clustering index keys, there's no universally good "default" value for fill factor; it should be chosen index by index.]

CREATE UNIQUE NONCLUSTERED INDEX [IX_map_Exception]
    ON cts.map_Exception ( queueID, service )
    WITH ( FILLFACTOR = 99 );

SELECT
    main.reportMonth
    ,main.reportYear
    ,ex.service
    ,main.zzzCustomer
    ,SUM(main.Vol) AS Vol
    ,SUM(main.effort) AS effort
FROM (
    SELECT
        em.reportMonth
        ,em.reportYear
        ,em.queueID
        ,em.zzzCustomer
        ,SUM(em.volume) AS Vol
        ,SUM(em.cost) AS effort
    FROM [cts].[exception_Main] em WITH (NOLOCK)
    WHERE em.exceptionDateTime >= GETDATE() - 365
        AND em.productArea = ('FXMM')
        AND em.reportable = 'Y'
    GROUP BY
        em.reportMonth
        ,em.reportYear
        ,em.queueID
        ,em.zzzCustomer
) AS main
LEFT JOIN cts.map_Exception ex WITH (NOLOCK) ON main.queueID = ex.queueID
GROUP BY
    main.reportMonth
    ,main.reportYear
    ,ex.service
    ,main.zzzCustomer
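A sketch of the fill factor change mentioned above (the values here are purely illustrative; choose them index by index):

    -- rebuild all indexes on each table with an explicit fill factor
    ALTER INDEX ALL ON cts.map_Exception REBUILD WITH (FILLFACTOR = 99);
    ALTER INDEX ALL ON cts.exception_Main REBUILD WITH (FILLFACTOR = 90);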

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

Good catch, Scott. I'd write it like this, which is more or less the same:

SELECT m.reportMonth,
    m.reportYear,
    ex.[service],
    SUM(m.Vol) AS Vol,
    SUM(m.Effort) AS effort
FROM (
    SELECT reportMonth,
        reportYear,
        queueID,
        SUM(volume) AS Vol,
        SUM(cost) AS Effort
    FROM [cts].[exception_Main] -- 14.5M rows
    WHERE exceptionDateTime >= GETDATE() - 365
        AND productArea = ('FXMM')
        AND reportable = 'Y'
    GROUP BY
        reportYear,
        reportMonth,
        queueID
) m
LEFT JOIN cts.map_Exception ex -- 420 rows
    ON m.queueID = ex.queueID
GROUP BY
    m.reportYear,
    m.reportMonth,
    ex.[service]

    The inner select can be accelerated with this index:

CREATE INDEX ix_Helper ON [cts].[exception_Main]
    (productArea, reportable, reportYear, reportMonth, queueID)
    INCLUDE (volume, cost, exceptionDateTime)

    The rationale for the index goes like this:

    Once the start point has been found in the index using a seek, scan until the end, filtering exceptionDateTime with a residual predicate. Because productArea and reportable are constants, the data comes off in the correct order to support a streaming aggregate.

    A simple test harness shows that the theory works and it's mighty quick too:

-- set up sample data
SELECT
    ID,
    reportYear = YEAR(ReportDate),
    reportMonth = MONTH(ReportDate),
    reportable = CASE WHEN ABS(CHECKSUM(NEWID()))%2 = 1 THEN 'Y' ELSE 'N' END,
    queueID = ABS(CHECKSUM(NEWID()))%420,
    exceptionDateTime = DATEADD(DAY, ABS(CHECKSUM(NEWID()))%3, ReportDate),
    productArea = CASE WHEN ABS(CHECKSUM(NEWID()))%2 = 1 THEN 'FXMM' ELSE 'N' END,
    volume = ABS(CHECKSUM(NEWID()))%20,
    cost = ABS(CHECKSUM(NEWID()))%30
INTO #Sample
FROM (
    SELECT ID, ReportDate = DATEADD(MINUTE, 0-ID/20, GETDATE())
    FROM (
        SELECT TOP(14500000)
            ID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
        FROM sys.columns a, sys.columns b, sys.columns c, sys.columns d, sys.columns e
    ) d
) e

CREATE UNIQUE CLUSTERED INDEX ucx_Sample ON #Sample (ID)

CREATE INDEX ix_Helper ON #Sample
    (productArea, reportable, reportYear, reportMonth, queueID)
    INCLUDE (volume, cost, exceptionDateTime)

-- run the (inner) query
SELECT
    reportMonth,
    reportYear,
    queueID,
    SUM(volume) AS Vol,
    SUM(cost) AS Effort
FROM #Sample
WHERE exceptionDateTime >= GETDATE() - 365
    AND productArea = ('FXMM')
    AND reportable = 'Y'
GROUP BY
    reportYear,
    reportMonth,
    queueID

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Wow!

    That's a dramatic increase! Thanks so much

  • ChrisM@Work (8/21/2015)

    The inner select can be accelerated with this index:

CREATE INDEX ix_Helper ON [cts].[exception_Main]
    (productArea, reportable, reportYear, reportMonth, queueID)
    INCLUDE (volume, cost, exceptionDateTime)

But with the best clustered index now on the table, you don't need this extra, fairly large index. You should get as good a response time -- possibly even better, depending on the details -- from a clustered index on exceptionDateTime. That's the huge advantage -- edit: of the best clustered index -- that you don't have to create gazillions of covering indexes for so many queries.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!
