Querying a large table with an odd criteria

Question

Post reply

Querying a large table with an odd criteria

giles.clapham

SSCrazy

Points: 2194
More actions
March 25, 2011 at 8:55 am

#235763

Hi,

I'm looking for a better solutions to the one I've developed, I've been working on it so long that I'm becoming blinkered and need some inspiration.

Basically I have a table, quite narrow but lots of rows. It's made up of ContactIDs and the kind of activity we do with them, I need to be able to return those that have only one specific kind of activity over the past 12 months, in this scenario it's "Email Sent".

There are a little under 62,000,000 rows in the table, approx 50,000,000 are within the past 12 months.

So do this I'm taking a section on the table that I'm interested in into a table variable then querying that rather than the actual table. It's taking so long I've not allowed it to complete.

I've run it on a smaller database where there are only 3.5m rows and it still took 2 hours to complete.

The bit that takes the time is the final bit before returing the results.

The details of what I'm dealing with and how I'm approching it are below.

Thanks

Giles

--Table Schema

Create Table Audits

(

AuditID uniqueidentifier not null primary key,

ContactID uniqueidentifier not null,

CampaignID uniqueidentifier not null

AuditActivityName nvarchar(50) not null,

AuditDate DateTime not null,

AuditLocation uniqueidentifier not null,

AuditClassification nvarchar(50)null,

AdditionalInformation nvarchar(255) null,

IPAddress nvarchar(50)null,

AuditType nvarchar(50) null,

AccountID uniqueidentifier null

LeadID uniqueidentifier null

)

--Created an index

create index ix_AuditDate on g4c_communicationauditextensionbase(AuditDate,ContactID,AuditActivityName)

--Query to find anyone who only have Email Sent as an AuditActivityName in the past 12 months

--Declare some table variables to help manage the data

Declare @AllActivity table(ID int identity(1,1) primary key,CID uniqueidentifier,ActivityName varchar(50))

Declare @ActiveContacts table(CID uniqueidentifier primary key)

Declare @InactiveContacts table (ID int identity (1,1),CID uniqueidentifier)

--get all audits in the past year caputuring the important columns

insert @AllActivity(CID,ActivityName)

Select ContactID,AuditActivityName

from Audits

where g4c_auditdate>dateadd(yy,-1,getdate())

Insert @ActiveContacts

Select Distinct CID

from @AllActivity

where ActivityName<>'Email Sent'

--now find those contacts who are in @AllActivity but not in @ActiveContacts

Insert @InactiveContacts (CID)

Select distinct e.CID

from @AllActivity AA

Left Join @ActiveContacts A on a.CID=AA.CID

Where a.CID is null

--Return the results...

Select * from @InactiveContacts

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Orlando Colamatteo SSC Guru Points: 182276 More actions · Answer 1

OK...I'll take a shot at this...

I noticed something in your Audits table design that should be mentioned...UNIQUEIDENTIFIER might be the worst choice on the planet for the first column in a clustering key IF you are generating new ones using the NEWID() function. In the absence of a defined clustered index SQL Server will use the PRIMARY KEY as the clustering index. Even if you are using a sequential GUID it has the potential to still be a really bad choice. Here's a good start as to why but there are lots of people that have talked about this topic: http://www.sqlskills.com/BLOGS/PAUL/post/Can-GUID-cluster-keys-cause-non-clustered-index-fragmentation.aspx

Regarding your choice of workflow, why isn't something like this sufficient to pull Inactive people?

SELECT ContactID AS CID



FROM    Audits

WHERE   AuditDate > DATEADD(yy, -1, GETDATE())

        AND AuditActivityName <> 'Email Sent'

I would scrap all the table variable stuff (btw, you would have been better off using temp tables instead for this volume of data, read up on "temp tables versus table varaibles"). The query above is still going to do an Index Scan however I think that's the best you're going to do (your insert into @AllActivity also does an index scan) unless you add a new non-clustered index where AuditDate and AuditActivityName are the first two columns and include ContactID so to cover your query.

Consider adding this index to cover the above query:

CREATE INDEX ix_AuditDate_ActivityName__Inc_CID ON dbo.Audits(AuditDate,AuditActivityName) INCLUDE (ContactID)

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Steve Jones - SSC Editor SSC Guru Points: 728159 More actions · Answer 2

Steve Jones - SSC Editor

SSC Guru

Points: 728159

March 25, 2011 at 10:50 am

#1303745

Also, what is in ActivityName besides "email sent"

If you can avoid the <>, you can make this run more efficiently.

Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

giles.clapham SSCrazy Points: 2194 More actions · Answer 3

Thanks for your advice, my hand is forced on the UniqueIdentifier as it MSCRM I'm affraid.

The simpler suggestion you mention will only pull in those that have other activity other than Email Sent, the opposite of what I need.

I was toying with using Temp tables rather than Table Variables but I was under the impression that Table Variable automatically use the TempDB when the data got too big. I'll give that a go though and try the included index oppossed to the composite index I was trying.

In answer to Steves question there are 12 different Action Types.

I'll post my findings as I get them.

Cheers

Giles

Orlando Colamatteo SSC Guru Points: 182276 More actions · Answer 4

giles.clapham (3/25/2011)

Thanks for your advice, my hand is forced on the UniqueIdentifier as it MSCRM I'm affraid.

Yikes :crazy: MS = Microsoft? Which version of CRM?

giles.clapham (3/25/2011)

The simpler suggestion you mention will only pull in those that have other activity other than Email Sent, the opposite of what I need.

OK, if you add the index we talked about and use this query with = instead of <> then you should have a well-performing query:

SELECT ContactID AS CID



FROM    Audits

WHERE   AuditDate > DATEADD(yy, -1, GETDATE())

AND AuditActivityName = 'Email Sent'

Data in a table variable may get written to tempdb at some point (I don't know of clear docs on what factors make this occur) however the general consensus is to keep your own threshold, opinions vary but 1000 rows is a decent choice, where if you know you will exceed that number of rows you should opt for a temp table. You'll find plenty of info online debating "tables versus variables" and what the magic number should be.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

giles.clapham SSCrazy Points: 2194 More actions · Answer 5

giles.clapham

SSCrazy

Points: 2194

March 30, 2011 at 4:58 am

#1305334

Hello,

Sorry it's taken a while to get back with my results.

The revised index and the temp tables got the query down under half an hour, which in the environment is acceptable.

So thanks very much and I won't dismiss Temp tables so quickly next time around.

Giles

giles.clapham SSCrazy Points: 2194 More actions · Answer 6

Just a quick note about the where clause you suggested.

AuditActivityName = 'Email Sent'

wouldn't work as I only want people whose ONLY activity was Email Sent. The Where clause is row specific so they could have had 'Email Sent' in one row and 'Booked' on another. Hence finding everyone who had an entry other than 'Email Sent' and then removing those from the final result.

Orlando Colamatteo SSC Guru Points: 182276 More actions · Answer 7

I am happy you were able to find a workable solution.

In the spirit of completely avoiding temp tables, hopefully improving performance, would something like this work?

SELECT a.ContactID AS CID



FROM    Audits a

WHERE   a.AuditDate > DATEADD(yy, -1, GETDATE())

        AND a.AuditActivityName = 'Email Sent'

        AND NOT EXISTS ( SELECT *

                         FROM   Audits

                         WHERE  ContactID = a.ContactID

                                AND AuditDate > DATEADD(yy, -1, GETDATE())

AND AuditActivityName <> 'Email Sent' ) ;

Caveat: This will only perform well against a large dataset if there is an index on Audits where ContactID is the first column in the index. What other indexes are on the Audits table?

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

giles.clapham SSCrazy Points: 2194 More actions · Answer 8

giles.clapham

SSCrazy

Points: 2194

March 30, 2011 at 5:18 am

#1305349

Of Course, Not Exists!

I'll give that a go as although 30 mins is acceptable if we can get it down then it's got to be worth the investigation.

Thanks

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 9

1) it makes NO sense to simply put data into a temp object just to select it out to the calling entity

2) are you REALLY trying to return 50M rows to the calling entity??

3) no index will be helpful here if you plan on checking 80% of the data in the table (50M/62M) UNLESS the number of email sent rows is VERY LOW percentage of total. Very low in this case meaning like 1% or less. Outside of that, table scan will be (properly) chosen by the optimizer.

4) you can probably to do everything you want in a single pass over the table. If so, I would do that.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

paul.jones SSC Eights! Points: 804 More actions · Answer 10

A few thoughts:

You are using table variables. Is it possible to use temporary tables instead? I think (but can't remember where I read it) the optimiser will assume the table variable has 1 row, so in the final left join it will use a nested loop. If each table has about 50,000,000 rows and neither table has a relevant index it will have to loop through 50,000,000 x 50,000,000 combinations, which is slow. With a temporary table even with no indexes it would at least know the tables were big and choose a different join method.

In the final step, if you used something based on NOT EXISTS it wouldn't have to carry out the entire join then pick the rows that didn't have a match, i.e.

Insert @InactiveContacts (CID)



Select distinct AA.CID

from @AllActivity AA

WHERE NOT EXISTS (SELECT * FROM @ActiveContacts A WHERE a.CID=AA.CID)

See Gail's excellent article http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/

Even if you can't switch to a temp table, you could try changing the primary key to store the data ordered on the fields you want to join on - e.g.

Declare @AllActivity table(ID int identity(1,1),CID uniqueidentifier,ActivityName varchar(50), primary key (CID, ID))

WayneS SSC Guru Points: 95405 More actions · Answer 11

giles.clapham (3/25/2011)

Insert @InactiveContacts (CID)

Select distinct e.CID

from @AllActivity AA

Left Join @ActiveContacts A on a.CID=AA.CID

Where a.CID is null

Remember that with table variables, SQL can't create statistics, so it knows nothing about the rows/selectivity/distribution. It treats table variables has having just one row, so the query plan is best for table variables having an extremely low number of rows. With the millions that you are talking about, other query plans would be better (almost ANY other plan would be better!). For more information, this article would be a good read: Comparing Table Variables to Temporary Tables[/url].

Edit: Personally, I have found that you should rarely, if ever, perform a join operation against a table variable. There was an example yesterday of a table variable with just one row causing issues when joined to another table.

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

WayneS SSC Guru Points: 95405 More actions · Answer 12

opc.three (3/25/2011)

In the absence of a defined clustered index SQL Server will use the PRIMARY KEY as the clustering index.

Not so. A table either has a clustered index (clustering table), or it doesn't (heap table). If it has a clustered index, then the keys of that clustered index (and if necessary an internal "uniqueifier" column if it's not a UNIQUE or PK) are added to each non-clustered index. If the table is a heap, then each non-clustered index stores data page/row information instead - not keys from a non-clustered primary key.

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Orlando Colamatteo SSC Guru Points: 182276 More actions · Answer 13

WayneS (3/31/2011)

opc.three (3/25/2011)

In the absence of a defined clustered index SQL Server will use the PRIMARY KEY as the clustering index.

Not so. A table either has a clustered index (clustering table), or it doesn't (heap table). If it has a clustered index, then the keys of that clustered index (and if necessary an internal "uniqueifier" column if it's not a UNIQUE or PK) are added to each non-clustered index. If the table is a heap, then each non-clustered index stores data page/row information instead - not keys from a non-clustered primary key.

Wayne, I hear what you're saying, and you are absolutely correct, but it's not exactly relevant to my comment. I concede I could have worded my post better but you took my words out of context. Please re-read my post. You'll find that I was commenting on the DDL in the initial post creating the Audits table. You'll notice that a clustered index was not defined as part the CREATE TABLE statement and I was pointing out that in the absence of a, [explicitly] defined clustered index SQL Server will use the defined PRIMARY KEY, in this case AuditID, as the clustering index which is default behavior. I pointed this out in case it was not intended to have a column with data type UNIQUEIDENTIFIER as a single-column clustered index on the table...this was before I knew it was not the original poster's table design.

To illustrate, if I run the DDL from the initial post:

CREATE TABLE Audits



(

 AuditID UNIQUEIDENTIFIER NOT NULL

                          PRIMARY KEY,

 ContactID UNIQUEIDENTIFIER NOT NULL,

 CampaignID UNIQUEIDENTIFIER NOT NULL, -- Note: comma added by opc.three

 AuditActivityName NVARCHAR(50) NOT NULL,

 AuditDate DATETIME NOT NULL,

 AuditLocation UNIQUEIDENTIFIER NOT NULL,

 AuditClassification NVARCHAR(50) NULL,

 AdditionalInformation NVARCHAR(255) NULL,

 IPAddress NVARCHAR(50) NULL,

 AuditType NVARCHAR(50) NULL,

 AccountID UNIQUEIDENTIFIER NULL,  -- Note: comma added by opc.three

 LeadID UNIQUEIDENTIFIER NULL

) ;

A quick check of the index structure after running the CREATE TABLE statement above reveals that SQL Server did in fact utilize the defined PK as the clustered index which supports my post:

SELECT OBJECT_NAME(i.object_id) AS table_name,



        i.name AS index_name,

        i.is_primary_key,

        i.type_desc,

        c.name AS column_name

FROM    sys.indexes i

        JOIN sys.index_columns ic ON i.object_id = ic.object_id

                                     AND i.index_id = ic.index_id

        JOIN sys.columns c ON ic.object_id = c.object_id

                              AND ic.column_id = c.column_id

WHERE OBJECT_NAME(i.object_id) = N'Audits' ;

Results of above query:

table_name = Audits

index_name = PK__Audits__A17F23B81273C1CD

is_primary_key = 1

type_desc = CLUSTERED

column_name = AuditID

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 14

opc.three (3/31/2011)

You'll notice that a clustered index was not defined as part the CREATE TABLE statement and I was pointing out that in the absence of a, [explicitly] defined clustered index SQL Server will use the defined PRIMARY KEY, in this case AuditID, as the clustering index which is default behavior.

OPC, you are aware that an un-specified Primary Key definition is the clustered index by default, correct? It's PRIMARY KEY CLUSTERED. You only add on NONCLUSTERED when you want it differently. Thus, you're both saying the same thing, just very different approaches and I fear yours might confuse the newbie.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA