Querying a large table with an odd criteria

  • Hi,

    I'm looking for a better solutions to the one I've developed, I've been working on it so long that I'm becoming blinkered and need some inspiration.

    Basically I have a table, quite narrow but lots of rows. It's made up of ContactIDs and the kind of activity we do with them, I need to be able to return those that have only one specific kind of activity over the past 12 months, in this scenario it's "Email Sent".

    There are a little under 62,000,000 rows in the table, approx 50,000,000 are within the past 12 months.

    So do this I'm taking a section on the table that I'm interested in into a table variable then querying that rather than the actual table. It's taking so long I've not allowed it to complete.

    I've run it on a smaller database where there are only 3.5m rows and it still took 2 hours to complete.

    The bit that takes the time is the final bit before returing the results.

    The details of what I'm dealing with and how I'm approching it are below.

    Thanks

    Giles

    --Table Schema

    Create Table Audits

    (

    AuditID uniqueidentifier not null primary key,

    ContactID uniqueidentifier not null,

    CampaignID uniqueidentifier not null

    AuditActivityName nvarchar(50) not null,

    AuditDate DateTime not null,

    AuditLocation uniqueidentifier not null,

    AuditClassification nvarchar(50)null,

    AdditionalInformation nvarchar(255) null,

    IPAddress nvarchar(50)null,

    AuditType nvarchar(50) null,

    AccountID uniqueidentifier null

    LeadID uniqueidentifier null

    )

    --Created an index

    create index ix_AuditDate on g4c_communicationauditextensionbase(AuditDate,ContactID,AuditActivityName)

    --Query to find anyone who only have Email Sent as an AuditActivityName in the past 12 months

    --Declare some table variables to help manage the data

    Declare @AllActivity table(ID int identity(1,1) primary key,CID uniqueidentifier,ActivityName varchar(50))

    Declare @ActiveContacts table(CID uniqueidentifier primary key)

    Declare @InactiveContacts table (ID int identity (1,1),CID uniqueidentifier)

    --get all audits in the past year caputuring the important columns

    insert @AllActivity(CID,ActivityName)

    Select ContactID,AuditActivityName

    from Audits

    where g4c_auditdate>dateadd(yy,-1,getdate())

    Insert @ActiveContacts

    Select Distinct CID

    from @AllActivity

    where ActivityName<>'Email Sent'

    --now find those contacts who are in @AllActivity but not in @ActiveContacts

    Insert @InactiveContacts (CID)

    Select distinct e.CID

    from @AllActivity AA

    Left Join @ActiveContacts A on a.CID=AA.CID

    Where a.CID is null

    --Return the results...

    Select * from @InactiveContacts

  • OK...I'll take a shot at this...

    I noticed something in your Audits table design that should be mentioned...UNIQUEIDENTIFIER might be the worst choice on the planet for the first column in a clustering key IF you are generating new ones using the NEWID() function. In the absence of a defined clustered index SQL Server will use the PRIMARY KEY as the clustering index. Even if you are using a sequential GUID it has the potential to still be a really bad choice. Here's a good start as to why but there are lots of people that have talked about this topic: http://www.sqlskills.com/BLOGS/PAUL/post/Can-GUID-cluster-keys-cause-non-clustered-index-fragmentation.aspx

    Regarding your choice of workflow, why isn't something like this sufficient to pull Inactive people?

    SELECT ContactID AS CID

    FROM Audits

    WHERE AuditDate > DATEADD(yy, -1, GETDATE())

    AND AuditActivityName <> 'Email Sent'

    I would scrap all the table variable stuff (btw, you would have been better off using temp tables instead for this volume of data, read up on "temp tables versus table varaibles"). The query above is still going to do an Index Scan however I think that's the best you're going to do (your insert into @AllActivity also does an index scan) unless you add a new non-clustered index where AuditDate and AuditActivityName are the first two columns and include ContactID so to cover your query.

    Consider adding this index to cover the above query:

    CREATE INDEX ix_AuditDate_ActivityName__Inc_CID ON dbo.Audits(AuditDate,AuditActivityName) INCLUDE (ContactID)

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Also, what is in ActivityName besides "email sent"

    If you can avoid the <>, you can make this run more efficiently.

  • Thanks for your advice, my hand is forced on the UniqueIdentifier as it MSCRM I'm affraid.

    The simpler suggestion you mention will only pull in those that have other activity other than Email Sent, the opposite of what I need.

    I was toying with using Temp tables rather than Table Variables but I was under the impression that Table Variable automatically use the TempDB when the data got too big. I'll give that a go though and try the included index oppossed to the composite index I was trying.

    In answer to Steves question there are 12 different Action Types.

    I'll post my findings as I get them.

    Cheers

    Giles

  • giles.clapham (3/25/2011)


    Thanks for your advice, my hand is forced on the UniqueIdentifier as it MSCRM I'm affraid.

    Yikes :crazy: MS = Microsoft? Which version of CRM?

    giles.clapham (3/25/2011)


    The simpler suggestion you mention will only pull in those that have other activity other than Email Sent, the opposite of what I need.

    OK, if you add the index we talked about and use this query with = instead of <> then you should have a well-performing query:

    SELECT ContactID AS CID

    FROM Audits

    WHERE AuditDate > DATEADD(yy, -1, GETDATE())

    AND AuditActivityName = 'Email Sent'

    Data in a table variable may get written to tempdb at some point (I don't know of clear docs on what factors make this occur) however the general consensus is to keep your own threshold, opinions vary but 1000 rows is a decent choice, where if you know you will exceed that number of rows you should opt for a temp table. You'll find plenty of info online debating "tables versus variables" and what the magic number should be.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Hello,

    Sorry it's taken a while to get back with my results.

    The revised index and the temp tables got the query down under half an hour, which in the environment is acceptable.

    So thanks very much and I won't dismiss Temp tables so quickly next time around.

    Giles

  • Just a quick note about the where clause you suggested.

    AuditActivityName = 'Email Sent'

    wouldn't work as I only want people whose ONLY activity was Email Sent. The Where clause is row specific so they could have had 'Email Sent' in one row and 'Booked' on another. Hence finding everyone who had an entry other than 'Email Sent' and then removing those from the final result.

  • I am happy you were able to find a workable solution.

    In the spirit of completely avoiding temp tables, hopefully improving performance, would something like this work?

    SELECT a.ContactID AS CID

    FROM Audits a

    WHERE a.AuditDate > DATEADD(yy, -1, GETDATE())

    AND a.AuditActivityName = 'Email Sent'

    AND NOT EXISTS ( SELECT *

    FROM Audits

    WHERE ContactID = a.ContactID

    AND AuditDate > DATEADD(yy, -1, GETDATE())

    AND AuditActivityName <> 'Email Sent' ) ;

    Caveat: This will only perform well against a large dataset if there is an index on Audits where ContactID is the first column in the index. What other indexes are on the Audits table?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Of Course, Not Exists!

    I'll give that a go as although 30 mins is acceptable if we can get it down then it's got to be worth the investigation.

    Thanks

  • 1) it makes NO sense to simply put data into a temp object just to select it out to the calling entity

    2) are you REALLY trying to return 50M rows to the calling entity??

    3) no index will be helpful here if you plan on checking 80% of the data in the table (50M/62M) UNLESS the number of email sent rows is VERY LOW percentage of total. Very low in this case meaning like 1% or less. Outside of that, table scan will be (properly) chosen by the optimizer.

    4) you can probably to do everything you want in a single pass over the table. If so, I would do that.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • A few thoughts:

    You are using table variables. Is it possible to use temporary tables instead? I think (but can't remember where I read it) the optimiser will assume the table variable has 1 row, so in the final left join it will use a nested loop. If each table has about 50,000,000 rows and neither table has a relevant index it will have to loop through 50,000,000 x 50,000,000 combinations, which is slow. With a temporary table even with no indexes it would at least know the tables were big and choose a different join method.

    In the final step, if you used something based on NOT EXISTS it wouldn't have to carry out the entire join then pick the rows that didn't have a match, i.e.

    Insert @InactiveContacts (CID)

    Select distinct AA.CID

    from @AllActivity AA

    WHERE NOT EXISTS (SELECT * FROM @ActiveContacts A WHERE a.CID=AA.CID)

    See Gail's excellent article http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/

    Even if you can't switch to a temp table, you could try changing the primary key to store the data ordered on the fields you want to join on - e.g.

    Declare @AllActivity table(ID int identity(1,1),CID uniqueidentifier,ActivityName varchar(50), primary key (CID, ID))

  • giles.clapham (3/25/2011)


    Insert @InactiveContacts (CID)

    Select distinct e.CID

    from @AllActivity AA

    Left Join @ActiveContacts A on a.CID=AA.CID

    Where a.CID is null

    Remember that with table variables, SQL can't create statistics, so it knows nothing about the rows/selectivity/distribution. It treats table variables has having just one row, so the query plan is best for table variables having an extremely low number of rows. With the millions that you are talking about, other query plans would be better (almost ANY other plan would be better!). For more information, this article would be a good read: Comparing Table Variables to Temporary Tables[/url].

    Edit: Personally, I have found that you should rarely, if ever, perform a join operation against a table variable. There was an example yesterday of a table variable with just one row causing issues when joined to another table.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • opc.three (3/25/2011)


    In the absence of a defined clustered index SQL Server will use the PRIMARY KEY as the clustering index.

    Not so. A table either has a clustered index (clustering table), or it doesn't (heap table). If it has a clustered index, then the keys of that clustered index (and if necessary an internal "uniqueifier" column if it's not a UNIQUE or PK) are added to each non-clustered index. If the table is a heap, then each non-clustered index stores data page/row information instead - not keys from a non-clustered primary key.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (3/31/2011)


    opc.three (3/25/2011)


    In the absence of a defined clustered index SQL Server will use the PRIMARY KEY as the clustering index.

    Not so. A table either has a clustered index (clustering table), or it doesn't (heap table). If it has a clustered index, then the keys of that clustered index (and if necessary an internal "uniqueifier" column if it's not a UNIQUE or PK) are added to each non-clustered index. If the table is a heap, then each non-clustered index stores data page/row information instead - not keys from a non-clustered primary key.

    Wayne, I hear what you're saying, and you are absolutely correct, but it's not exactly relevant to my comment. I concede I could have worded my post better but you took my words out of context. Please re-read my post. You'll find that I was commenting on the DDL in the initial post creating the Audits table. You'll notice that a clustered index was not defined as part the CREATE TABLE statement and I was pointing out that in the absence of a, [explicitly] defined clustered index SQL Server will use the defined PRIMARY KEY, in this case AuditID, as the clustering index which is default behavior. I pointed this out in case it was not intended to have a column with data type UNIQUEIDENTIFIER as a single-column clustered index on the table...this was before I knew it was not the original poster's table design.

    To illustrate, if I run the DDL from the initial post:

    CREATE TABLE Audits

    (

    AuditID UNIQUEIDENTIFIER NOT NULL

    PRIMARY KEY,

    ContactID UNIQUEIDENTIFIER NOT NULL,

    CampaignID UNIQUEIDENTIFIER NOT NULL, -- Note: comma added by opc.three

    AuditActivityName NVARCHAR(50) NOT NULL,

    AuditDate DATETIME NOT NULL,

    AuditLocation UNIQUEIDENTIFIER NOT NULL,

    AuditClassification NVARCHAR(50) NULL,

    AdditionalInformation NVARCHAR(255) NULL,

    IPAddress NVARCHAR(50) NULL,

    AuditType NVARCHAR(50) NULL,

    AccountID UNIQUEIDENTIFIER NULL, -- Note: comma added by opc.three

    LeadID UNIQUEIDENTIFIER NULL

    ) ;

    A quick check of the index structure after running the CREATE TABLE statement above reveals that SQL Server did in fact utilize the defined PK as the clustered index which supports my post:

    SELECT OBJECT_NAME(i.object_id) AS table_name,

    i.name AS index_name,

    i.is_primary_key,

    i.type_desc,

    c.name AS column_name

    FROM sys.indexes i

    JOIN sys.index_columns ic ON i.object_id = ic.object_id

    AND i.index_id = ic.index_id

    JOIN sys.columns c ON ic.object_id = c.object_id

    AND ic.column_id = c.column_id

    WHERE OBJECT_NAME(i.object_id) = N'Audits' ;

    Results of above query:

    table_name = Audits

    index_name = PK__Audits__A17F23B81273C1CD

    is_primary_key = 1

    type_desc = CLUSTERED

    column_name = AuditID

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (3/31/2011)


    You'll notice that a clustered index was not defined as part the CREATE TABLE statement and I was pointing out that in the absence of a, [explicitly] defined clustered index SQL Server will use the defined PRIMARY KEY, in this case AuditID, as the clustering index which is default behavior.

    OPC, you are aware that an un-specified Primary Key definition is the clustered index by default, correct? It's PRIMARY KEY CLUSTERED. You only add on NONCLUSTERED when you want it differently. Thus, you're both saying the same thing, just very different approaches and I fear yours might confuse the newbie.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply