Index suggestion for VERY large table

  • I have a reporting table with the previous two years' history of sales for selected markets and products. The table has 125 million rows, and will grow at about 5 million rows per month. All future months will be added at the 'bottom' of the table -- that is, there will be no data for past time periods inserted. At the moment the table structure is as follows:

    CREATE TABLE [dbo].[IRI](

    [Geography] [varchar](100) NULL,

    [Product] [varchar](100) NULL,

    [AsOf] [int] NULL,

    [UPC_1] [bigint] NULL,

    [WtdDist_1] [decimal](16, 8) NULL,

    [DSales_1] [decimal](16, 8) NULL,

    [CSales_1] [decimal](16, 8) NULL,

    [USales_1] [decimal](16, 8) NULL,

    [ID] [bigint] IDENTITY(1,1) NOT NULL,

    [GEOID] [int] NULL,

    [PRODUCTID] [int] NULL,

    CONSTRAINT [PK_IRI_ID] PRIMARY KEY CLUSTERED

    (

    [ID] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    In an ideal world, GeoID and ProductID would not be nullable, but loading the data took precedence, so I have a problem I will later need to solve. GeoID is fully populated, and used in the query below; ProductID has some null values, and is not used in the query.

    I also am not pleased that the clustered index is the identity column.

    I think I would prefer it to be a combination of GeoID (the market)/ProductID/AsOf (time period, represented as 20090117).

    A typical query is:

    SELECT a.Geography

    ,a.Product

    ,a.AsOf

    ,a.UPC_1

    ,a.WtdDist_1

    ,a.Dsales_1

    ,a.Csales_1

    ,a.Usales_1

    ,b.Brand

    ,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2) ))

    AS Date

    FROM IRI a

    JOIN #t b ON a.UPC_1=b.UPC_1_A

    WHERE a.AsOf IN

    ( SELECT [dtInt]

    FROM [model].[dbo].[Calendar]

    WHERE dt BETWEEN @AsOf-30 AND @AsOf+30

    )

    AND Geography=@Geo

    The temp table #t is loaded before with the relevant subset of products, and is typically about 3-20 line items, so is very quick.

    I created one index:

    CREATE NONCLUSTERED INDEX [InNc_IRI_UPCwAsOfGeoID] ON dbo.IRI

    (

    [UPC_1] ASC

    )

    INCLUDE ( [AsOf],[GEOID])

    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    My reasoning was UPC_1 has the highest number of unique values, GeoID the next most, then AsOf.

    MY QUESTION: Does anyone have a suggestion for an index that would be optimal for this query. Many thanks. D. Lewis

  • Have you considered partitioning this table by date? If that's not a option then you will want either a clustered or non-clustered index on the date field. At my place of business, we recently took a table of 500 million records (and growing) and changed the clustered index to the date field which made huge improvements in most queries and didn't have major impact on other queries being run. But, for the long term strategy, we're considering breaking the table into "current data" (most recent 5 years) vs. "historical data" (all history) and then partitioning the current data.

  • Yes, I am considering partioning it, but I cannot begin thinking about that for at least a few months. The loading etc. of this table was so stressful (on both people and systems) that I am very reluctant to simply add an index without some good reason. I suspect, too, that an index on the date field would be 'good', but I would be thankful for some specific advice based on the query I posted.

    For example, is the index with the included columns really helping things? I don't have the experience to know. Is there a better candidate given the query/table structure I posted? TIA D. Lewis

  • You should define a clustered index on GeoID, ProductID, AsOf just as you suggested yourself.

    You should also change the query so the SQL server understands that you are really asking for a range in AsOf. The current join with the Calender is bad for perf.

    Perhaps you could use:

    WHERE a.AsOf BETWEEN @AsOf-30 AND @AsOf+30

    This will probably give you pretty good perf as long as you ask for a small set of products.

    To get even petter perf you should also partition the table on AsOf. Note that to gain anything from the partitioning it is vital that you change the query as I described above.

    If you dont want to rebuild the clustered index right now, you could also create a noncustered index with GeoID, ProductID, AsOf. For queries that return a small number of rows this index will be very good.

  • Stefan_G (4/8/2010)


    You should define a clustered index on GeoID, ProductID, AsOf just as you suggested yourself.

    You should also change the query so the SQL server understands that you are really asking for a range in AsOf. The current join with the Calender is bad for perf.

    Perhaps you could use:

    WHERE a.AsOf BETWEEN @AsOf-30 AND @AsOf+30

    This will probably give you pretty good perf as long as you ask for a small set of products.

    To get even petter perf you should also partition the table on AsOf. Note that to gain anything from the partitioning it is vital that you change the query as I described above.

    If you dont want to rebuild the clustered index right now, you could also create a noncustered index with GeoID, ProductID, AsOf. For queries that return a small number of rows this index will be very good.

    Just a thought... with that particular clustered index, the addition of new rows can and will cause page splits. You sure you want to do a rebuild on 125 million rows to fix that now and again?

    To prevent the page splits, save the clustered index for something that matches the temporal nature of the inserts so the new rows are always added near the logical "end" of the table. Use an unclustered index in place of the clustered index you propose.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff,

    I just read your response and I appreciate you input.

    Regards,

    Welsh

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • I wondered about that (page splits) last night as I couldn't sleep. Thanks for confirming it.

    In the meantime, an index on AsOf (the date INT field) improved the query performance immensely, so I have a little breathing room to work out a better table and index structure.

    Thanks. D. Lewis

  • Have you also considered limitingthe dates via a join instead of the where criteria? A join is typically more powerful than a between or in list.

    SELECT a.Geography

    ,a.Product

    ,a.AsOf

    ,a.UPC_1

    ,a.WtdDist_1

    ,a.Dsales_1

    ,a.Csales_1

    ,a.Usales_1

    ,b.Brand

    ,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2) ))

    AS Date

    FROM IRI a

    inner join (Select [dtInt] from model.dbo.Calendar where dt between @AsOf-30 AND @AsOf+30) dt

    on a.AsOf = dt.dtInt

    JOIN #t b ON a.UPC_1=b.UPC_1_A

    WHERE Geography=@Geo

  • Thanks. This morning's task is to refine that date lookup method. I am going to test your join suggestion vs a CTE vs my original between. D. Lewis

  • Results of performance testing with and without a CTE:

    Without (the question was whether a BETWEEN clause for the date would slow down things):

    SELECT a.Geography

    ,a.Product

    ,a.AsOf

    ,a.UPC_1

    ,a.WtdDist_1

    ,a.Dsales_1

    ,a.Csales_1

    ,a.Usales_1

    ,b.Brand

    ,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2) ))

    AS Date

    FROM IRI a

    JOIN #t b ON a.UPC_1=b.UPC_1_A

    WHERE a.AsOf BETWEEN @dmin AND @dmax

    AND a.GeoID=@Geography

    resulted in:

    Table 'IRI'. Scan count 6, logical reads 477, physical reads 4, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    ----------------------------------------------

    With CTE:

    WITH asof_cte(dtInt)

    AS

    (

    SELECT [dtInt]

    FROM [model].[dbo].[Calendar]

    WHERE dt BETWEEN @AsOf-30 AND @AsOf+30

    )

    SELECT a.Geography

    ,a.Product

    ,a.AsOf

    ,a.UPC_1

    ,a.WtdDist_1

    ,a.Dsales_1

    ,a.Csales_1

    ,a.Usales_1

    ,b.Brand

    ,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2) ))

    AS Date

    FROM IRI a

    JOIN #t b ON a.UPC_1=b.UPC_1_A

    JOIN asof_cte c ON a.AsOf=c.dtInt

    WHERE a.Geoid=@Geography

    resulted in

    Table 'IRI'. Scan count 6, logical reads 3500, physical reads 8, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    So in this instance, the BETWEEN clause is a clear winner.

  • Awesome! Thanks for posting!

  • Welsh Corgi (4/8/2010)


    Jeff,

    I just read your response and I appreciate you input.

    Regards,

    Welsh

    Thanks, Welsh.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hey folks... as a sidebar to all of this, since the Integer date is in the yyyymmdd ISO format, the formula to convert the Integer date to the DATETIME datatype needn't be so complicated. The following will do the trick just fine...

    DECLARE @IntDate INT

    SELECT @IntDate = 20090117

    SELECT CAST(CAST(@IntDate AS CHAR(8)) AS DATETIME)

    There's also a very high speed method that uses no character based conversions but I can't put my hands on it just now (still looking for it).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Ah, yes. Thanks. I suspected there was a better way...

  • Jeff Moden (4/9/2010)


    There's also a very high speed method that uses no character based conversions but I can't put my hands on it just now (still looking for it).

    Please post that method if you find it. Sounds interesting.

    /SG

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply