Querying Same Number of Rows in 2 tables

  • I've got a big table. A VERY big table. 20.2 billion (yes, Billion) rows, 2.7 Terabytes.

    It's got a clustered index on (date column, uniqueid)

    Table is partitioned by month.

    If I run the following query:

    select col1
        , count(col2) as count_rec
        , avg(col2) as avg_rec
    from BIG_table
    where date_col >= '01-MAR-2008' and date_col < '01-APR-2008'
    group by col1
    order by col1

    it takes between 2.5 hours and 3.5 hours to run, scanning about 900 million rows (the approximate number of March records).

    If I copy those same March records into a separate heap table and run the same query, it completes in about 18 minutes.
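
    For reference, here's roughly how I've been comparing the two runs - just a sketch using SET STATISTICS, where the heap copy's name is simply what I happened to call it:

    set statistics io on;
    set statistics time on;

    -- partial clustered index scan of the partitioned table
    select col1, count(col2) as count_rec, avg(col2) as avg_rec
    from BIG_table
    where date_col >= '01-MAR-2008' and date_col < '01-APR-2008'
    group by col1
    order by col1;

    -- full scan of the March-only heap copy (illustrative name)
    select col1, count(col2) as count_rec, avg(col2) as avg_rec
    from BIG_table_Mar2008
    group by col1
    order by col1;

    set statistics io off;
    set statistics time off;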

    Since the BIG_table has a partitioned, clustered index on the date column, I would expect this to require almost exactly the same number of I/Os as the scan of the small, heap table. Even if the BIG_table was NOT partitioned, I'd still expect the clustered index to successfully limit the required I/O so that there was negligible difference between the performance of the two queries.

    Have I missed something obvious, or is anyone else puzzled by this?

  • If you have this table in a development environment, have you tried creating an index like this:

    create index IX_DateColCol1Col2 on dbo.BIG_table
    (
        datecol asc,
        col1 asc
    )
    include (col2)

    (Hopefully the syntax is right; I did this off the top of my head.)

    On such a large table, this may take a while to create.

    😎

  • You are getting the correct number of scans, and I assume the plan is indicating an index seek and not an index or table scan.

    So, it could simply be fragmentation. 20 billion rows leaves a lot of room for data being in the wrong place. When was the last time you re-organized your index?

  • My first thought was fragmentation and/or out-of-date statistics. The new heap table wouldn't have any fragmentation, so it makes sense it would be faster.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    There's no index on col1 or col2, yet you're aggregating and grouping on those.

    Doing a copy into a heap and then putting an appropriate index on it would really make it fly (a rough sketch is below). So would the index that Lynn suggested. If the underlying tables are updated on a "backdated" basis, having an additional index on them might become prohibitive for maintenance. The clustered key based on date, like you have, makes the WHERE criteria pretty fast, which is why copying to a heap is so much faster.
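
    Something along these lines is what I mean - the copy table name is invented, and the index is just one reasonable choice for this particular GROUP BY:

    -- copy only the March rows out to a heap (illustrative name)
    select col1, col2
    into dbo.BIG_table_Mar2008
    from dbo.BIG_table
    where date_col >= '01-MAR-2008' and date_col < '01-APR-2008';

    -- index to support the GROUP BY on col1, carrying col2 along
    create index IX_Mar2008_Col1 on dbo.BIG_table_Mar2008 (col1) include (col2);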

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks for the suggestions so far.

    Michael - sadly, it simply isn't practical to re-organize the index on such a large table. I know it's partitioned, but the SAN behind it isn't the best-performing SAN you'll ever come across, and experience has taught me that even attempting to reorganise the index is doomed to failure. Having said all that, the data is added pretty much in clustered index order (i.e. the Date column is very close to sequentially increasing - not perfectly, but very nearly.) Since we never delete records from this table, I'd expect the degree of fragmentation to be quite low. I'll double-check this shortly (if I can get the management views to return the results in less than a week!)
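
    A sketch of the sort of check I have in mind - a LIMITED-mode scan is far cheaper than DETAILED on a table this size, and the object name here is just illustrative:

    -- fragmentation for each partition of the clustered index, without reading the leaf pages
    select index_id, partition_number, avg_fragmentation_in_percent, page_count
    from sys.dm_db_index_physical_stats
         (db_id(), object_id('dbo.BIG_table'), 1, null, 'LIMITED');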

    Oh, and the statistics are updated on the table every 30 minutes by a scheduled job, although it's such a simple query in my example that I don't believe statistics would affect the outcome one way or another - it's going to do a partial clustered index scan come what may.

    Jeff, no, there isn't an index on col1 or col2, but that's the case for both tables, so that in itself doesn't explain the difference in speed between them. I'm not sure I understood your comment about the clustered index. The clustered index on the BIG table is scanned for all the March records, and so (theoretically) should require the same number of I/Os as a heap table scan of the small table. The rows in the heap table won't be in order, but that doesn't matter since it needs to read them all in any case.

    The extra disk space required to build an additional NC index on ANY column in this BIG table is massive (I tried it once), and we simply don't have that much space available. It also drastically slows down the import process that writes records to the BIG table (from ~30 seconds to several hours).

    The question you asked was why the copy/query to a heap was so much faster... I answered that. If you put the index on the original table that Lynn recommended, the query without the copy to the heap would be faster.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks again for the reply Jeff.

    I didn't actually ask why the copy to the heap was faster. In fact, the copy takes roughly the same length of time as the query on the BIG table (which makes sense, since they both use the clustered index on the BIG table, with some extra overhead for the copy when it comes to writing the records to the new table). What I was trying to ask was why, once I've got my 2 tables, it is so much faster to return all the March records from the small heap table than it is to return all the March records from the BIG table using a clustered index?

    I entirely agree with you (and Lynn) that there are better indexes that could be added to make these queries optimal - sadly, I don't have that luxury in reality due to the sheer size of the table. I'm not looking to make the query as fast as possible - I was just curious about why the difference in timings should be so marked, as I couldn't think of an explanation.

    I'm currently trying to get fragmentation information for the BIG table as several people have already hinted. It may be that this is the cause of the problem.

  • Yes, looks like my expectations for the likely fragmentation in this table were completely wrong. It's 98.9% fragmented at the leaf level in the clustered index (across all partitions), which would indeed account for the differences in performance.

    Thanks to those who suggested I check that out.

    Next problem is what the hell can I do about it! (Ever tried defragging a 20 billion row table? Even the individual partitions have 900 million rows in them, and you can't rebuild an individual partition online. *sob* )
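
    The only partition-at-a-time options I can see are along these lines (the index name is a placeholder for our clustered index, and 29 is the March 2008 partition):

    -- reorganize just one partition: always an online operation, but slow and single-threaded
    alter index CIX_BIG_table on dbo.BIG_table
        reorganize partition = 29;

    -- or rebuild that one partition: faster, but offline in SQL Server 2005
    alter index CIX_BIG_table on dbo.BIG_table
        rebuild partition = 29;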

  • Would it be possible to dump the data out of a partition into a separate table, then dump the master, then reload from the separate table? Done correctly, this should allow you to reduce the fragmentation on a range of records at a time. I don't know if that will work with your situation.
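
    One way to do that a partition at a time is with partition switching - a rough sketch with invented names, where the staging table has to exactly match the partition's structure, filegroup and check constraint:

    -- 1) switch the fragmented partition out to an empty staging table (metadata-only)
    alter table dbo.BIG_table
        switch partition 29 to dbo.BIG_table_staging;

    -- 2) rebuild the staging table's clustered index to remove the fragmentation
    alter index CIX_BIG_table_staging on dbo.BIG_table_staging rebuild;

    -- 3) switch the now-defragmented rows back into the main table
    alter table dbo.BIG_table_staging
        switch to dbo.BIG_table partition 29;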

    The other option I can think of is to create a second copy of the database, defrag that, merge in new data, and then load that up as the master. In your case, that'll take significant storage space, but a 3 TB RAID array isn't a fortune these days, if you can get that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    1) Did you check the XML query plan (or the full text plan) and verify that the query was doing partition elimination and only reading the March partition?

    2) Does the fragmentation checker actually work correctly with partitioned tables? Exactly what mechanism did you use to check frag? DBCC SHOWCONTIG doesn't work across files IIRC.

    3) The 30-minute stats update: doesn't that stats update a) beat up the already poorly-performing SAN's I/O capabilities, and/or b) possibly lock the table in some manner that could be preventing access?

    4) Did you try your query with NOLOCK hint?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

    GSquared - thanks for your reply. Yes, I was mulling over exactly that scenario on the train on the way home this evening. I think that's my only option, even though some of the more recent partitions contain 900m rows (we partition by month). It's going to be quite a challenge, but I'm sure that the heavy fragmentation is the cause of the difference in performance that I noted. It would also explain why queries on the table as a whole thrash the cr*p out of our SAN!

    SQLGuru - to answer your questions in turn:

    1) Yes, the query was only reading the March partition

    2) To show fragmentation, I used the dynamic management function sys.dm_db_index_physical_stats()

    select * from sys.dm_db_index_physical_stats
    (
        7            -- database_id
        , 1202819347 -- object_id
        , 1          -- index_id (clustered)
        , 29         -- partition_number (for March 2008)
        , 'detailed' -- mode
    )

    3) Update Stats - We've found in the past that if we don't run this, then after a day or so other queries against this table start to perform very poorly. In 2005, sp_updatestats only updates statistics for indexes whose rows have been added or modified, based on the "rowmodctr" value in sys.sysindexes, and so isn't as massive a hit on the disks as you might think (there's a quick example of the check after point 4 below). However, since you raised the point, I have noticed that the job duration is now exceeding 30 minutes (due to the sheer size of the table these days, I guess), so I think it would be sensible to schedule it to run less frequently. Thanks for that 🙂

    4) Yes, I did run the queries with NOLOCK - it made no difference to the execution time.
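
    In case it's useful, the sort of check I mean for point 3 is along these lines - sys.sysindexes is the 2000-compatibility view that still exposes rowmodctr in 2005:

    -- modifications accumulated since the last statistics update
    select name, rowmodctr
    from sys.sysindexes
    where id = object_id('dbo.BIG_table');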

    Thanks for the suggestions everyone.

  • Assuming you are copying the 900M rows to a different physical device, compare the IOStalls between the files during access for this query. It could be simply that.

    Hmm, perhaps not, since you are already reading the data out to copy it to the temp table, which should have no different IO pattern than the scan involved in the original query. Fragmentation should affect the select for populating the temp table and the select aggregate equally, one would think. A CI scan is a CI scan. Most peculiar. There IS a reason... I just don't know what it is at this point. 🙂
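
    For comparing the IO stalls, something like this works in 2005 - filter or aggregate as needed:

    -- per-file read counts and read-stall totals for the current database
    select db_name(vfs.database_id) as database_name, mf.name as logical_file,
           vfs.num_of_reads, vfs.io_stall_read_ms,
           vfs.io_stall_read_ms / nullif(vfs.num_of_reads, 0) as avg_read_stall_ms
    from sys.dm_io_virtual_file_stats(db_id(), null) as vfs
    join sys.master_files as mf
        on mf.database_id = vfs.database_id
       and mf.file_id = vfs.file_id;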

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Philip Yale (5/19/2008)


    Thanks again for the reply Jeff.

    I didn't actually ask why the copy to the heap was faster. In fact, the copy takes roughly the same length of time as the query on the BIG table (which makes sense, since they both use the clustered index on the BIG table, with some extra overhead for the copy when it comes to writing the records to the new table). What I was trying to ask was why, once I've got my 2 tables, it is so much faster to return all the March records from the small heap table than it is to return all the March records from the BIG table using a clustered index?

    Heh... yeaup... I know... that's why I said the "copy/query" to a heap was faster. What I meant (and very poorly stated) was that the combination of the copy to the heap and the query from the heap was faster than just the single query from the main table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions
