Are partitioned views not that smart?

  • -- >>>>>>>>>>>>>>>>>>>>> Statistics with "Unified" Table <<<<<<<<<<<<<<<<<<<<<<<<<<<<

    -- Table 'Stores'. Scan count 108, logical reads 210, physical reads 3, read-ahead reads 0.

    -- Table 'Cost'. Scan count 1, logical reads 437, physical reads 224, read-ahead reads 0.

    -- >>>>>>>>>>>>>>>>>>>>> Statistics with "Partitioned" Table <<<<<<<<<<<<<<<<<<<<<<<<<<<<

    -- Table 'Cost_2004Q1'. Scan count 68, logical reads 408, physical reads 107, read-ahead reads 0.

    -- Table 'Cost_2003Q4'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    -- ...........................................................................................

    -- Table 'Cost_2001Q4'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    -- Table 'Cost_2001Q3'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    -- Table 'Cost_2001Q2'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    -- Table 'Stores'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0.

    In the query against the unified table, there are MORE scans against a SMALL table (stores), than in the partitioned table one, where there are MORE scans against a LARGE table ('Cost_2004Q1'). Sure, the partitioned table has fewer logical reads, but has more scans against a larger table. The IO cost here is, 0.0129 for Unified table and 0.0900 for the partitioned table. The CPU cost is not that distorted, though. I surmise, the reason the IO cost for the Partitioned tables is high has a lot to do with the scan count. Remember the indexes are same on the partitioned table and the unified table.

    Remember that "scan" does not mean "full index scan" or "full table scan" - it just means "look."  The query plan for the partitioned table here took more individual looks than the plan for the unified table.  That's not necessarily good or bad.  The main tuning goal in tuning IO is to reduce logical IOs and to spread those IOs across all available hardware in as intelligent a manner as possible.  Your partitioned plan has fewer IOs - this is a win, as far as it goes.  There are lots of cases where you might incur lots of "scans", but this is not necessarily a bad thing!  The increased cost is more troubling to me. 

    But ... the number one thing I see here is that IOs decreased from 647 IOs to 411 IOs.  That's a big drop, and I'm not sure that I would expect that without a fundamental change in the optimizer's strategy.  It's almost like the optimizer thought it was going to get back a "large" proportion of data from the unified table, so it used fewer, larger scans.  The fact that the "unified" plan has only one scan against the large table, and that this scan had lots of IOs, tells me that this was probably a full index or table scan.  The fact that the "partitioned" plan had more scans, but they had fewer IOs, tells me that this involved repeated lookups, such as bookmark lookups (that's probably a good guess here, actually).  But which would you rather have:  a single scan that produces over 400 IOs, along with 108 other smaller scans, or one small scan of the smaller table, followed by 68 bookmark lookups at 6 IOs each?

    Having talked myself through all that, here's probably what's happening:  in the unified plan, you are doing a full table scan on the large Cost table.  You don't have an index on the AdStartDate or Region, and you aren't using either SaleDate or Item in your query, so your indexes on that table aren't worth much to this query - hence the full scan.  Scanning that table found 108 "hits" (based on Region, UPC, and AdStartDate) which were then looked up (by bookmark lookup on Store_Number) in the smaller stores table (presumably these 108 lookups found the 68 hits in the stores table I will mention in a second).  These hits were then examined for the Region=1 criteria.

    In the partitioned plan, the smaller stores table was scanned, probably filtering on Region=1, and this returned 68 "hits" which were then looked up (by bookmark on Store_Number) in the larger tables.  Presumably these 68 lookups found the 108 hits mentioned in the preceding paragraph.  These hits were then examined for the UPC and AdStartDate criteria.

    Now, obviously, I'm trying to use ESP here to figure out what was in your actual show plan.  But whether I've got all the details right or not, I think I've got the right picture, at least roughly.  Here's what I think you need to walk away with:  no matter how you slice this, or which table is used as the driving table in this join, without an index on AdStartDate or UPC, the fact is simply that your indexing just may not be optimal for this query.  This has nothing to do with whether the table is partitioned or not, but has everything to do with the selectivity of UPD and AdStartDate.  Instinctively, I would expect either or both of these columns to be more selective than Region.  The query optimizer could do better with better indexes.

    One final note:  when you say that the indexes are "the same" on the partitioned table as they are on the unified table, that's probably not quite right unless you've actually indexed the partitioned view.  If you have indexes on the individual underlying tables, that is more indexes, and that part of the partitioning overhead (every time you look at one, it's a "scan") may show up in your statistics.

    Just my 0.02,

    Chris

  • The tragedy seems to be an inherent handicap in the partitions itself. When using the single unified table, with SaleDate as the leading column in the clustered index, it just had to scan the clustered index backward ONCE, and pop the top one. In case of the partitioned tables, since there is no guarantee that the current partition (2004Q1) will have data, it cannot assume that the max saledate can be found there.

    I think that's probably right, although "reading the clustered index backward" just means reading the "big" pointers instead of the "small" pointers and then, once you get to the "biggest" page in the index, seeing what's the biggest value there.

    However, remember that parallelization is a goal of partitioning.  Here the scans probably are parallelized as much as your hardware allows, which might be a reason the optimizer wouldn't just start at the "biggest" partition and work its way backward until it found data.

    However, as I did in my previous post (the long one where I said the indexing scheme didn't match the query as well as it could), I'm going to bring this back to a design decision.  This is not a "tragedy," because this plan is still cheap and ... well, we've already discussed that.  It could be more efficient (I'll just say "could" and leave it at that), but it couldn't be much more efficient.  If I'm counting right, you've got 260,000,000 rows over 13 partitions, and this query is taking 48 IOs.  Could be better, but not a tragedy - this query should still perform reasonably well.

    Here's the design issue:  if you want this to be a low-IO query with partitioning involved, you are going to have to build an indexed view so that all the SaleDate information is in a single index.  You could even build an index on max(date).  Otherwise, you're kinda sorta complaining about a simple mismatch in what partitioning does and what your query is asking for.  In my opinion, of course.

    Are you really seeing huge differences in "wall time" for either of these queries you've posted about?  I find it hard to believe that either of them is working that badly from what you've shown so far.  Of course, in the other one I didn't even start to talk about random versus sequential IO, which would be a big factor.  But I'd have to know a lot more about both your tables and your server to start to evaluate this all the way through.  In short, there's nothing in what you're showing us that scares me, although I think that partitions may not do what you expect them to do unless the indexes are appropriately designed.

    I hope these posts make sense.  I have to leave for a meeting now, so I've been typing like a banshee, and I have no time to re-read even once for editing on either of them.  So, please read these with a grain of salt (and grace).  I hope something in here helps you!

    Cheers,

    Chris

  •  

    Chris:

    In my earlier post, I should have clarified that AdStartDate and SaleDate are aliases to the same field, and the same holds for UPC and Item. In my effort to simplify the problem, I used terms that members can easily relate to (like item and saledate), but apparently did not do a good job in using them consistently. I should clarify that there are right indexes in place for the query on hand (namely Adstartdate as the leading column in the clustered index, with UPC and store being the next).

    Though I was a bit rattled by the Scan count in the IO statistics, your analysis suggests that they are not as bad as I thought. That still leaves the question of the query costs.

    You would also notice that my partition load is imbalanced. By that I mean, though each partition covers a quarter, and have approx same number of rows, there is only ONE partition that is active for Inserts (batch inserts to be precise, there are no updates or deletes). I went with the partitions since I did not want to do DBCCs, Reindexing and Backups of the entire table, when only 1/12 th of the table is changing. It did deliver on ALL these three fronts, and everyone is pleased. I would not have minded much to give up "a few" on the performance (to be honest I was thinking it would perform better, rather sheepishly) . But just the performance degradation seems to be manifold. The degradation is more in terms of the abstract "query cost", than in terms of the "wall time". (So not every one is going after my head, not yet). The example of aggregate function I took above (max, min) is not a true representation of operations aganist the server, but something that made me illustrate my point better. I am not so sure if the indexed views would do me any good (I might just go back to my single table solution). If I am missing something here please let me know.

    In my previous post, what I considered tragedy (apologize for bad choice of words, probably "having left no good options" for the optimizer was my intention) was the fact that, the query against the partitions had no option but to do 12 more logical reads than the one against a unified table. I agree, this would have been faster if I had more disk volumes and spread the partitions on those, but refer back to the previous statement, regarding the "imbalanced nature" of the partitions. Also, I hope you agree that when we compare performance, we must compare both the operations on the same hardware. My thoughts are getting incoherent here, but the point I am trying to make is, I would probably not have each partition on a disk volume, when I cannot guarantee that all partitions would generate same load. I would probably get better performance by adding more drives to the file group containing the partitions.

    I also have a lingering suspicion that some of the query cost related issues transcend partitioned views, and just do not seem consistent from query to query on the same system. I tried running the same query against a portion of the table and the whole table (not the partitioned view). Here the query does not require any joins and does NOT involve any partitions, yet the query cost for FULL table is 0.0064 and the one against a portion of the table is 0.0814.  They query plans are identical , have same indexes and have same contiguousness.

    set statistics io on

    go

    SELECT  sum ( Cost )

    FROM workdb.dbo.Cost_2004Q1

    WHERE  (AdStartDate = '1/12/2004') AND

     (UPC = '004222287000')

    go

    SELECT  sum ( Cost )

    FROM workdb.dbo.Cost

    WHERE  (AdStartDate = '1/12/2004')AND

     (UPC = '004222287000')

    go

    go

    I guess that at this stage, it comes to trade offs between ease of management(and availability) or performance. I am also coming to realize that parallelization does not necessarily convert to better performance. I am reading about the SQL Server Yukon "native" data partitioning scheme. I hope that would strike a better balance between the operational/administrative and performance metrics.

    As always, I eagerly await your thoughts and appreciate your time.

    Thanks.

     

  • Hi I think you guys are the best people to ask about my problem with PARTITIONED Views and TAbles

     

    I have created my tables with the following check criteria against them

     

    ALTER TABLE [dbo].[IssuerTermStructuresJun04] ADD CONSTRAINT  CHK_BusinessDateJun04

          CHECK (datepart([month], businessdate) = 6 AND datepart([year], businessdate)=2004)

    The field businessDate is PART of the PK. The PK is compound - 3 fields.

    Right so that is fine. I create my view in the normal way.

     

    ALTER     VIEW IssuerTermStructuresJunToDec04AS

     SELECT *

     FROM [IssuerTermStructuresJun04]

     

     UNION ALL

     SELECT *

     FROM [IssuerTermStructuresJul04]

     

     

    ...

     

    Ok That is fine. There are no identities on the table. no defaults. no other checks on tables. There is however indexes - non clustered indexes on 5 fields. Businessdate being one of them. That shouldnt make a difference.

     

    Now trying to do an update on the view gives me the not all parititoned tables can be found error. Therefore I cannot upate/insert into the view.

     

    This is pretty bad as I need to get this working and I am SURE we have followed B.O.L. criteria.

     

    Can anyone help?

     

    Thanks in advance

Viewing 4 posts - 16 through 18 (of 18 total)

You must be logged in to reply to this topic. Login to reply