Are partitioned views not that smart?

  • I have implemented partitioned views in one of our database to address management issues on a large table (~ 500 mil rows). In essence the tables contains historical sales info for 3 years. The table is updated weekly through  a batch process. Every month the data older than 3 years is purged. The primary key (clustered) is saledate, store and item and there is a NC index on item, saledate and store. The driving factor for partitioning was that the table gets fragmented as the data is purged and added to the table and the reindexing jobs usually take in excess of 20 hours. The weekly full backups were other issue in that the full backups are large, and the entire data has to be backed up even though only change is the data added last week or purged.

    So, I implemeneted local partitioned views, divving the table by quarter into separate databases. So we backup and reindex the current quarter db weekly (which is about 10% of the full table), the inactive partitions are backed up and reindex just once. Archiving is simple, just drop the database containing the quarter that is no longer needed.

    Now, I assumed, an incidental benefit of the partitioning was 'better performance'. But in reality, it looks like the parition implementation leaved lot to be desired. To be precise, some of the queries that have the where clause like,

    saledate= @vardate and

    item=@item and

    store=@store

    scans all the partitions. I would think it should only look in the partition that has the saledate @vardate. Can any one explain why all the partitions are searched?

    Thanks.

  • It could be a bug. Since partitioned views are relatively new, I am not sold that there are not bugs in how they work. We've stayed away becuase of the complexity here and no large tables that would force me to look at those.

    If you figure somethign out, please post back here, I'd be interested to know.

  • Are you sure you meet all the requirements for the "smart" partitioned views?  Specifically, I'm talking about rules like:

    • the view must be UNION ALL (not UNION)
    • the partitioning column must be a non-computed column
    • the partitioning column must be in the same position in each of the UNION ALL subqueries
    • the partitioning column must be of compatible datatypes in each of the UNION ALL subqueries
    • the partitioning column has exacly one check constraint against it in each table
    • those check constraints result in a non-overlapping set of ranges that will be used for partitioning
    • partitioning columns cannot be computed
    • member tables of the view cannot contain index on computed columns
    • etc.

    To be honest, I have worked with updatable partitioned views, but not with nonupdatable ones.  If you can update through your view, then I certainly think you've got a point.  But if you can't, then I'm just suggesting that you find a good list of the requirements for partitioned views and make sure you've met them all (the list above is not a good one, by the way, and it may even have errors - I'm kind of shooting from the hip here).  I mean, it has been possible to write views that did UNION ALL against multiple tables for a long time - you want to make sure that you're not missing one of the requirements and simply reverting to the old behavior. 

    However, if you see showplans that demonstrate that partitioning is definitely working sometimes, then I'm with Steve - have Microsoft check it out, it might be a bug.  In other words, if you have showplans that demonstrate that some queries against the view itself only hit against the appropriate tables, then you may want to feed it up Microsoft's flagpole.  Or at least post more details here with respect to the view definition and query ... although I can't imagine any good reason for scanning the other tables if the "smart" partitioning requirements are satisfied.

    Good luck!  I hope you post something back, either more details, or whatever you might learn from Microsoft if you file a report with them.

    Cheers,

    Chris

  • My partitioned tables meet all the conditions and they are updatable. More details emerged upon a closer look at the query plan and the statistics. Even though the query plan shows all the partitions, there is a THICK pointed line in the query plan that touches only the correct partition. When I first saw it, did not think much of it, but then I ran the io statistics, the query is indeed not scanning / seeking other partitions. So far so good, however it still shows the query cost higher for partitioned tables than for the one WITHOUT partitioned tables. BTW, I created a single table with all the partitions to compare the query performance. I will post more details as they become available.

    Thanks for your feedback.

  • Any chance you could post a jpeg of that query plan?  Or is it too big?

    Thanks,

    Chris

  • I'm a newbie, and I may be wrong but...

    I believe partitioned views must have a CHECK constraint that defines the horizontal partitioning. Apparently the CHECK is actually used by the server to decide which server has the data. Without the CHECK, all servers are searched.

  • Hi John,

    Newbie or no, you are correct: you need those check constraints.  But if you looke further back in this thread you'll see that we've discussed that a bit already.  Friendly user mssql_rules has indicated that their view meets all those requirements for partitioned views, and that he can in fact update through them.  If the view was not set up properly, I do not think the updates would work at all, so I think they must have this set up correctly already.

    If you have any other ideas, though, keep chiming in, because you never know what small thing will help somebody finally figure something out!

    Chris

  •  

    I made the following tables and view which are as per requirements of Partitioned View. Strangely the queries that reffer to partitioned columns using operators such as = & in  gives proper query plans. but queries that use between , <= , < , > , >=  shows index scan on all tables.

    Strange !

    -- amit

     

    -- The script

     

    drop table  sales2000

    drop table  sales2001

    drop table  sales2002

    drop table  sales2003

    drop table  sales2004

    go

    create table sales2000

    ( yr int check (yr <= 2000) PRIMARY KEY not null , sales float )

    create table sales2001

    ( yr int check (yr = 2001) PRIMARY KEY not null ,sales float )

    create table sales2002

    ( yr int check (yr = 2002) PRIMARY KEY not null ,sales float )

    create table sales2003

    ( yr int check (yr = 2003) PRIMARY KEY not null ,sales float )

    create table sales2004

    ( yr int check (yr = 2004) PRIMARY KEY not null ,sales float )

    go

    drop view AllSales

    go

    create view AllSales

    As

    select yr, sales from sales2000

    UNION ALL

    select yr, sales from sales2001

    UNION ALL

    select yr, sales from sales2002

    UNION ALL

    select yr, sales from sales2003

    UNION ALL

    select yr, sales from sales2004

    go

    insert AllSales select 1000, 909

    insert AllSales select 1200, 809

    insert AllSales select 2000, 709

    insert AllSales select 2001, 609

    insert AllSales select 2002, 509

    insert AllSales select 2003, 409

    insert AllSales select 2004, 309

    insert AllSales select 1600, 209

    go

    select  * from AllSales where yr = 2003

    go

    select  * from AllSales where yr between 2002 and  2003

    go


    -- Amit



    "There is no 'patch' for stupidity."



    Download the Updated SQL Server 2005 Books Online.

  • I think I can easily illustrate my concerns with the simplified structure posted by Amit.

    I would create an "unpartitioned table" by combining the partitions, like this:

    select * into AllSalesTBL from AllSales

    alter table AllSalesTBL add constraint AllSalesTBL_PK Primary Key (yr).

    Now run the following two queries:

    select  * from AllSales where yr between 2002 and  2003

    go

    select  * from AllSalesTBL where yr between 2002 and  2003

    go

    If you look at the exec plan not only are all partitions accessed by the first query but the query cost for the partitioned table is (0.0320) about 5 times that of the unified table (0.00640). This is what bothers me, why would the partitoned table query have more cost than going against the unified table?

  • mssql_rules,

    I think COST is the Key word here. The fact that you have HIGER Query cost does NOT means you have HIGHER IO!! set the STATISTICS IO ON and re-run your queries you may learn someting interesting there

     

     

     

     

     


    * Noel

  • Well, in this case there is so little IO that the query cost of the parallelized version is going to be unusually impacted by the query optimization phase, and also your IO's are going to be unrealistically inefficient, since each physical IO is going to only fetch one logical row.  Ouch.  So you will get more IO's with the parallel query (at least possibly) and also get a higher query cost.  That's not a problem with parallelization, though; it's just an artifact of having small datasets and then insisting on parellelizing the access to them.

    But remember, parallelization / partitioning will not necessarily cut down on your IOs; the idea is to take the work required by the IOs of your whole workload and to spread them out over multiple processors (and IO channels, etc.).  So for single queries within a partition, you should expect performance roughly equal to a single-table solution - although the size of the partition and the table are not the same, and obviously this has an effect.

    When I test this kind of scenario with more rows, I get just what I'd expect:  almost exactly the same IO levels, and slightly higher query cost for the parallelized version (because parallelization is not, after all, free).  I don't think you're demonstrating that anything is wrong with the optimizer - at least not as far as I can tell.  Rather, you might misunderstand what partitioning can and cannot do on an individual query (which is different from what it does to an entire workload, especially if that workload is spread nicely across the partitions).

    Make sense?  Hope so.  Let me know if not...

    Chris

  • Chris,

    I am not saying that there is anything wrong with the optimizer I just thought that the way things were explained before my previous post were looking at the wrong place for small data sets, for the way queries are evaluated (execution plan First!)

    I agree 100% with your post It just happened to me once because the compilation time was equivalent to the data retrieval so the cost was high for again small dataset but the IOs were giving the expected results.

    Hope is clear


    * Noel

  • I agree with the dataset being too small to be of a lot of value in this case. But, the discrepancy between the IO costs and IO operations seems to be more pronounced (and not in favor of the Partitioned tables ) when I go against large partitions, as explained below.

     For this discussion, let me just use the IO statistics in isolation without much regard to the query since the query is the same for both plans, one uses partitioned and the other uses unified table. If you were curious, the Stores table is about 200 rows and has a StoreNumber and Region. The cost table has Saledate, Store, Item and Cost with about 20 million per partition ( partitioned by saledate and Clus Index on SaleDate, Store and Item). There is  NC index on the cost table on (Item, SaleDate and Store). The query is to select sum of itemcost for stores in a region by date,

    SELECT  sum ( Cost )

    FROM Cost  INNER JOIN

         Stores ON

         Cost.Store_Number = Stores.Store_Number

    WHERE  Stores.Region = 1 AND

     UPC = 'xxxxxxxxxxxxx' AND

     AdStartDate = '1/12/2004'

    -- >>>>>>>>>>>>>>>>>>>>> Statistics with "Unified" Table <<<<<<<<<<<<<<<<<<<<<<<<<<<<

    -- Table 'Stores'. Scan count 108, logical reads 210, physical reads 3, read-ahead reads 0.

    -- Table 'Cost'. Scan count 1, logical reads 437, physical reads 224, read-ahead reads 0.

    -- >>>>>>>>>>>>>>>>>>>>> Statistics with "Partitioned" Table <<<<<<<<<<<<<<<<<<<<<<<<<<<<

    -- Table 'Cost_2004Q1'. Scan count 68, logical reads 408, physical reads 107, read-ahead reads 0.

    -- Table 'Cost_2003Q4'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    -- ...........................................................................................

    -- Table 'Cost_2001Q4'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    -- Table 'Cost_2001Q3'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    -- Table 'Cost_2001Q2'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.

    -- Table 'Stores'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0.

    In the query against the unified table, there are MORE scans against a SMALL table (stores), than in the partitioned table one, where there are MORE scans against a LARGE table ('Cost_2004Q1'). Sure, the partitioned table has fewer logical reads, but has more scans against a larger table. The IO cost here is, 0.0129 for Unified table and 0.0900 for the partitioned table. The CPU cost is not that distorted, though. I surmise, the reason the IO cost for the Partitioned tables is high has a lot to do with the scan count. Remember the indexes are same on the partitioned table and the unified table.

    I would spend more time to get more insight into this behaviour, since I have a lot at stake with the partitioned tables. So far the input from the forum has been extremely helpful and please keep your feedback coming in (if it were not for you guys, I would have given up on the partitioning).

    thanks.

  • Hi noeld,

    I was referring more to the user named mssql_rules, actually.  Maybe I've gotten my users crossed up someplace...

    Anyway, as a sidebar:  I believe there is an attitude in the optimizer that something is not necessarily worth thorougly optimizing if it is small.  This makes sense, because the optimization itself is a pain, and if I (the optimizer) can come up with a sub-optimal but excellent query plan with minimal optimization effort, it is not necessarily worth it to come up with an absolutely optimal plan that was really, really hard to figure out.  In fact, the optimizer will take a trivial plan, if it is known to be efficient, without looking at more complex plans, even though they might be *more* efficient.  I always thought that was pretty interesting.  It does this through a phased approach, where the trivial set of possible plans are looked at first, and if one of them will work, then it will be used.  If not, then a second stage of optimization occurs where somewhat more complex access paths are considered, and so on.

    But back to the main topic, with partitions you always have to think about all of your workload, as distributed across all partitions, and as matched to available processors.  In one of the previous threads I thought we might be losing site of this basic idea.

    Cheers,

    Chris

  • Here are more details on what I found:

    Most aggregate operations on partitions take more resources, do more scans, and are slower. Here is simple example, in my partitions divided by saledate, I ran the following command:

    select max (Saledate) From PartitionedTable

    Select max (SaleDate) From UnifiedTable

    The query against the PartitionedTable did 12 times more logical reads (and 12 times more resources  and was slower ) than the one against the Unified table, pasted below:

    Table 'Cost_2004Q1'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2003Q4'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2003Q3'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2003Q2'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2003Q1'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2002Q4'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2002Q3'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2002Q2'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2002Q1'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2001Q4'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2001Q3'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Table 'Cost_2001Q2'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    Here is the one against the unified table:

    Table 'Cost'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0.

    The tragedy seems to be an inherent handicap in the partitions itself. When using the single unified table, with SaleDate as the leading column in the clustered index, it just had to scan the clustered index backward ONCE, and pop the top one. In case of the partitioned tables, since there is no guarantee that the current partition (2004Q1) will have data, it cannot assume that the max saledate can be found there. Hence had to do to backward scan of clustered index of EACH partition. That explains the 48 logical reads by partitions as opposed to the 4 logical reads for single unified table.

    The inserts and deletes seem to be performing OK. There must be some other areas where the selects are slower, since our delete and update operations against the partition are working OK, but reads flounder.

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply