SS08 SP2 w/ Compression - effectiveness varying wildly on a single table

  • Preface - I'm dealing with a 3rd-party, vendor-supplied product and architecture. Please don't berate me over the crazy design; I promise it isn't mine. 🙂

    I have a database with roughly 600 tables, 30 of them 'archive'-labeled tables used to pull data out of high-volume tables on a monthly basis. One of these archive tables is very large - currently 100 million rows and growing by roughly 8 million every 90 days. Due to the volume of transactions, the number of columns, and the number of rows, there is a huge number of indexes on this table: 13 GB of data, 110 GB of indexes. Yes, I agree, that's ridiculous. The vendor has basically written an index for each query they run against this monster table.
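    A sketch of how that data-vs-index split can be measured, assuming a placeholder table name (reserved pages are 8 KB each; index_id 1 is the clustered index, i.e., the data itself):

    -- Per-index reserved space in MB; the table name is a placeholder.
    SELECT i.index_id,
           i.name AS index_name,
           SUM(ps.reserved_page_count) * 8 / 1024.0 AS reserved_mb
    FROM sys.dm_db_partition_stats AS ps
    JOIN sys.indexes AS i
        ON i.object_id = ps.object_id
       AND i.index_id  = ps.index_id
    WHERE ps.object_id = OBJECT_ID('dbo.MyArchiveTable')
    GROUP BY i.index_id, i.name
    ORDER BY i.index_id;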

    So, that's what I've inherited. Last month I implemented page compression on this table for all index types (clustered + non-clustered). The monthly process that writes data to this table drops the non-clustered indexes, populates the data, rewrites the indexes, voilà. The most recent iteration presented roughly 800 MB of data to the table (the previous month had been skipped, so it was a larger data load). However, when I go to build and compress the indexes, the delta is tremendous. Previously, I had about 35 GB of indexes post-compression. Now, I have 70. That's no exaggeration. Interestingly, the clustered index has also doubled in size, from 3.5 GB to almost 7.

    I've also run through decompressing everything to make sure the data volume hasn't changed tremendously. It hasn't. PK_IDX: 13 GB; all non-clustered: 110 GB. Not quite 10% growth from before. Totally normal.

    How could compression suddenly lose 50% of its effectiveness? I've also run through REORGANIZE and the like, hoping for an opportunity to better align repeating data to improve compression. What's going on?

  • I have to agree, the scale of these changes does appear to be out of line.

    However, before we assume they are, it is important to point out that compression ratios can be wildly affected by the data being compressed and the method used. Have there been any updates to the software in the last 2 months? Has the content of the data changed at all in that timeframe - like moving BLOBs from TIFF to GIF format, or something along those lines?

    Also, when you rebuilt the non-clustered indexes, did you make sure that each one had compression set? I have been working with compression recently and discovered that compression is set per index. I had made an assumption, and you know what they say about those...
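    A quick way to verify is to check sys.partitions, since the setting is recorded per index and per partition. A minimal sketch, with a placeholder table name:

    -- Any index showing NONE here came back without DATA_COMPRESSION.
    SELECT i.name AS index_name,
           p.partition_number,
           p.data_compression_desc
    FROM sys.partitions AS p
    JOIN sys.indexes AS i
        ON i.object_id = p.object_id
       AND i.index_id  = p.index_id
    WHERE p.object_id = OBJECT_ID('dbo.MyArchiveTable');  -- placeholder name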

    Also, what kind of compression are you doing, row or page?
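    If you want to compare the two without committing to a full rebuild, sp_estimate_data_compression_savings will estimate either. A minimal sketch, assuming a placeholder object name:

    -- Samples the table into tempdb and estimates the compressed size;
    -- run once with 'ROW' and once with 'PAGE' to compare the two.
    EXEC sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'MyArchiveTable',  -- placeholder
         @index_id         = NULL,              -- all indexes
         @partition_number = NULL,              -- all partitions
         @data_compression = 'PAGE';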

    As a note, I have not done a great deal with compression, but I do have some experience.

    CEWII

  • I (re)wrote their data migration package to an extent - I took out all of their index operations and now perform them as before-and-after steps in an Agent job. The index metadata is written to a temp table for every index, the indexes are dropped, the job runs, the indexes are added back, and the PK index is rebuilt.

    I'm using page compression. The data shouldn't (and doesn't) change outside of this population process, so I'm using FILLFACTOR = 99, DATA_COMPRESSION = PAGE, MAXDOP = 12.
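    Each rebuild ends up roughly equivalent to the statement below; the table name is a stand-in for the real one:

    -- Rebuilds every index on the table with the options described above.
    ALTER INDEX ALL ON dbo.MyArchiveTable
    REBUILD WITH (FILLFACTOR = 99, DATA_COMPRESSION = PAGE, MAXDOP = 12);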

    No changes to software (or, more importantly, to the table schema). This table stores the following types: int, smallint, char(2), smalldatetime, money, datetime, char(1). There are several of each, but those are the datatypes involved. The indexes are just a scatter of which columns are used per query.

    I hope that answers your questions.

  • chandleya (9/22/2011)


    I (re)wrote their data migration package to an extent ... I'm using page compression ... FILLFACTOR = 99, DATA_COMPRESSION = PAGE, MAXDOP = 12 ... No changes to software (or, more importantly, to the table schema).

    Page compression (in my experience) produces the best compression ratios, so that's probably not it. Is your table a heap, or will data be written into the middle of the table in a way that might cause a lot of page splits? You mention some character columns - are they short ones, like 1-2 characters?

    While I agree table schema changes could be a significant cause, changes to the character of the data can have just as great an effect. Case in point: you go from storing large text files to storing a mix of large text files and BLOBs of JPG images. Your ability to compress the JPGs is effectively nil, and it probably has negative performance impacts. But you are saying the character of the data has not changed, so this probably isn't in play.

    CEWII

  • I listed all the data types, so no risk of BLOBs or anything messy like that. The data changes are insert-only; there are no updates, as this is reference data for a billing system.

    One column is an entry-date field, so we're just adding later dates to the end of the table. I just can't grasp why adding 8 million rows to an existing 90 million would cause 50% growth in compressed bytes without a schema change. How can I research splits effectively on such a monster table? I'm doing this work in a test environment with big hardware, and it still takes 4 hours to rebuild all the indexes due to the volume. Given that I'm dropping and rebuilding the indexes - are splits a fresh-after-build risk, especially given the compression?

  • Page splits (off the top of my head) really only occur for two reasons:

    1. A record on a page is modified, and the space required to store the row grows such that it no longer fits on the existing page.

    2. A record is inserted into the table between two existing records, as ordered by the clustered index. E.g., the clustered index has records 1 and 3 and record 2 is inserted; if there is insufficient space on the page, there will be a page split.

    You can make some assumptions about page splits based on the activity against the table and its key structure. Insert-only activity where the clustered key is ALWAYS higher than any existing row should not generate page splits.
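    If you want to confirm rather than assume, the leaf-level allocation counters in sys.dm_db_index_operational_stats give a rough proxy for split activity without another multi-hour rebuild. A minimal sketch, assuming a placeholder table name:

    -- Leaf-level allocations accumulate as pages are added or split, so
    -- reading them right after the load hints at post-build split activity.
    SELECT i.name AS index_name,
           os.leaf_allocation_count,
           os.nonleaf_allocation_count
    FROM sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID('dbo.MyArchiveTable'), NULL, NULL) AS os
    JOIN sys.indexes AS i
        ON i.object_id = os.object_id
       AND i.index_id  = os.index_id
    ORDER BY os.leaf_allocation_count DESC;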

    I agree that this growth seems really odd, and I don't have a good reason why it would be so substantial, especially given the quantity of data you have and have just added.

    What just came to mind was nullable columns - are you filling in more nullable columns than you did before?

    CEWII

  • Honestly, it's entirely random. There are 8 or 9 columns that flip between NULL and non-NULL with high frequency. But the data width is small - dates, 5-6 digit entries, etc. No big text fields or anything dangerous like that.

  • Well, I admit I'm stumped... Sorry.

    CEWII

  • Elliott Whitlow (9/23/2011)


    Well, I admit I'm stumped... Sorry.

    CEWII

    +1, but I still haven't gotten the report on the schema diffs between now and 2-3 weeks ago, when this began.

    SOMETHING changed.

  • I have confirmed that, without any doubt, the table (and index) schema has not changed in any way. I was considering opening an MS support case - but I'm not entirely convinced I'll get anywhere there, either. I also use Quest Spotlight to monitor usage and allocation, and I have been trying to make sense of the index distribution and statistics.

  • You may have hit a bug. PSS might be your only option at this point. For only a few hundred dollars, it might not be a bad idea to check it out.

  • Can you post the table definition?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • You need to compare the physical index details from before and after the most recent load.

    Can you post the results of this (preferably both before and after)?

    DECLARE @tableid int;
    SELECT @tableid = object_id FROM sys.tables WHERE name = 'myTable';
    SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(), @tableid, NULL, NULL, 'DETAILED');
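    Between the before and after runs, the columns worth watching are the ones tied directly to compression; something along these lines (reusing @tableid from above) narrows the output:

    -- Page counts, fullness, record size, and how many pages actually
    -- achieved PAGE compression at each index level.
    SELECT index_id, index_level, page_count,
           avg_page_space_used_in_percent,
           avg_record_size_in_bytes,
           compressed_page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), @tableid, NULL, NULL, 'DETAILED');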

  • See attached. There are two tabs.

    I've never used the DETAILED mode of sys.dm_db_index_physical_stats before.

  • The attachment has been sanitized of any secret sauce - this is a 3rd-party product, so I can't give away their design.
