Query on a large database

  • Jayanth_Kurup (10/22/2015)


    here is the blog link with video

    http://enabledbusinesssolutions.com/blog/does-partitioning-improve-performance/

    You're obviously not listening what people trying to tell you.

    Your example repeats that very "junior mistake" I was talking about:

    CLUSTERED INDEX ON PK COLUMNS

    ADD NON CLUSTERED INDEX ON REF_AGE COLUMN

    And to prove the advantages of partitioning you do this:

    I create two different file groups, the primary FG now contains only the rows where age is 30 and the rest of the data has been moved to another file.

    To compare apples to apples you need to:

    CREATE NON CLUSTERED INDEX ON PK COLUMNS

    ADD CLUSTERED INDEX ON REF_AGE COLUMN

    And then try to summarize data for all people in their 30th.

    I mean "Age between 30 and 39".

    Or in any other age group: 18-25, 26-35, 36-49, etc.

    How will your queries against "partitioned" table look like?

    Will you be able to demonstrate any advantage of partitioning at all?

    _____________
    Code for TallyGenerator

  • Jayanth_Kurup (10/22/2015)


    Horizontal table partitioning , link with performance characteristics document below

    http://enabledbusinesssolutions.com/blog/does-partitioning-improve-performance/

    I could be missing something on that link but it would appear that you've once again written an article about performance without providing a test data generator, DDL for the table under test, nor the partition code. I appreciate what you've tried to do but without enabling someone to repeat your findings, it's similar to claims of achieving cold fusion. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Sergiy (10/22/2015)


    Jayanth_Kurup (10/22/2015)


    Partitioning can improve performance mainly in the following ways

    -For very large tables partition elimination does provide real benefits

    With correctly chosen clustered index it does not matter how big is the table.

    Does not matter at all.

    Querying against clustered index gives you a partition matching you selection criteria.

    And that partition is dynamic - if you need a month it will be month, if you request a week it will be a week.

    If you need last week of the last month and 1st week of the current month - you'll get a partition of exactly those 2 weeks.

    Static partitioning you suggest only adds an overhead of chosing which table to query.

    -Disk placement of files - striping the data where different months are places on different disks , or if the user is querying just the latest months data placing just that data in a fast disk improves performance - probably the most important way partitioning helps improve performance.

    Users always work with the last month. At least 95% of queries always go against "the current month".

    Placing other month on other disks won't improve anything, as the data is just sitting on the disk, not taking any resources aoart from the disk space.

    If you want to really speed up queries you better distribute indexes on the same table over different drives. That's something which can really help with query performance, mainly due to parralell execution.

    Sliding windows is one of the reasons Partitioning improve ETL performance, Read performance is improved by proper disk placement.

    Only on badly designed databases.

    Good stuff here Sergiy. I agree, but correct me here if I'm wrong. If I have table partitioning setup, with two partitions and two users querying the data at similar times, one query hitting partition A and the other query hitting partition B, then you should see a performance gain over non-partitioned tables? This is from my understanding because the I/O activity is split if partition elimination is happening.

    If it's not happening, then obviously you have multiple trees, which would effectively reduce performance versus a non-partition table with one tree. Or the opposite, you have both queries hitting the same partition, then you're still better off with a proper clustered index. Correct?

    I personally have been debating on partition indexes, but have not used them yet.

  • xsevensinzx (10/22/2015)


    Good stuff here Sergiy. I agree, but correct me here if I'm wrong. If I have table partitioning setup and there are two partitions, one query hitting partition A and the other query hitting partition B, then you should see a performance gain over non-partitioned tables? This is because the I/O activity is split if partition elimination is happening. If it's not happening, then obviously you have multiple trees, which would effectively reduce performance.

    You may have a second partition implemented via a covering index. Then your "partition B" query will seek the data in that index, without a need to access the table itself.

    And yes, you can create that covering index in another file group residing on another disk, reducing IO contention.

    _____________
    Code for TallyGenerator

  • Jeff Moden (10/22/2015)


    Jayanth_Kurup (10/22/2015)


    Horizontal table partitioning , link with performance characteristics document below

    http://enabledbusinesssolutions.com/blog/does-partitioning-improve-performance/

    I could be missing something on that link but it would appear that you've once again written an article about performance without providing a test data generator, DDL for the table under test, nor the partition code. I appreciate what you've tried to do but without enabling someone to repeat your findings, it's similar to claims of achieving cold fusion. 😉

    Cold Fusion? Isn't that just dipping a pork chop in liquid nitrogen, then lick it? :w00t:

    Clustered Index on Primary Key kind of defeats the whole idea of having one from what I understand.

    I kind of got lost when running multiple threads was 'proof' that the partitioning was faster.

    There were 2 changes made at once, multiple file groups and multiple threads made at the same time.

    Should have tested between changes, then start drawing conclusions.

  • Big table resides in a single file , it can be made to sit in multiple files but without any control over which row goes into which file ( proportinal filling) , with partitioning you can split a big table into many smaller files such that year 2010 data goes into file 1 on c drive , 2011 goes into file 2 on d drvie , 2012 goes into e drive etc.

    Now when you do a select such that you need to fetch data from multiple years you end up parallely reading from c: d and e thus improving the efficient of read ahead optimization.

    Consider how RAID 1 , 5 provides good read thorughput .... it splits the data across multiple disks and is able to read from them in side by side.

    Jayanth Kurup[/url]

  • Here are the scripts , I usually create the video to show the steps to reproduce the issue, mostly for those cases where it more important to explain the concept than how i got there , the video is not meant to explain how partitioning works just that if you compare a table with and without partitioning the latter does provide benefits.

    . As on linkedin I provide the code when requested it help keep things focused on the main point.

    https://drive.google.com/folderview?id=0ByGed3QzAHjfZl9OTEdZRHZBdnc&usp=sharing

    Jayanth Kurup[/url]

  • Mainly I think the reason this discussion is dragging on is because there is a misconception of how indexes work for OLTP system vs DW systems. best practices for 1 doesn't work for the other. A simple example would in OLTP system we aim for seeks and DW we aim for scans. As mentioned in the blog, in a DW with over 100 Million rows , most of the queries will end up performing scans and the clustered index might help you get to the right part of the table it doesn't help you fetch from that part of the disk any faster. If you want to fetch faster from the disk you need partitioning.

    Yes Sergiy even when aggregating over age ranges partitioning will help. Let me see if I can explain below:-

    you need to create one partition per age , e.g age = 1 goes into one partition , age = 2 goes into another partition , age = 3 goes into another partition and so on.

    Now when you perform a where condition for age between 30 and 39 drives j, k , l , m are performing IO in order to fetch data for each of these ages parrallely.

    Take a look at the execution plan of a group by statement , why do you think SQL optimizer uses multiple threads when fetching the data , so that each thread can summarize the data for subset of the total. With partitioning we can emulate that kind of behavior at the disk level. why do we have multiple files for tempdb ?? one per CPU core ??

    So when its being said that partitioning doesn't provide any benefits it just wrong , because its being implemented wrong.

    About creating the clustered index on ref_age column. You realize that clustered column is going to sort the data on disk and in my example new ages will be added later on that will fragment the index.In my original example I was correct in using a NC on the Age column. If ref_age was an ID column or sequential in anyway I would understand the logic for making REF_AGE a Clustered index but in this example it doesn't work because i will add a million customers today with ages between 1 to 100 and load another set of customers with ages 1 to 100 tomorrow in which case my clustered index would be fragment.

    Also if my requirement is to aggregate data by 15 different columns for 15 different reports, should I drop the index on ref_age and now add an index to the new columns as per each reports requirements? No !! we achieve this by using Column store indexes which allows us to aggregate data faster by eliminating unwanted segments - a batch of 1 million rows ( columnstores way of partitioning a table) and unwanted columns.

    Either ways I have attached the scripts with sample data for 1 M please feel free to try it out.

    Jayanth Kurup[/url]

  • covering indexes and included columns are fine for OLTP systems but for DW where the base tables are already in 100 of GB or TB it doesn't become viable. Design patterns for DW and OLTP are different, normalization vs kimbal dimensional modelling.

    multiple partitions does split the IO.

    Jayanth Kurup[/url]

  • Jayanth_Kurup (10/23/2015)


    Big table resides in a single file

    Are you sure?

    with partitioning you can split a big table into many smaller files such that year 2010 data goes into file 1 on c drive , 2011 goes into file 2 on d drvie , 2012 goes into e drive etc.

    How many drivers (which actually means - how many RAIDs do you have on your server?

    50? 99? How long do people live?

    Now when you do a select such that you need to fetch data from multiple years you end up parallely reading from c: d and e thus improving the efficient of read ahead optimization.

    I'm not sure there are enough letters in English alphabet to give it to each AGE in the database.

    Consider how RAID 1 , 5 provides good read thorughput .... it splits the data across multiple disks and is able to read from them in side by side.

    That's right, but it has nothing to do with your case.

    The query for 30 year of AGE will go against the single file, sitting on a single drive. No parallel execution at all.

    When propely clustered and indexed query quite possible will use parallel execution.

    _____________
    Code for TallyGenerator

  • Hi Sergiy

    I can demonstrate what I am explaining in my video , no gimmicks just plain tables and rows.If you have a few minutes , please add me on skype and I can demo the concept.

    my skypeid is Jayanth (dot) Kurup at hotmail dot com

    More than happy to demo the concept if it helps , Also the sample script and data are shared on google drive have you tried it?

    PS:Big table resides in a single file , it can be made to sit in multiple files but without any control over which row goes into which file ( proportinal filling) ,

    if you are going to quote me please do so completely.

    Jayanth Kurup[/url]

  • Jayanth_Kurup (10/23/2015)


    Mainly I think the reason this discussion is dragging on is because there is a misconception of how indexes work for OLTP system vs DW systems.

    The main reason of discussion is domination lack of understanding how properly design databases.

    best practices for 1 doesn't work for the other. A simple example would in OLTP system we aim for seeks and DW we aim for scans.

    Who told you that?

    As mentioned in the blog, in a DW with over 100 Million rows , most of the queries will end up performing scans

    Only because you failed to use correct indexing strategy.

    Create an appropriate clustered index and you'll see Clustered Index seek in the execution plan, with overall performance not worse than wit partitioning.

    Yes Sergiy even when aggregating over age ranges partitioning will help. Let me see if I can explain below:-

    you need to create one partition per age , e.g age = 1 goes into one partition , age = 2 goes into another partition , age = 3 goes into another partition and so on.

    Now when you perform a where condition for age between 30 and 39 drives j, k , l , m are performing IO in order to fetch data for each of these ages parrallely.

    It must be quite easy foryou to prove the point.

    I'm sure yoiu've done it before.

    Can you show how a query for an age group for ages between @m and @N should look like?

    Take a look at the execution plan of a group by statement , why do you think SQL optimizer uses multiple threads when fetching the data , so that each thread can summarize the data for subset of the total. With partitioning we can emulate that kind of behavior at the disk level. why do we have multiple files for tempdb ?? one per CPU core ??

    Why do we need to emulate this behaviour when SQL Server already behaves like that, without any emulation?

    About creating the clustered index on ref_age column. You realize that clustered column is going to sort the data on disk and in my example new ages will be added later on that will fragment the index.In my original example I was correct in using a NC on the Age column. If ref_age was an ID column or sequential in anyway I would understand the logic for making REF_AGE a Clustered index but in this example it doesn't work because i will add a million customers today with ages between 1 to 100 and load another set of customers with ages 1 to 100 tomorrow in which case my clustered index would be fragment.

    Oh, man.

    One advice - try it.

    Do what the best professionals in the neighbourhood tell you (I don't mean myself.

    Take the same table you used for you blog, create a clustered index on REF_AGE run the query and post the outcome.

    Should not be too difficult.

    _____________
    Code for TallyGenerator

  • Hi Sergiy

    Here is a screen shot with clustered index on the monolithic table on REF_AGE column as requested. Notice that it took 17 sec , just the same amount of time my partitioned heap table took without any indexes ( as shown in the video).

    So far I have given scripts , a video and a details explanation of how it works , and all I have got in return in up in the air statements like "how long do people live".

    Normally I would have ignored the posts after its starts getting rude but in this case I want to point out that

    "I am sorry I didnt know" is a perfectly acceptable answer and there no matter how big we think we are it happens to the best of us. I know I am wrong often enough to admit it.

    I promise to provide you with all the data you need as longs as we can keep it professional.

    Hope we can continue to debate this without things getting ugly.

    Jay

    Jayanth Kurup[/url]

  • Jayanth_Kurup (10/23/2015)


    why do we have multiple files for tempdb ?? one per CPU core ??

    If someone has one CPU per core, it's because they don't understand what they're doing. That is NOT a good thing to do, not in modern servers. It creates way too many files, leading to performance degradation.

    We add extra tempDB files because of contention on the allocation pages in the TempDB files. Nothing at all to do with IO and not related in any way to partitioning. When a temp table is created, SQL needs an exclusive latch on one or more of the allocation pages (GAM, SGAM, PFS). When there are lots of sessions creating temp tables, that can be a bottleneck, and so the guideline is to add extra files (I like to start with 4 total and if I still see allocation contention go to 8) so that the creating of temp tables can be spread across files and so there's less contention on the allocation pages.

    It's not related to sessions running in parallel and it's not related in any way to partitioning

    You realize that clustered column is going to sort the data on disk

    Clustered indexes (or indexes of any type) don't enforce physical storage order within the file, much less on the drive.

    As for partitioning and performance improvements:

    https://www.simple-talk.com/sql/database-administration/gail-shaws-sql-server-howlers/

    http://qa.sqlservercentral.com/articles/Performance+and+Tuning/126532/

    It can, but it's not automatic and it requires that the queries are written in such a way as to take advantage of the partitioning. If not, it can easily degrade performance.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Jayanth_Kurup (10/23/2015)


    More than happy to demo the concept if it helps , Also the sample script and data are shared on google drive have you tried it?

    I built a whole system on it.

    Global system for a global international corporation.

    Original architecture assumed moving 3 and and more years old data to "Archive" storage. Pretty much that horisontal partitioning you suggest to use.

    Because after 2 years in production the system did not show any sign of performance degradation, the "Archive" project was postponed.

    After 6 years in Production it was ditched for good.

    Correct indexing strategy proved the partitioning absolutely useless.

    _____________
    Code for TallyGenerator

Viewing 15 posts - 31 through 45 (of 104 total)

You must be logged in to reply to this topic. Login to reply