My Own Custom Table Partitioning

  • It seems to me any custom partitioning method for a new system which isn't even built yet is premature optimization. Perhaps learning the techniques of partitioning is best done before the system gets big, but I would stress that you should aim to keep things as simple as possible. Why is a table being big assumed to be a bad thing? Surely there are caching, queueing, and batch-update techniques that can keep a large table manageable as usage increases. Even built-in partitioning, it seems to me, has little advantage for an application that accesses one row at a time, randomly, connected to a single logical disk system. Partitioning is mostly an advantage if you have a lot of I/O (many disks or logical disks) to spread the partitions among, or if you are able to keep the more frequently hit partitions in faster storage, I would think. The fact that each partitioned b-tree is a little smaller may help some applications, but that is offset by the need for the query optimizer to pick among the partitions with each query.
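
    For reference, SQL Server's built-in range partitioning (2005 and later) looks roughly like the sketch below. This is a minimal illustration only; the table, column names, and boundary values are all hypothetical:

    -- A minimal sketch of built-in range partitioning. Each boundary
    -- date starts a new partition (RANGE RIGHT).
    CREATE PARTITION FUNCTION pfByYear (datetime)
    AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01', '2009-01-01');

    -- Everything maps to one filegroup here; with many disks you would
    -- map partitions to several filegroups to spread the I/O.
    CREATE PARTITION SCHEME psByYear
    AS PARTITION pfByYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.BigTable
    (
        Row_ID    bigint        NOT NULL,
        Posted_On datetime      NOT NULL,
        Body      varchar(5000) NULL
    ) ON psByYear (Posted_On);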

  • Aaron West (6/9/2008)


    It seems to me any custom partitioning method for a new system which isn't even built yet is premature optimization.

    That is a very true statement. My plan is not to partition the tables at the beginning. Instead, I want to have everything designed and coded so that, if/when I need to partition the tables later on down the road, I don't have to completely re-code my Data Access Layer.

    Also, many of my tables will have row counts that exceed the max value of an int, and I'd rather not have to use bigint for my ID fields. If possible I'd like to keep my tables under 10 million rows.
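
    One common way to keep a Data Access Layer insulated from a later split is a local partitioned view: identical tables with non-overlapping CHECK constraints, unioned behind a view that callers query. A minimal sketch, with hypothetical names and ID ranges:

    -- Manual partitioning behind a view: the DAL only ever sees
    -- dbo.Comments, so adding a Comments_3 later needs no recoding.
    CREATE TABLE dbo.Comments_1
    (
        Comment_ID bigint NOT NULL PRIMARY KEY
            CHECK (Comment_ID BETWEEN 1 AND 100000000),
        User_ID    int NOT NULL,
        Body       varchar(5000) NULL
    );

    CREATE TABLE dbo.Comments_2
    (
        Comment_ID bigint NOT NULL PRIMARY KEY
            CHECK (Comment_ID BETWEEN 100000001 AND 200000000),
        User_ID    int NOT NULL,
        Body       varchar(5000) NULL
    );
    GO

    CREATE VIEW dbo.Comments
    AS
    SELECT Comment_ID, User_ID, Body FROM dbo.Comments_1
    UNION ALL
    SELECT Comment_ID, User_ID, Body FROM dbo.Comments_2;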

  • Just curious... Max size for INT is over 2 billion... that should handle your 10 million rows quite nicely for 214 tables. Why would you need unique numbers for more than that?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (6/10/2008)


    Max size for INT is over 2 billion... that should handle your 10 million rows quite nicely.

    Yes it would. However, there will be over 2 billion rows per table. The 100 million rows (yes 100, not 10, sorry about the typo) is the max number of rows I would like to have in each table. (That is why I'm planning to partition it.)

  • Then, BigInt should do the trick.

    --Jeff Moden

  • Jeff Moden (6/10/2008)


    Then, BigInt should do the trick.

    Yes it would. But from what I've read, having that many records in one unpartitioned table isn't a good thing for a highly available web application, even when indexed correctly. Also, my C# code would then need to use Int64 for all references to records, which takes twice the space of a standard Int32. (When dealing with web services, every byte counts!)

    Having said all that, how well would a table of 10 billion rows perform? (The tables average 10 fields, ranging from bit to varchar(5000).) As mentioned previously, I will be doing some complex selects against the table and then paging those results. That is why I was planning on trying to limit the row count to about 100 million per table. (I will never need to aggregate records from different partitions.)
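
    For the "paging those results" part, a ROW_NUMBER() window (available from SQL Server 2005 on) is one common pattern. A minimal sketch, assuming a hypothetical dbo.Comments table indexed on (User_ID, Comment_ID):

    -- Page through one user's comments, newest first.
    DECLARE @User_ID  int;
    DECLARE @PageSize int;
    DECLARE @PageNum  int;
    SET @User_ID  = 42;  -- hypothetical values
    SET @PageSize = 20;
    SET @PageNum  = 3;

    WITH Ordered AS
    (
        SELECT Comment_ID, Body,
               ROW_NUMBER() OVER (ORDER BY Comment_ID DESC) AS rn
        FROM dbo.Comments
        WHERE User_ID = @User_ID
    )
    SELECT Comment_ID, Body
    FROM Ordered
    WHERE rn BETWEEN (@PageNum - 1) * @PageSize + 1
                 AND @PageNum * @PageSize;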

  • As a database designer/developer one of the most difficult things to do is to differentiate between what the customer/manager SAYS you need to do and what the product/business REALLY needs you to do.

    In your first post you say that there will be 5 million users, and in your last post you talk about a table with 10 billion rows. That is 2,000 rows per user. Is that really what you expect? You haven't described what your product is or how it is expected to work well enough for anyone to recommend what you should do, but my nose is really twitching and telling me that it would be a good idea to take another, more critical look at the requirements. In my experience, poorly thought-out and unchallenged requirements doom more projects than poor coding or implementation.

    Just my opinion.

  • Paul Lach (6/10/2008)


    As a database designer/developer one of the most difficult things to do is to differentiate between what the customer/manager SAYS you need to do and what the product/business REALLY needs you to do.

    Amen to that!

    Regarding the 2,000 records per user: it is very possible, yes. However, like you mentioned, I am a developer, not a business analyst. I can only assume their estimates are correct.

    I do apologize for not going into detail about the system that I am designing; unfortunately I am not at liberty to do so. However, the blog example I used draws a very similar parallel.

    Say a user posts one blog entry a day. That is ~365 a year, or ~1,800 posts after 5 years. I'm not sure how many individuals will actually post once a day, but you get the picture. And that is only the blog posts themselves. Now say each post can have comments, and that same user gets 5 comments per blog post: those 1,800 posts now carry 9,000 comments. *IF* there really are 5 million users, that comment table *COULD* hold 45 billion records in 5 years, or 9 billion records a year.

    But even if there are only 500k users, that comment table *could* still reach 4.5 billion records in 5 years, or 900 million records per year.

    Hopefully that gives enough insight into the sort of thing I am trying to tackle. (Without my having to break the NDA.)

  • A common error in estimating something (the required size of a database, new product sales, world population, etc.) is to take a series of worst-case or best-case values and then MULTIPLY them together. The multiplicative effect of the error in each value makes the result horribly wrong.

    Suppose your company were a manufacturer and not a web site. If they were coming out with a new product, do you think they would make decisions about how many plants to build, where to build them, and how big each plant should be by just sitting around a table and guessing? Or would they do test marketing to get more precise estimates of how many consumers would try their product and how many would become repeat users (i.e., would there be closer to 500,000 or 5,000,000 users, and how often and for how long would they post messages)?

    It sounds like your employer is coming out with a new product and using best-case estimates for 5 years to size the database (no growth curve, no obsolescence, no test market). If that is what they want you to do and you can't talk them out of it, then keep going. However, I wouldn't spend a lot of time worrying about partitioning until you see what the volume looks like after 3 months, 6 months, and a year. Instead I would spend my time making sure that the design (tables, indexes, and stored procedures) is as efficient as possible. In my experience, slow databases are due to poor design, NOT the amount of data.

  • I agree with Paul's assessment... it's a good thing to plan on that type of scalability, always... but I see no mention of "clustered" servers or the other things that would make such an endeavor a success.

    --Jeff Moden

  • Jeff Moden (6/10/2008)


    I agree with Paul's assessment... it's a good thing to plan on that type of scalability, always... but I see no mention of "clustered" servers or the other things that would make such an endeavor a success.

    We will eventually be using clustered servers, I would imagine. (I'm a developer, not a DBA; we will have one of those by the time we get to that point.) Who knows how many boxes we will actually end up needing. But since I am developing the code from the beginning, I want to make sure it is designed in such a way that, when the day does come to start partitioning out tables and adding more servers, it will be a smooth(er) process.

    So, regarding one of my questions, how well does a table perform with over 2 billion rows? Take the comment table, for example. It would probably have 6 fields, the only large one being a varchar(5000). If it can still perform complex queries quickly, then I might just use bigint and not worry about any partitioning until I need to.
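
    As a back-of-the-envelope check (all figures here are assumptions): if the average row were around 500 bytes, an 8 KB page would hold roughly 16 rows, so 2 billion rows is on the order of 125 million pages, or roughly 1 TB before indexes. The real number depends heavily on how full that varchar(5000) column actually is.

    -- Rough sizing arithmetic; every input here is an assumption.
    SELECT CAST(2000000000 AS bigint) / 16                    AS EstimatedPages,
           CAST(2000000000 AS bigint) / 16 * 8 / 1024 / 1024  AS EstimatedGB;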

  • ammon (6/10/2008)


    So, regarding one of my questions, how well does a table perform with over 2 billion rows?

    Ya got me... I've worked with tables as large as 200 million rows, but that's an order of magnitude smaller than 2 billion. Performance is really good on 200 million rows provided you have the correct indexes and the related T-SQL is written correctly to actually use them. I'd imagine the same would hold true for 2 billion rows... especially if some form of logical partitioning that somewhat matches the related queries were used.
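
    A hedged illustration of "the correct indexes" for a table like this: if the common queries filter on User_ID, a nonclustered index keyed on it lets the optimizer seek straight to the matching rows instead of scanning billions of them. Names are hypothetical:

    -- Supports WHERE User_ID = @x ORDER BY Comment_ID with a seek.
    CREATE NONCLUSTERED INDEX IX_Comments_UserID
        ON dbo.Comments (User_ID, Comment_ID);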

    --Jeff Moden

  • I know it's redundant, but I think that my earlier posts really bear repeating here:

    rbarryyoung (6/2/2008)


    I'll just add that you really should have an experienced DBA or DB developer involved at this point.

    and...

    rbarryyoung (6/2/2008)


    ammon (6/2/2008)


    I couldn't agree with you more!

    But we are a small startup company and there just isn't any $$ for a DBA. 🙁

    Right. In that case I would recommend that you contract the services of an experienced professional DBA consultant to do Architecture and Design reviews and some mentoring.

    Yeah, I know, there's no money for that either.

    The problem is that mistakes and oversights at this stage can prove to be very costly or limiting down the road. And the things that are going to bite you are not the things that you are thinking of now, because you are smart enough to either figure them out or to come to a place like this and ask about them.

    Rather, the real problem will be the things that don't even occur to you, because you do not yet know enough to realize how important they are (or will be). And we cannot help you much with those now, because you will not know to ask us, and we do not have enough information (nor is it practical in a forum) to warn you; it's just too general.

    -- RBarryYoung, (302) 375-0451
    blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • With respect to your question about how well a 2 billion row table will perform, the answer is "great" unless you do a poor job of designing the database and the stored procedures that access it. SQL Server stores data in "pages" that are 8 KB in size. If you want to retrieve a record and know the primary key value for the record you want, then SQL Server can very quickly figure out the page in which the record resides, retrieve that page, and get the record. HOWEVER, if you only know that the record you want has a varchar column with the word "stupid" or "idiotic" somewhere in its contents, and another varchar column with the word "very" or "incredibly" somewhere in its contents, then SQL Server will have to walk through every record in the table looking for it, and that will be EXTREMELY slow.
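
    Paul's two cases, sketched against a hypothetical dbo.Comments table with varchar columns Comment_Text and Other_Text:

    -- Fast: a point lookup on the primary key becomes an index seek
    -- that touches only a handful of pages.
    SELECT Comment_Text
    FROM dbo.Comments
    WHERE Comment_ID = 123456789;

    -- Slow: leading-wildcard LIKE predicates cannot use an index, so
    -- SQL Server must examine every row in the table.
    SELECT Comment_Text
    FROM dbo.Comments
    WHERE Comment_Text LIKE '%stupid%'
      AND Other_Text   LIKE '%very%';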

    With respect to the idea of designing your new web site without hiring or contracting an experienced and talented database designer, let me give you an analogy that I hope will demonstrate the stupidity of doing so:

    Suppose there was a new company that was formed with the idea of creating a new car that would get great gas mileage by being very lightweight and aerodynamic. The owners had some great ideas and were very excited to get started so they hired some engineers to design the exterior of the car and some engineers to design the interior of the car, but they decided that they didn't have enough money to hire any engineers to design the engine and drive train. After all, the engineers they hired to design the exterior and interior could probably design an engine that would be "good enough".

    The web site is the "exterior" and "interior" that the users will see and work with, but the database is the engine that will get the user where they want to go. Failure to design a good engine will quickly lead to dissatisfied customers and failure of the company.

    Just my opinion (and experience!)

  • Don't worry, I'm not going to be doing any 'LIKE' searches on a varchar field. All the queries will search on a User_ID field. (Each row will have its own primary key ID field, a bigint.) And of course the User_ID field will be indexed.

    I completely understand your analogy. And I would love to contract a DBA for consulting, but there isn't much $$. Having said that, I am designing my application in such a way that the actual RDBMS can be swapped out easily. (I overheard someone once mention maybe migrating to Oracle down the road.) So while your point is VERY valid, my code will be designed to easily adapt to any changes made in SQL Server down the road. (Like when we hire a DBA.)

    Does anyone here want to offer their consulting services cheap? :laugh:
