An Inconceivable Scale

  • Sean Redmond (12/9/2015)


    Part of this increase in data size is due to our sloppiness in database design. We throw GUIDs in everywhere, partly because it is in fashion, partly because it makes development easier [1] and partly because we can. We use Unicode by default because it saves us having to worry about code pages. We use datetime2(7) when a simple date would do. We use int and bigint when tinyint would do, e.g. for an enumeration table with the list of states in it. Size-wise it makes an almost negligible difference to the enum table itself, but in the table with 120 million rows, switching that one column from int to tinyint saves roughly 340MB (3 bytes per row), and that is before any indexes that use it are taken into consideration. But then storage and resources are cheap, until they are not.
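
    A back-of-the-envelope sketch of that arithmetic in T-SQL (the type widths are SQL Server's documented fixed sizes; the row count is the 120 million from above):

        -- Fixed-width storage: tinyint = 1 byte, int = 4, bigint = 8,
        -- date = 3, datetime2(7) = 8, uniqueidentifier (GUID) = 16.
        -- Approximate per-column saving across 120 million rows:
        SELECT CAST(120000000 AS bigint) * (4 - 1) / (1024 * 1024) AS mb_saved_int_to_tinyint,   -- 343
               CAST(120000000 AS bigint) * (8 - 3) / (1024 * 1024) AS mb_saved_datetime2_to_date; -- 572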

    We are no longer concerned with compactness and efficient resource management, which I find to be a pity.

    [1] We might be developing a web service that'll be using this table in the future, so we need to put in GUIDs now!

    I don't see GUIDs everywhere, but some frameworks use them by default. They are useful when you need to generate the key on the client, or when you need something like peer-to-peer replication. However, I wouldn't think they should be everywhere.

    Unicode is hard. If there is any chance you'll need other languages, I think Unicode makes sense; retrofitting it later is crazy. If you plan to be English-only (or Spanish, or anything else that fits in an 8-bit character set), then there's certainly no reason for it.
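
    A quick illustration of the storage difference in T-SQL (the sample string is arbitrary): varchar stores one byte per character, while nvarchar stores two.

        -- 'database' is an arbitrary 8-character sample string
        SELECT DATALENGTH('database')  AS varchar_bytes,   -- 8
               DATALENGTH(N'database') AS nvarchar_bytes;  -- 16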

  • william-700725 (12/9/2015)


    I would suggest that the growth in the volume of data stored does not reflect a growth in actual information.

    If you look, how much of that growth is a result of our Xerox mentality? We make endless copies of the same data, sharing it out, stashing copies here and there just in case, sometimes rearranging it for yet another effort at divining information from it. For example, even with the huge volume of sales transactions taking place every day, there is a finite amount of actual information generated by those transactions, and it is orders of magnitude smaller than the massive volume of data generated, passed around and archived by the stores, banks, credit-card vendors, fraud-detection services, Department of Treasury, and whatever other parties I haven't thought of.

    The other driving factor is our collective obsession with the idea that all data has value and should be preserved until we can find a way to derive that value. The harsh reality is that most of what we are preserving is as full of noise as surveillance video -- minutes (or mere seconds) of key information buried in thousands of hours of the camera watching people do ordinary things.

    Stop and imagine for a moment how long you could get people to sit and nod their heads to a straight-faced presentation about the value of a sensor-enhanced Roomba which would map how much lint and dust was pulled from each square inch of the room and relay that data to a cloud-based cleaning analysis service using highly optimized proprietary algorithms to generate an adaptive optimized route for cleaning your house. Then think about how much of our data is about as vital as ( room, x, y, dustballsize ).

    I don't have fundamental problems with gathering some of this data. Now keeping it, that's perhaps something we don't want to do. However we are still learning here. What do we keep and for how long?

  • aalcala (12/9/2015)


    Steve,

    It has been a long time since I was in school, but I thought an Exabyte would be 3 orders of magnitude larger than a petabyte.

    I thought it was

    MB - 1000^2

    GB - 1000^3

    TB - 1000^4

    PB - 1000^5

    EB - 1000^6

    ZB - 1000^7

    I guess that is 3 orders of magnitude, in powers of 10.
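
    Worked out: 1 EB = 1000^6 bytes = 1,000 × 1000^5 bytes = 1,000 PB, a factor of 10^3, i.e. three orders of magnitude.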

    Thanks for the note.

  • If the promise of quantum computing is realised, a great deal of data might be stored in a series of superimposed states. It may not be an exaggeration to say all the existing data storage on the planet might be stored on a single small quantum storage device.

  • robert.sterbal 56890 (12/9/2015)


    Because of the Alphabet company, my 5-year-old daughter often asks me what a google plus some number is.

    "googol"

    Not trying to be pedantic, but I do like to help preserve the original, non-corporate common noun that Edward Kasner's 9-year-old nephew coined (https://en.wikipedia.org/wiki/Edward_Kasner).

    You can tell your daughter that someone around her age "created" googol!

    Rich

  • Steve Jones - SSC Editor (12/9/2015)


    I don't have fundamental problems with gathering some of this data. Now keeping it, that's perhaps something we don't want to do. However we are still learning here. What do we keep and for how long?

    That depends, Steve.

    How would you like to star in a reality TV show titled "Useless Data Hoarders"?

  • 41,000 Blu-ray discs, billions of MB-sized pictures... that's amazing. All this data makes the business intelligence and data analytics fields look very promising for the future. Someone needs to sift through all of this noise and extract the value from it.

  • william-700725 (12/9/2015)


    Steve Jones - SSC Editor (12/9/2015)


    I don't have fundamental problems with gathering some of this data. Now keeping it, that's perhaps something we don't want to do. However we are still learning here. What do we keep and for how long?

    That depends, Steve.

    How would you like to star in a reality TV show titled "Useless Data Hoarders"?

    Not at all. I really only care about pictures/video for me. A few words, but I tend to keep those together.

    However, in business, how much BI data? No idea. I think maybe a year or so in detail before we roll up.

    How much performance data? I've leaned towards no more than a month in detail, but then summaries.

    Sensor data? Video surveillance? I think you need some limits, and more importantly, your systems should allow for the trim / archive of some of this data.

    We will have more copies. We have databases, data warehouses, reporting systems, and more, so we have copies. But we need archival strategies; a rough sketch of one roll-up-and-trim pattern follows below.
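
    A minimal sketch of that pattern in T-SQL (the table and column names are hypothetical, and the 30-day window is just the "month in detail" suggested above):

        BEGIN TRANSACTION;

        -- dbo.PerfDetail and dbo.PerfSummary are hypothetical tables.
        -- Roll detail older than 30 days up into a daily summary...
        INSERT INTO dbo.PerfSummary (MetricDate, MetricName, AvgValue, MaxValue, SampleCount)
        SELECT CAST(SampleTime AS date), MetricName, AVG(Value), MAX(Value), COUNT(*)
        FROM dbo.PerfDetail
        WHERE SampleTime < DATEADD(day, -30, SYSDATETIME())
        GROUP BY CAST(SampleTime AS date), MetricName;

        -- ...then trim the detail rows that have been summarised.
        DELETE FROM dbo.PerfDetail
        WHERE SampleTime < DATEADD(day, -30, SYSDATETIME());

        COMMIT TRANSACTION;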

  • My time interval for performance detail is 2x to 4x the time the issue has been around. With our systems the detail is kept for about a week.

    The key measure is some expected economic value you hope to derive. Right now, we are still keeping too little data.

    I'd like to see a visualization or a table of the amount of data broken down into types, usage and growth rate.

    412-977-3526 call/text

  • robert.sterbal 56890 (12/9/2015)


    The key measure is some expected economic value you hope to derive. Right now, we are still keeping too little data.

    I doubt that.

    I've lost count of how many times I've either witnessed (or been a reluctant party to) an effort which results in spending $10 to save a dime -- chasing the end of the rainbow because somebody *swears* we'll find a pot of gold there, we just need to buy these new shoes so that we can run fast enough to get there.

  • A TB is no big deal, with many of us having that much storage in a desktop or laptop. In fact, I really think I'll see a TB on my phone sometime before the end of this decade.

    Stuff like video and internet device telemetry will probably make up the bulk of data going forward. As for personal-use data (photos, email, MP3s), I think that has been expanding at a fairly regular rate. I'm not seeing TB-sized storage on cell phones yet. One bottleneck is cell phone carrier charges for data.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Iwas Bornready (12/9/2015)


    Just bought the grandson an Xbox One with 1 TB. How much space do you really need to store a game?

    Considering most AAA games are 10GB and above per game, you need a lot. That does not account for any downloadable content he or she may purchase in-game.

    I worked in the AAA game industry for 7 years. Our average game was 30 GB, because of how big the assets had to be to ensure the 3D models, sound and everything else were extremely high quality. That is part of why those same games cost $60 USD each, with additional paywalls on top.

    If you think about the data stored for the players of those games, it gets even crazier. Databases in these games have to be fierce.
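
    As rough arithmetic: at 30 GB per title, a 1 TB drive holds about 1,000 GB / 30 GB ≈ 33 games, before patches and DLC eat into it.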

  • I would be happy to have a terabyte on my phone. My photos take up at least 300 GB at this point, and I would just sideload them if I had the space.

    4K video will also push the high end up pretty quickly.

    412-977-3526 call/text

  • GoofyGuy (12/9/2015)


    If the promise of quantum computing is realised, a great deal of data might be stored in a series of superimposed states. It may not be an exaggeration to say all the existing data storage on the planet might be stored on a single small quantum storage device.

    It's hard enough to sell the public the idea of storing their personal data in the cloud. Now try convincing them that trans-dimensional data storage is safe.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Steve Jones - SSC Editor (12/9/2015)


    william-700725 (12/9/2015)


    I would suggest that the growth in the volume of data stored does not reflect a growth in actual information.

    If you look, how much of that growth is a result of our Xerox mentality? We make endless copies of the same data, sharing it out, stashing copies here and there just in case, sometimes rearranging it for yet another effort at divining information from it. For example, even with the huge volume of sales transactions taking place every day, there is a finite amount of actual information generated by those transactions, and it is orders of magnitude smaller than the massive volume of data generated, passed around and archived by the stores, banks, credit-card vendors, fraud-detection services, Department of Treasury, and whatever other parties I haven't thought of.

    The other driving factor is our collective obsession with the idea that all data has value and should be preserved until we can find a way to derive that value. The harsh reality is that most of what we are preserving is as full of noise as surveillance video -- minutes (or mere seconds) of key information buried in thousands of hours of the camera watching people do ordinary things.

    Stop and imagine for a moment how long you could get people to sit and nod their heads to a straight-faced presentation about the value of a sensor-enhanced Roomba which would map how much lint and dust was pulled from each square inch of the room and relay that data to a cloud-based cleaning analysis service using highly optimized proprietary algorithms to generate an adaptive optimized route for cleaning your house. Then think about how much of our data is about as vital as ( room, x, y, dustballsize ).

    I don't have fundamental problems with gathering some of this data. Now keeping it, that's perhaps something we don't want to do. However we are still learning here. What do we keep and for how long?

    That's generally the problem I run into. Every business user wants to keep all the data forever, regardless of whether it is used or not. It's the perception that it's there when they need it, and that's what we pay for.

    I feel that, as database professionals, we have to be firm about data storage in these areas, even to the point of risking our jobs. We have to make it clear that, "No, we cannot just store a billion records for the sake of storing them..."

    Then again, that's why alternatives to the traditional RDBMS are becoming popular. The cost per TB for NoSQL solutions is a lot lower than for, say, SQL Server. Storing more and more data in SQL Server does not scale as well as platforms that have the added benefit of distributed processing across multiple machines.
