Is Big Data good for Data Professionals?

  • Unfortunately, I firmly believe from what I saw during my years in IT that 'big data' is often the result of system design that is incomplete and lacking in foresight.  With hardware becoming less and less expensive and at the same time more and more powerful, it was easy to overlook the eventual need to archive, if not dispose of, the accumulating volume.  We must always bear in mind that the older data becomes, the less value it has for analysis.  A truly good system design will include an understanding of how best to summarize 'data' into meaningful 'information' without chewing up the terabytes of storage needed to keep the raw formats.

    For instance, I have a life insurance policy that was taken out on my life in 1943 for a face value of $1000.00.  The annual premium was, and is, $10.44.  Each year the company has paid a dividend into a deposit account and paid interest on that deposit.

    Now, on my side, it's not a huge deal to maintain the annual detail for the 74-year history, if I so desire.  Most likely I will never actually look at this information, but will continue to look at the CURRENT dividend, interest, and total value.  If I wish, I am free to store whatever data I desire.

    However, on the insurer's side, it makes absolutely no sense for them to have accumulated detail for 74 premium payments, 74 annual dividends, and 74 years of interest payments on deposits - in addition to the likely thousands and thousands of additional policies they have issued, especially those which are no longer in effect.  I would expect that if I contacted the company, they would surely tell me they no longer have the historic data.  While there may be analytical value in the effective date and the insured's birth date (from which age may be calculated), the premium payment dates and the dividend and interest detail are of little if any value.

    While this is obvious to you as you look at this explanation now, you would probably be shocked to discover how many times in my IT years I found that there was absolutely no thought given to exactly this situation.  This then leads to the scenario in which IT has to try to explain to users why they now have to make detail disappear that 'we have always had' and that 'we might need', even though it has not been looked at in decades.  Maybe most of you don't remember the old variable-length flat file record where it was common design to populate repeating data into repeating segments within a single record.  Imagine my current record consisting of a base record and 74 repeated historical segments.  Yes, we didn't have relational technology in those days.
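
    As a rough sketch of what I mean by summarizing detail into information before purging it (the table and column names below are hypothetical, not from any real policy system):

      -- Roll old detail up into one summary row per policy,
      -- then purge the raw detail outside the retention window.
      INSERT INTO PolicySummary
             (PolicyID, PremiumsPaid, DividendsPaid, InterestPaid, LastSummarizedYear)
      SELECT PolicyID,
             SUM(PremiumAmount),
             SUM(DividendAmount),
             SUM(InterestAmount),
             MAX(TransactionYear)
      FROM   PolicyTransactionDetail
      WHERE  TransactionYear < YEAR(GETDATE()) - 7   -- keep, say, seven years of raw detail
      GROUP BY PolicyID;

      DELETE FROM PolicyTransactionDetail
      WHERE  TransactionYear < YEAR(GETDATE()) - 7;

    In a real system you would merge into the summary and wrap the whole thing in a transaction, but the point is that this decision belongs in the design, not in a panic years later.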

    In summary, we must be sure that 'big data' is not in reality 'no policy' (please pardon the pun).

    Rick

    One of the best days of my IT career was the day I told my boss if the problem was so simple he should go fix it himself.

  • Eric M Russell - Monday, March 27, 2017 8:51 AM

    ...I hope that our managers start to think that when our systems run slower that we're dealing with big data, and we need more resources. The whole big data phenomenon could be a way for data professionals to start a new hardware renaissance, where hardware budgets grow and we begin to replace our current systems with bigger, faster servers...

    The problem is when an organization believes that the solution to any database performance problem is to throw more computing resources at it. Many organizations don't have the database engineering know-how to properly design databases or troubleshoot root causes, so they treat it as an issue of scaling up or scaling out the platform. However, a single strategically placed index can save $100,000 in new hardware or cloud hosting.
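
    To be fair, the kind of fix being described is something like this - a minimal sketch with made-up table and column names:

      -- A covering index that turns a full scan plus sort into a seek
      -- for one well-known query shape.
      CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
          ON dbo.Orders (CustomerID, OrderDate)
          INCLUDE (TotalAmount);

      -- The query it is meant to help:
      DECLARE @CustomerID INT = 42;
      SELECT OrderDate, TotalAmount
      FROM   dbo.Orders
      WHERE  CustomerID = @CustomerID
      ORDER BY OrderDate;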

    Man, where to go with this one.

    I wrote a big long post in response, but removed it because it may not be proper to show examples. But let's not assume it's all solvable with a strategically placed index. I firmly believe it's not that easy, especially if you're dealing with truly large volumes of structured data on a symmetric (SMP) infrastructure like a traditional RDBMS; a single index won't solve every problem there.

    Many of us who deal with large volumes of structured data cannot be nimble enough to tackle remodeling, new indexes and so forth when the questions asked of the data are not consistent. Ultimately, scaling up resources becomes a viable approach, because most of the issues come down to not having enough resources to complete a job once you scale across the business.

    That's why products like Azure Data Warehouse, Amazon Redshift, Google BigQuery and others shine. They are tools for specific needs that are not easily resolved by a strategically placed index.

  • Big Data isn't OLTP data unless you are doing high-speed financial trading.

    Buck Woody said that "Big Data is data that you struggle to process with the technology you have today".  The unstated assumption is that the technology you are applying today is being used competently and appropriately.

      A true big data problem is not one you are going to solve using SMP technology, otherwise it wouldn't be a big data problem.

    I know of one company that specialises in dealing with sensor data.  This flows in at over 1 billion transactions an hour, and they have to turn that data around in an incredibly tight time frame.
    I live near Jodrell Bank.  That needed a dark fibre connection to Holland because the internet simply isn't fast enough!

    I find my head nodding at the tales of systems groaning because people didn't specify a purge strategy and every user pays the price for that oversight.

    My contention is that most organisations don't have and never will have a Big Data problem.  

    However, saying that is akin to publishing "Pure, White and Deadly".  John Yudkin had his reputation trashed and didn't live to see his work vindicated.  Many unkind and untrue things were said of him that, as a scientist rather than a politician, he was not equipped to defend himself against.

  • xsevensinzx - Monday, March 27, 2017 9:23 PM

    That's why products like Azure Data Warehouse, Amazon Redshift, Google BigQuery and others shine. They are tools for specific needs that are not easily resolved by a strategically placed index.

    It seems like you have no idea how all these tools work and what technology they are based on.

    Amazon Redshift is based on PostgreSQL 8.0.2.


    https://en.wikipedia.org/wiki/Amazon_Redshift
    OK, Amazon Redshift is basically a classical RDBMS with additional processing responsible for defining and creating correct indexing based on the nature of the data and the queries you run.

    I guess I don't need to say much about Azure.
    It's basically the same SQL Server with "warehouse tools" on top, which apply some limitations compared to the classic SQL Server.

    Google BigQuery works on the BigTable data structure, which is - try to guess - an index.

    Data is assembled in order by row key, and indexing of the map is arranged according to row, column keys and timestamps.


    BigTable

    "Big Index" approach used in BigTable is inherited by the most of "Big Data" solutions we aware of:

    Google's thorough description of BigTable's inner workings has allowed other organizations and open source development teams to create BigTable derivatives, including the Apache HBase database, which is built to run on top of the Hadoop Distributed File System (HDFS). Other examples include Cassandra, which originated at Facebook Inc., and Hypertable, an open source technology that is marketed in a commercial version as an alternative to HBase.

    Bottom line - all Big Data tools do only one thing: placing the right index on the data.
    And you pay the providers for exactly that - strategically placing the index on your data which you failed to define and place yourself.

    _____________
    Code for TallyGenerator

  • Here is a write-up of how it actually works for one of them.

    http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf

    Glad you seem to think that the secret to these technologies is just placing indexes on the fields you need. Never knew I could solve all my problems with just a single set of strategic indexes.

    Here I thought that we were just doing large scans of all the data; maybe we just need to focus on better ways to scan and aggregate those larger datasets with these other tools that are specifically designed for it.

  • Sergiy - Tuesday, March 28, 2017 2:47 AM


    Bottom line - all Big Data tools do only one thing: placing the right index on the data.
    And you pay the providers for exactly that - strategically placing the index on your data which you failed to define and place yourself.

    Ermmm....No.  Try putting an index on a multi-petabyte table and see how far it gets you.

    RedShift may use a Postgres implementation but it isn't an RDBMS, it's a column store.  A clustered index may help a range scan if the fields in the index are what you are scanning on, but column stores work differently to OLTP databases.

    Not all Big Data tools are databases.  Apache Spark and Beam are distributed compute engines.  These are serious tools, not the usual seasonal fashion.

  • "Bottom line - all Big Data tools do only one thing: placing the right index on the data.
    And you pay the providers for exactly that - strategically placing the index on your data which you failed to define and place yourself."

    No, Sergiy, that's NOT why I would be paying them.  NOBODY adds indexes to MY database tables without my knowledge.  If I don't have an index, it's my decision and my responsibility.  They are free to suggest, not change.

    Rick

    One of the best days of my IT career was the day I told my boss if the problem was so simple he should go fix it himself.

  • > Ermmm....No. Try putting an index on a multi-petabyte table and see how far it gets you.

    How much time does it take for a page scanned by the Google robot to appear in a Google search?

    Several days.

    And believe me, they don't run 486DX2 machines.

    Why does it take so long?

    Probably because that's how long it takes to rebuild "an index on a multi-petabyte table".

    Can you see any other reason?

    As for columnstore - one question.

    Columnstore {what} ?

    _____________
    Code for TallyGenerator

  • > NOBODY adds indexes to MY database tables without my knowledge.

    BigTable *is* an index.

    By loading your data into BigTable you are populating an index.

    Maybe without realising it.

    It's funny we even discuss it.

    Do you really think that BigQuery actually scans those petabytes of data every time you run a query?

    _____________
    Code for TallyGenerator

  • Sergiy - Tuesday, March 28, 2017 2:37 PM

    > NOBODY adds indexes to MY database tables without my knowledge. BigTable *is* an index. By loading your data into BigTable you are populating an index. Maybe without realising it. It's funny we even discuss it. Do you really think that BigQuery actually scans those petabytes of data every time you run a query?

    Umm. When you query in Google BigQuery, for example, it's doing a full scan of all the columns specified in your query. This is the complete opposite of what you want with an index. You don't want to do full scans. I surely hope you are not trying to compare the algorithms Google is running behind the scenes to an index. Regardless of the filters applied in your query, it's snagging all the data within the specific columns you specified and then, on the way up, aggregates and filters those results for you.

    Full table scans are the complete opposite of what you want to achieve with an index in a traditional RDBMS. However, if you're trying to aggregate large volumes of data that are not easily indexable, or whose query patterns are not known in advance - a true ad-hoc platform - BigQuery and various other platforms are the right tools for you.

    You can surely try to say, "oh, just add an index and we can easily query 10 billion records in a single table." But that means you must know what you want to index first. Some of us are using our data for ad-hoc analysis where nothing is really consistent. The traditional RDBMS is not a good platform for those types of workloads. The data is very large, hard to index and even harder to do large table reads on without throwing more resources at the problem. Instead, we use platforms like BigQuery, Redshift and so forth that break those HUGE READS into smaller problems within lightning-fast columnstores.
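
    As a rough illustration (hypothetical dataset, table and column names), the engine reads every value in each column the query touches and does the filtering and aggregation during the scan - there is no index to seek on:

      -- BigQuery-style standard SQL: the work is driven by the columns referenced
      -- (campaign_id, clicks, revenue, event_date), not narrowed by the WHERE clause.
      SELECT campaign_id,
             SUM(clicks)  AS total_clicks,
             SUM(revenue) AS total_revenue
      FROM   ads.daily_events
      WHERE  event_date BETWEEN DATE '2017-01-01' AND DATE '2017-03-01'
      GROUP BY campaign_id;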

    Speaking of SQL Server specifically, it does not have the same power as PostgreSQL (also used by Redshift). It does not allow us to cluster servers together and do proper distributed processing within a true MPP infrastructure (as Greenplum, for example, does). Trying to query large sets of unindexed data and aggregate them across billions of records, for multiple users at the same time, is hard to support on a single machine. Memory and processing are shared on a single box, versus a shared-nothing environment where each query has its own dedicated resources.

  • David.Poole - Tuesday, March 28, 2017 1:48 PM

    Sergiy - Tuesday, March 28, 2017 2:47 AM


    Bottom line - all Big Data tools do only one thing: placing the right index on the data.
    And you pay the providers for exactly that - strategically placing the index on your data which you failed to define and place yourself.

    Ermmm....No.  Try putting an index on a multi-petabyte table and see how far it gets you.

    RedShift may use a Postgres implementation but it isn't an RDBMS, it's a column store.  A clustered index may help a range scan if the fields in the index are what you are scanning on, but column stores work differently to OLTP databases.

    Not all Big Data tools are databases.  Apache Spark and Beam are distributed compute engines.  These are serious tools, not the usual seasonal fashion.

    Distribution is key here too. How the data is fetched, and even how it is stored, follows that same fashion. I believe the storage engine for BigQuery takes every field and stores it as separate files. The engine essentially scans those column-based files on every query, from top to bottom, like a full scan. The storage is distributed across the nodes, and dispatchers read the data in parallel to improve speed. But I could be off on the specifics.

  • xsevensinzx - Tuesday, March 28, 2017 4:35 PM

    Sergiy - Tuesday, March 28, 2017 2:37 PM

    > NOBODY adds indexes to MY database tables without my knowledge. BigTable *is* an index. By loading your data into BigTable you are populating an index. Maybe without realising it. It's funny we even discuss it. Do you really think that BigQuery actually scans those petabytes of data every time you run a query?

    Umm. When you query in Google BigQuery, for example, it's doing a full scan of all the columns specified in your query. This is the complete opposite of what you want with an index. You don't want to do full scans. I surely hope you are not trying to compare the algorithms Google is running behind the scenes to an index. Regardless of the filters applied in your query, it's snagging all the data within the specific columns you specified and then, on the way up, aggregates and filters those results for you.

    Full table scans are the complete opposite of what you want to achieve with an index in a traditional RDBMS. However, if you're trying to aggregate large volumes of data that are not easily indexable, or whose query patterns are not known in advance - a true ad-hoc platform - BigQuery and various other platforms are the right tools for you.

    You can surely try to say, "oh, just add an index and we can easily query 10 billion records in a single table." But that means you must know what you want to index first. Some of us are using our data for ad-hoc analysis where nothing is really consistent. The traditional RDBMS is not a good platform for those types of workloads. The data is very large, hard to index and even harder to do large table reads on without throwing more resources at the problem. Instead, we use platforms like BigQuery, Redshift and so forth that break those HUGE READS into smaller problems within lightning-fast columnstores.

    Speaking of SQL Server specifically, it does not have the same power as PostgreSQL (also used by Redshift). It does not allow us to cluster servers together and do proper distributed processing within a true MPP infrastructure (as Greenplum, for example, does). Trying to query large sets of unindexed data and aggregate them across billions of records, for multiple users at the same time, is hard to support on a single machine. Memory and processing are shared on a single box, versus a shared-nothing environment where each query has its own dedicated resources.

    That's a nice tale.
    Yes, and pixie dust lets you fly, sure.

    Let's refer to the masters of the game:
    https://cloud.google.com/files/BigQueryTechnicalWP.pdf

    Columnar Storage
    Dremel stores data in its columnar storage, which means it separates a record into column values and stores each value on different storage volume, whereas traditional databases normally store the whole record on one volume.


    Here is the answer to your "But that means you must know what you want to index first":
    Dremel indexes *every* column by putting it into a separate columnstore index.
    You say a columnstore is not an index?
    Then read the following statement:

    Higher compression ratio. One study reports that columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3. Because each column would have similar values, especially if the cardinality of the column (variation of possible column values) is low, it's easier to gain higher compression ratios than row-based storage.

    Do those key words used in there give you a hint?
    Cardinality - a characteristic of what?
    "Grouping similar values into one, holding references to the original positions of the values" - isn't that what they call "indexing"?
    A columnar store is an index by its nature, by design.
    By loading the data into a columnar store you create an index on every column involved in the process.
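
    Even SQL Server spells it out for you - the columnar storage is literally created as an index (made-up table name, just to show the syntax):

      -- The whole table is rebuilt as compressed column segments;
      -- every column is included, you don't pick and choose.
      CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
          ON dbo.FactSales;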

    And to finish with those fairy tales about full scans, from the same source:

    Columnar storage has the disadvantage of not working efficiently when updating existing records. In the case of Dremel, it simply doesn’t support any update operations.


    What?
    Why is it such a big deal to add another value to an existing columnar store?
    If we do full scans anyway, would an extra 10 bytes make any noticeable difference?

    Yes, an update is a problem, because it would require an index to be rebuilt:
    the columnar store index, and the data distribution indexes which define which columnstores to scan for the values in your query.

    This is why *ALL* big data storage engines are read-only.
    Modifying data online would simply destroy all the heavy indexing they rely on.
    Therefore new data loads always happen asynchronously, offline, on a predefined schedule.
    And only when the load is finished is the "current" columnstore index dropped and replaced with the newly built one.

    _____________
    Code for TallyGenerator

  • Now you're getting into semantics. This is how columnstores work in general. The way they store and retrieve this data is not referred to as indexing, regardless of how other engines position themselves with columnar-based features like clustered and non-clustered columnstore indexes. They operate differently and have different use cases depending on how you are utilizing the data.

    When you get into scenarios where you have large datasets and columnar-based solutions are needed to perform analytics on the data, it goes beyond just what you are referring to as "the same as indexing every column." How the data is compressed, stored and read across your disks and nodes is important. The data is not only compressed very tightly, but also replicated and extremely fault tolerant. It's also distributed and runs in parallel across multiple commodity machines, versus the traditional RDBMS with columnar features that is stuck within one box. Again, MPP versus SMP, which allows each of those readers to have dedicated resources that are not shared.

    The columnar features, storage engine, query engine and more are all part of a complete ecosystem that allows for blazing-fast results across very large datasets; it is not something easily solved by indexing one or more columns in your data warehouse. This is why these products exist: you can scale out instead of constantly having to scale up (i.e., throw more resources at the problem).
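
    To give a flavour of what designing for scale-out looks like instead of index tuning - a sketch in Redshift-style syntax with made-up names, illustrative only:

      -- Distribution and sort choices decide how the data is spread across the
      -- nodes and how the column segments are ordered; there is no index to place.
      CREATE TABLE fact_events (
          event_date   DATE,
          campaign_id  BIGINT,
          user_id      BIGINT,
          clicks       INTEGER,
          revenue      DECIMAL(18,4)
      )
      DISTSTYLE KEY
      DISTKEY (campaign_id)
      SORTKEY (event_date);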

  • Semantics or no semantics - can you answer the simple question?

    Why is columnar storage (any implementation of it) not updateable?

    _____________
    Code for TallyGenerator

  • Sergiy - Tuesday, March 28, 2017 10:48 PM

    Semantics or no semantics - can you answer the simple question? Why is columnar storage (any implementation of it) not updateable?

    You can do updates in Redshift and HP Vertica.  I wouldn't recommend it though.
    You seem to equate big data with simply a bigger version of an existing database.  That simply isn't the case.  Image, audio, and video analysis doesn't fit the RDBMS (or column store) use case.

    Either the execution plans on my column stores are giving me "magic pixie dust fake news" or they are scanning.  The column stores use various compression techniques as part of their strategy to make this viable.
    It would be more accurate to say that big data solutions often make multiple read-optimised copies of their data.

    HP Vertica has the concept of a projection, which is a materialised view of the data.  In Vertica a table is simply a metadata construct containing no data whatsoever.
    There is a minimum of one super-projection, which contains all the data, with a particular choice of column order and sorting to aid the scan.  There may be other projections on subsets of the columns, again with column order and sort order optimised.  These are not indexes.
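
    A sketch of what I mean - made-up table, Vertica syntax from memory, so treat it as illustrative rather than definitive:

      -- A projection physically stores a chosen set of columns, sorted and
      -- segmented across the nodes to suit a particular scan pattern.
      CREATE PROJECTION sales_by_day
      AS SELECT sale_date, store_id, amount
         FROM   sales
         ORDER BY sale_date, store_id
      SEGMENTED BY HASH(store_id) ALL NODES;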
