The Age of Multiple Databases

  • Yeah, it's not that big in comparison to what's out there obviously, but it's many many TB's.

    The issue is not really the size of the data. I was pushing large datasets on SQL Server 2008 fine. The issue is ad-hoc queries where the data is not modeled for it and we can never get to a point of modeling it all the time. Thus, using these other tools that make it easier to do these types of scenarios works out great because little effort is needed to make them sing. Then when they do become to a point of modeling, that's where the RDBMS comes into play. It's also why the data warehouse is not going away anytime soon. Just more tools for the toolbox.

    However, issues people are having with how these tools are being used by other teams is really irrelevant. How someone uses a tool has no bearing on the tech unless how they use it is forced by the tech itself. Hearing that people have crazy designs that turn you away is a business process problem, not a tech problem. For example, plenty of people can design a database bad in SQL Server, it doesn't make SQL Server a bad tool.

    That's in reference to previous discussions we've had where people using how others are designing these tools as a reason to not use them.

  • I agree with the article, use the technology that is best for the purpose. In my last place of work, we used SQL to store and rationalise the data coming in from multiple sources, then used Solr to display the data to customers as it was much faster. With a baseline and display only data, this worked great, but even with updateable data, this would work as you just switch back to the SQL Server on write and fire off the changes as they happen to Solr (could be any other product quite frankly)..

  • NoSQL database engines are essential for specific use cases; like TB scale analytics, IoT scale request throughput, or a mobile app scale user base. If you're a SQL Server DBA, then NoSQL is your friend, because it offloads pressure from your line of business MSSQL databases. However, if developers are using NoSQL because it's easier to prototype or to work around "annoyances" like relational modeling, then the argument in favor is far less compelling.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Tuesday, July 17, 2018 7:03 AM

    NoSQL database engines are essential for specific use cases; like TB scale analytics, IoT scale request throughput, or a mobile app scale user base. If you're a SQL Server DBA, then NoSQL is your friend, because it offloads pressure from your line of business MSSQL databases. However, if developers are using NoSQL because it's easier to prototype or to work around "annoyances" like relational modeling, then the argument in favor is far less compelling.

    This is how I feel. Don't just pick something because you like it, or you don't want to learn something else. Have a good reason why a tool is better. I have no issue with NoSQL stores, just remember that they're not necessarily better than an RDBMS, and there are limitations.

  • Eric M Russell - Tuesday, July 17, 2018 7:03 AM

    NoSQL database engines are essential for specific use cases; like TB scale analytics, IoT scale request throughput, or a mobile app scale user base. If you're a SQL Server DBA, then NoSQL is your friend, because it offloads pressure from your line of business MSSQL databases. However, if developers are using NoSQL because it's easier to prototype or to work around "annoyances" like relational modeling, then the argument in favor is far less compelling.

    Not necessarily. I would say the largest benefit for our business with choosing the route we have is for the fact it's easier to load data without the need of the relational model. This is because of how long it takes for DBA's to model the data and finally integrate it into the underlying RDBMS system.

    I'm starting to find as I age in this profession that a lot of data professionals seem to think this is not a problem like users just need to deal with it. This kind of baffles my mind why anyone would dismiss the fact that longer time to market is a positive thing versus a negative thing. It's one of the major reasons so many people are disliking the idea of the data warehouse for the simple fact it cannot turn on a dime when the business changes. If another solution can, you'd be surprised how much more that business would like that feature over say, other features that the solution would have once the data is finally modeled.

  • xsevensinzx - Wednesday, July 18, 2018 6:01 AM

    Not necessarily. I would say the largest benefit for our business with choosing the route we have is for the fact it's easier to load data without the need of the relational model. This is because of how long it takes for DBA's to model the data and finally integrate it into the underlying RDBMS system.

    I'm starting to find as I age in this profession that a lot of data professionals seem to think this is not a problem like users just need to deal with it. This kind of baffles my mind why anyone would dismiss the fact that longer time to market is a positive thing versus a negative thing. It's one of the major reasons so many people are disliking the idea of the data warehouse for the simple fact it cannot turn on a dime when the business changes. If another solution can, you'd be surprised how much more that business would like that feature over say, other features that the solution would have once the data is finally modeled.

    What you just described would fall under the category of what I called "TB scale analytics". It seems your business model requires ingesting and reporting on large volumes of data where the structure changes frequently, but that's an atypical data usage pattern for most organizations. The way I see it, Hadoop is good for staging and exploring data; the core components (HDFS and MapReduce) are like the foundational base of a data warehouse pyramid. In a nutshell, I see it as a staging and ad-hoc analysis tool. But once we've identified useful patterns and the subset of data that users's actually need, it makes sense to have those golden records contained in a relational or OLAP database at the top of that pyramid, where the data can have stronger semantic rules enforced, enriched with meta-data from multiple sources, and providing low latency access.

    When it comes to multi-TB scale ETL and analytics, there are certainly better options than SQL Server. Yet, ETL and analytics are just facets in the broader data warehouse architecture.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Wednesday, July 18, 2018 6:53 AM

    What you just described would fall under the category of what I called "TB scale analytics". It seems your business model requires ingesting and reporting on large volumes of data where the structure changes frequently, but that's an atypical data usage pattern for most organizations. The way I see it, Hadoop is good for staging and exploring data; the core components (HDFS and MapReduce) are like the foundational base of a data warehouse pyramid. In a nutshell, I see it as a staging and ad-hoc analysis tool. But once we've identified useful patterns and the subset of data that users's actually need, it makes sense to have those golden records contained in a relational or OLAP database at the top of that pyramid, where the data can have stronger semantic rules enforced, enriched with meta-data from multiple sources, and providing low latency access.

    When it comes to multi-TB scale ETL and analytics, there are certainly better options than SQL Server. Yet, ETL and analytics are just facets in the broader data warehouse architecture.

    For sure, but even in smaller datasets, such as general operational reporting, you still face the underlying issue of not being able to be flexible with the model. It doesn't matter if you have TB of data or not. Speaking on cases such as establishing and maintaining a proper data warehouse for whatever the use case, it's not as flexible for most data warehouses to pivot in another direction. This alone refers to anyone who goes through a lot of iterations in their models where they are constantly shifting. Analytics or not, that applies to a lot of organizations and the time to shift is insanely long. Hence the huge debate of just letting the developers do that versus landing it in the DBA's lap to do it.

    I'm not trying to spawn the argument on whatever approach you should take. Just highlight that so many data professionals are ignoring the fact that being flexible with the model is INSANELY important for a lot of organizations. To think that it's not is silly.

  • xsevensinzx - Monday, July 16, 2018 8:33 PM

    Yeah, it's not that big in comparison to what's out there obviously, but it's many many TB's.

    The issue is not really the size of the data. I was pushing large datasets on SQL Server 2008 fine. The issue is ad-hoc queries where the data is not modeled for it and we can never get to a point of modeling it all the time. Thus, using these other tools that make it easier to do these types of scenarios works out great because little effort is needed to make them sing. Then when they do become to a point of modeling, that's where the RDBMS comes into play. It's also why the data warehouse is not going away anytime soon. Just more tools for the toolbox.

    However, issues people are having with how these tools are being used by other teams is really irrelevant. How someone uses a tool has no bearing on the tech unless how they use it is forced by the tech itself. Hearing that people have crazy designs that turn you away is a business process problem, not a tech problem. For example, plenty of people can design a database bad in SQL Server, it doesn't make SQL Server a bad tool.

    That's in reference to previous discussions we've had where people using how others are designing these tools as a reason to not use them.

    You said a mouthful there.  Thanks for the feedback.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I find a lot of business requirements for data involve testing and experimentation.  They don't know what they don't know and want to explore the data.  They are not yet at the position of being able to define some concrete requirements that would be suitable for a data warehouse.

    It is combinations of volume, velocity an variety that cause the issues rather than just one of those characteristics.

    Languages like Python are incredibly useful because you can achieve a lot very quickly from a development perspective.  This lets business people and data people get a rapid feel for what is useful, what is a promising lead and what are the dead ends.  It is flexibility that is required not perfection.  The low cost experimentation benefits of open-source software have spilled over into the data world where such benefits can yield profitable results.

    In the ideal world we could produce perfectly modelled, structured and conformed data extremely rapidly.  In practise there are technical hurdles and information gaps that prevent this happening.

  • If you're a  government, healthcare, or national retailer, then you probably ingesting a large variety and volume of data from many sources and constantly have new requirements. Whether or not a schema-less data warehouse is right for an organization depends on how we're defining the data warehouse, what are it's boundaries, and use cases. Going back to my analogy of a pyramid with data ingest and staging at the bottom, what's at the top are operational data marts, OLAP cubes, PowerBI reporting, and other end user facing databases. End users on the business side of things want to work with consistent, contextual, and low latency data, and that means strong schema and conformed meta-data.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I try to be open about options, if there is a better solution to a problem than "just put it in SQL Server" then it is worth exploring. You do have to be careful to avoid the "I want to play with this, it just sounds cooler/easier" though. I do try to push the concept that SQL Server, despite it's name, is more than just an RDBMS and can be greatly beneficial as a hybrid data platform.

    For example, we have some data that is mostly used by client applications and using a key/value approach with json blobs provides a much more seamless development experience for 90% of the workload and avoids the pain of constantly trying to handle relational/OO mapping (front-end developers aren't data experts, data-experts aren't used to OO thinking) for things that don't really need it. Behind the scenes though, we don't want the pain of an EAV database - so Indexed Views can be created to transparently transform the bits of data we do need extracted for other data tasks into suitable table structures for conventional queries. And all of the entity storage built atop temporal tables to provide an audit trail/changelog that "just works".

Viewing 11 posts - 16 through 25 (of 25 total)

You must be logged in to reply to this topic. Login to reply