Production Subsets

  • What's the compelling reason to use a subset of data? If it is no big deal to copy a recent backup into a test area, why not just do it? This works well for us. We have four test areas that we do this in, used for various purposes.

    The questions above have spawned a few comments, and I agree with @gary, @steve-2, @podmate, and @tom.

    Valid and good points, all around. The last post from @tom, though, could use a comment. I am working in the EU, and am neither a lawyer nor legal counsel! However, I have a few experiences regarding 1) how things work, and 2) how things ought to work.

    The description by @tom is correct, taken literally; but it is too black-and-white. Keep in mind that lawsuits in the EU do not come near US standards by any means, and that risk isn't bad - badly handled risk is bad. So in many situations, you can argue for the use of production data, as in the story from @podmate (which concerns health care data, a really delicate area!).

    However, @tom has a true point: testing isn't an allowed purpose! And if you are big enough, or "governmental" enough, you simply have no excuse for mistreating "personal data"! But it all comes down to the possible event of "misuse". If you can establish other layers of information security, and a very good reason which can hold up both in court and in the newspapers, things start to turn grey... Not that you won't get yelled at if caught, possibly fired, and maybe with a big tab to pick up after the party.

    A few top managers in Europe can testify to that, because breaches of the principles in the personal data legislation framework are a recurring event in Europe. (But generally, awareness is high.)

    One example: if you have systems in place to track each client's consent to "sharing personal data internally", you can do a lot more in the development and testing departments than if you don't (a rough sketch of that idea follows at the end of this post). However, you may never put any client in a situation of data misuse - and senior management can't get off the hook by claiming "we told them not to". They have to make sure logging and monitoring are in place, for example, and show a record of taking corrective action on minor breaches, if they are ever going to convince a court of law that they hold no guilt for a major breach.

    I really appreciate the notion from @tom that "use of production data for testing purposes" is not optimal:

    if one uses only production data one can do none of (i) adequate stress testing (ii) adequate performance testing and (iii) testing new features that need additional tables or attributes

    is simply right on the bullseye!

    The story from @podmate just underlines that the world isn't perfect, and sometimes database professionals have to make do with what they have.

    A thought just occurred to me: you practically only ever need to test on production data if you are building (or rebuilding) on a code base which is riddled with bugs to begin with... However, if your development people have no clue about the domain they are working in (e.g. the professional terminology) and you don't provide them any other way of gaining insight, you may speed up the creation of solid code considerably by allowing them to see "the real world". But as mentioned, this is neither a risk-free approach, nor a cheap one - even though senior management and shareholders sometimes seem to think so...
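    For illustration of the consent point above, here is a minimal sketch of how a dev subset might honour such a flag. All names (DevSandbox, dbo.Client, ConsentInternalUse) are hypothetical, not taken from any poster's system:

        -- Hypothetical: copy only clients who have consented to internal use
        -- of their data into a development sandbox; everyone else stays out.
        SELECT  c.ClientID,
                c.ClientName,
                c.CountryCode
        INTO    DevSandbox.dbo.Client          -- assumes the dev database already exists
        FROM    Production.dbo.Client AS c
        WHERE   c.ConsentInternalUse = 1;      -- hypothetical consent flag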

  • There has been talk in this thread by some that you will only have "real" data when developing systems if you use production data, the assumption being that the applications need to handle it. For ETL I would agree in the main; however, many application programmes and larger ETL projects have a data cleansing stage for a reason. Sometimes the solution to bad data is NOT to code around it but to fix the data. Sometimes. (A rough sketch of that follows below.)

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!
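    As a rough illustration of "fix the data, not the code" in a cleansing stage (the staging table and columns are invented for the example):

        -- Hypothetical cleansing step in a staging area: repair the data once,
        -- instead of scattering defensive code through every downstream query.
        UPDATE  Staging.dbo.CustomerImport
        SET     Email    = LOWER(LTRIM(RTRIM(Email))),
                PostCode = UPPER(REPLACE(PostCode, ' ', ''))
        WHERE   Email    <> LOWER(LTRIM(RTRIM(Email)))
            OR  PostCode <> UPPER(REPLACE(PostCode, ' ', ''));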

  • I have always tended to have a sub-set of production data (the largest system used to have 350k transactions per day, but we are talking nearly two decades ago) and a development set. The development set is for testing boundary conditions, e.g. 29th February 2017!
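    A quick way to exercise that sort of boundary condition in T-SQL (purely illustrative):

        -- 2017 is not a leap year, so 29th February 2017 is an invalid date.
        -- TRY_CONVERT returns NULL rather than raising an error, which makes
        -- it handy for seeding boundary-condition rows in a development set.
        SELECT  TRY_CONVERT(date, '2017-02-29') AS InvalidLeapDay,   -- NULL
                TRY_CONVERT(date, '2016-02-29') AS ValidLeapDay;     -- 2016-02-29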

  • Tom Thomson is now even more right in a GDPR world. With a €20 million fine hanging on it, you want to make sure that all steps to prevent exposure of personally identifiable data are absolutely bomb proof.
    Five years ago a restore from production was probably acceptable, though a bit dubious legally, even in non-regulated industries. The game has moved on. The consequences and likelihood of a data breach have escalated.
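    One common mitigation, sketched very roughly here, is to pseudonymise obvious PII as part of every refresh into a test environment. The Customer table and its columns are hypothetical, and a real masking job needs a proper review of what counts as personal data:

        -- Hypothetical post-restore step in the test environment: overwrite
        -- personally identifiable columns before anyone is given access.
        UPDATE  dbo.Customer
        SET     FirstName = 'Test',
                LastName  = CONCAT('Customer', CustomerID),
                Email     = CONCAT('customer', CustomerID, '@example.invalid'),
                Phone     = NULL;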

  • While acknowledging the legal issues (and the law of unintended consequences politicians live by) I have to say that wherever possible the superior theoretical choice will always be developing against a full copy of the production data.
    Of course our company is somewhat unusual in that we don't have data from individuals. We don't store customer credit cards, contact info, or whatever else might be considered PII.

    Second, I'm a lone-wolf developer/DBA: there's just me to handle the entire development/production cycle, as both developer and DBA. That's a huge advantage in some respects (with admitted downsides, naturally).

    A full-blown production dataset has several advantages:

    1) Performance-wise, what you see is what you'll get. Keeping data synced between dev and prod with each release (new features/code goes up, data comes down - a rough refresh sketch follows at the end of this post) means I can safely reproduce any performance issue in development EXACTLY.

    2) Data-wise, you can pinpoint deficiencies in data import, data validation, expansion of data domains, etc. That's not possible without production data.

    3) Related to point 1, it's much easier to tune performance with prod data, both in terms of data volume and those pesky (and ever-changing) parameter sniffing and index usage issues.

    Now, if I were in healthcare.... 🙂
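    For the "code goes up, data comes down" refresh in point 1, a minimal sketch of the usual mechanism - restoring the latest production backup over the development copy. Database names, logical file names, and paths are placeholders:

        -- Hypothetical refresh of the dev database from last night's prod backup.
        -- WITH MOVE relocates the data/log files; REPLACE overwrites the dev copy.
        RESTORE DATABASE SalesDev
        FROM DISK = N'\\backupserver\sql\SalesProd_full.bak'
        WITH MOVE 'SalesProd'     TO N'D:\Data\SalesDev.mdf',
             MOVE 'SalesProd_log' TO N'L:\Logs\SalesDev_log.ldf',
             REPLACE, STATS = 10;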

  • roger.plowman - Monday, February 19, 2018 7:15 AM

    While acknowledging the legal issues (and the law of unintended consequences politicians live by) I have to say that wherever possible the superior theoretical choice will always be developing against a full copy of the production data.

    I'll disagree slightly here. Your advantages are correct, but you're ignoring a few things.

    1. As you're developing, larger data sets slow down access and experimentation. The slowdowns become especially apparent when you make mistakes, and the larger the data, the easier it is to miss subtle mistakes, because you can't easily run calculations or check groups and aggregations in your head. Small data sets are superior here for initial development.
    2. For testing, I'd rather have production * 1.2 or 1.5 or 2.0, depending on growth rates. Too often I have found myself chasing perf issues because I wasn't planning for growth.
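    A rough sketch of how a "production * N" data set can be faked up by re-inserting the existing rows with shifted keys. The Orders table and columns are hypothetical, and a real version would need to respect identity columns, foreign keys, and data distribution:

        -- Hypothetical: roughly double dbo.Orders for load testing by copying
        -- the existing rows with offset keys and shifted dates.
        DECLARE @maxId int = (SELECT MAX(OrderID) FROM dbo.Orders);

        INSERT INTO dbo.Orders (OrderID, CustomerID, OrderDate, Amount)
        SELECT  OrderID + @maxId,
                CustomerID,
                DATEADD(day, 1, OrderDate),
                Amount
        FROM    dbo.Orders;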

  • roger.plowman - Monday, February 19, 2018 7:15 AM

    While acknowledging the legal issues (and the law of unintended consequences politicians live by) I have to say that wherever possible the superior theoretical choice will always be developing against a full copy of the production data.

    Regardless of the data protection law issue, it is still true that using a full copy of production data is far from optimal. As I wrote in 2014, it is clear that one compelling reason for not doing that is that if one does all testing on a complete set of production data the testing may be very slow - a lot of testing can be done on a small dataset - and that another compelling reason is that if one uses only production data one can do none of (i) adequate stress testing, (ii) adequate performance testing, and (iii) testing new features that need additional tables or attributes, because production data is not suitable for any of that.

    Of course the proposed Cloud Act suggests that US courts can issue subpoenas requiring citizens or companies of other countries, even if they have no connection whatsoever with the USA, to deliver data which they hold on any machines, whether or not in any way connected with the USA, situated in any country in the world, concerning any person, whether or not that person has any connection with the USA, regardless of whether the citizens or companies subpoenaed would be breaking the law of the country in which they live and work by delivering that data. If those citizens or companies failed to deliver the data, the US court would hold them in contempt, probably calling it criminal contempt, and the USA might then attempt to extradite the citizens (or company officers) concerned. I find it hard to believe that such extradition attempts would be successful, but given the irresponsible twits at the head of the current UK government I can't find it impossible in the UK, although it is in the rest of the EU - indeed our prime minister has stated she thinks this proposed law is a good idea, although she must know that supporting it destroys any hope of any sensible Brexit deal, since that support (a) guarantees no transitional period and (b) ensures that there can be no European Union PI data processed in the UK, which means no online sales or postal sales to anywhere in the EU from any person or company in the UK - bang goes any prospect of a trade deal.

    Clearly if this law were passed, it would destroy the current USA-EU agreement on data protection for personally identifiable data of EU citizens, which would cease to be applicable the moment that law were enacted in the USA. When the old "safe harbour" agreement was shown to be useless and declared invalid by European courts, that caused a great degree of panic in American companies, since it meant that they would lose a great deal of business (since it would be illegal for them to have access to the data); the European courts would probably pull the plug a good deal faster with the passing of a law like this, which clearly invalidates the agreement. So perhaps American companies currently processing European data in the USA would be in serious trouble.

    Tom

  • TomThomson - Tuesday, February 27, 2018 9:38 AM

    roger.plowman - Monday, February 19, 2018 7:15 AM

    While acknowledging the legal issues (and the law of unintended consequences politicians live by) I have to say that wherever possible the superior theoretical choice will always be developing against a full copy of the production data.

    Regardless of the data protection law issue, it is still true that using a full copy of production data is far from optimal. As I wrote in 2014, it is clear that one compelling reason for not doing that is that if one does all testing on a complete set of production data the testing may be very slow - a lot of testing can be done on a small dataset - and that another compelling reason is that if one uses only production data one can do none of (i) adequate stress testing, (ii) adequate performance testing, and (iii) testing new features that need additional tables or attributes, because production data is not suitable for any of that.

    Of course the proposed Cloud Act suggests that US courts can issue subpoenas requiring citizens or companies of other countries, even if they have no connection whatsoever with the USA, to deliver data which they hold on any machines, whether or not in any way connected with the USA, situated in any country in the world, concerning any person, whether or not that person has any connection with the USA, regardless of whether the citizens or companies subpoenaed would be breaking the law of the country in which they live and work by delivering that data. If those citizens or companies failed to deliver the data, the US court would hold them in contempt, probably calling it criminal contempt, and the USA might then attempt to extradite the citizens (or company officers) concerned. I find it hard to believe that such extradition attempts would be successful, but given the irresponsible twits at the head of the current UK government I can't find it impossible in the UK, although it is in the rest of the EU - indeed our prime minister has stated she thinks this proposed law is a good idea, although she must know that supporting it destroys any hope of any sensible Brexit deal, since that support (a) guarantees no transitional period and (b) ensures that there can be no European Union PI data processed in the UK, which means no online sales or postal sales to anywhere in the EU from any person or company in the UK - bang goes any prospect of a trade deal.

    Clearly if this law were passed, it would destroy the current USA-EU agreement on data protection for personally identifiable data of EU citizens, which would cease to be applicable the moment that law were enacted in the USA. When the old "safe harbour" agreement was shown to be useless and declared invalid by European courts, that caused a great degree of panic in American companies, since it meant that they would lose a great deal of business (since it would be illegal for them to have access to the data); the European courts would probably pull the plug a good deal faster with the passing of a law like this, which clearly invalidates the agreement. So perhaps American companies currently processing European data in the USA would be in serious trouble.

    Amazing breath control there. :laugh:

    As for speed of testing, I can't agree. If you want faster tests, include a WHERE clause. However, there are two arguments for using production data. A) Realistic performance testing. Nothing like using a copy of the real thing to get accurate performance data (always assuming the data is mature enough to reflect real-world conditions, such as when replacing an existing system). And B) real data tells you what kind of screwed-up data soup you're going to have to force through the system. Collectively, humans are endlessly inventive in finding ways to muck up a perfectly good system. They do things that defy common sense and the laws of nature in ways no small testing group using artificial data ever could.

    As for your rant about the Cloud Act I'm in complete agreement--and I'm an American. This is just overreach by law enforcement, and a reinforcement of the old adage about power corrupting. I suspect the law of unintended consequences is about to start laying into the politicians involved with a clue stick covered in rusty rail-road spikes...

  • roger.plowman - Wednesday, February 28, 2018 6:27 AM

    As for speed of testing, I can't agree. If you want faster tests, include a WHERE clause. 

    I'd say a test != a test.

    For developers, when working out issues, I want a small data set most of the time. There's no need for a large data set to work out how to build logic - just enough to understand the issue, which I would say is really 10-30 rows max for most cases. A WHERE clause may or may not help here, but more importantly it's another thing for developers to remember and deal with. Anything distracting is unwanted.

    However, at some point, all code should be exercised against 1.0*production to 2.0*production size to be sure there aren't issues with performance. This is where you learn what works and what doesn't. Developers should then take feedback here to understand how to write better code in the future.
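    A throwaway way to carve out that kind of 10-30 row working set, so nobody has to remember a WHERE clause later (table names are invented for the example):

        -- Hypothetical: a tiny, repeatable sample for working out query logic,
        -- kept in a dev-only table instead of filtering the full table each time.
        SELECT TOP (25) *
        INTO   dbo.Orders_DevSample
        FROM   dbo.Orders
        ORDER  BY OrderDate DESC;    -- recent rows are usually the most relevant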

  • Steve Jones - SSC Editor - Thursday, March 1, 2018 2:59 AM

    roger.plowman - Wednesday, February 28, 2018 6:27 AM

    As for speed of testing, I can't agree. If you want faster tests, include a WHERE clause. 

    I'd say a test != a test.

    For developers, when working out issues, I want a small data set most of the time. There's no need for a large data set to work out how to build logic - just enough to understand the issue, which I would say is really 10-30 rows max for most cases. A WHERE clause may or may not help here, but more importantly it's another thing for developers to remember and deal with. Anything distracting is unwanted.

    However, at some point, all code should be exercised against 1.0*production to 2.0*production size to be sure there aren't issues with performance. This is where you learn what works and what doesn't. Developers should then take feedback here to understand how to write better code in the future.

    Um....no. Just no.

    This is THE single worst idea ever. Development should happen against full-size data sets from the beginning. Over and over again there have been disastrous scaling-related performance issues that render systems unfit for purpose--HealthCare.gov is the first example that springs to mind, but the laundry list of similar failures is depressingly epic in length.

    Developers of non-SQL languages simply do not comprehend the antiquated dinosaur that is T-SQL. It's a completely different paradigm from a bygone era. Absolutely required best practices in, say, .NET or Java are WORST practices in T-SQL. Principles like DRY and code reuse are not only discouraged in high-performance T-SQL, they are actively lethal to it!

    The ONLY way developers of non-SQL languages will understand just how badly "T-SQL written like C#" really (doesn't) work is to run "simple" queries against a copy of production data and find out it takes 15 minutes to fill one combo box--when the form has 15 more...

    T-SQL scaling is not linear, it's freaking exponential. Or compound-exponential. Developing against small data sets hides that fact until it's too late.
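    To illustrate the "reuse can be lethal" point with a contrived example (the tables and function are made up): a scalar UDF looks tidy and DRY, but it runs once per row and hides its cost until the data is production-sized.

        -- Hypothetical DRY-style helper: executed row by row under the covers.
        CREATE FUNCTION dbo.fn_CustomerOrderTotal (@CustomerID int)
        RETURNS money
        AS
        BEGIN
            RETURN (SELECT SUM(Amount) FROM dbo.Orders WHERE CustomerID = @CustomerID);
        END;
        GO

        -- Looks innocent, crawls against a large Customer table.
        SELECT CustomerID, dbo.fn_CustomerOrderTotal(CustomerID) AS Total
        FROM   dbo.Customer;

        -- The set-based equivalent does the same work in one pass.
        SELECT    c.CustomerID, SUM(o.Amount) AS Total
        FROM      dbo.Customer AS c
        LEFT JOIN dbo.Orders   AS o ON o.CustomerID = c.CustomerID
        GROUP BY  c.CustomerID;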

  • roger.plowman - Thursday, March 1, 2018 7:03 AM

    Um....no. Just no.

    T-SQL scaling is not linear, it's freaking exponential. Or compound-exponential. Developing against small data sets hides that fact until it's too late.

    Disagree, and if you wait until it's too late to run an idea against a large data set, your process is broken. Developing against small data sets and moving to large data sets later doesn't mean waiting until the feature is done, checked in, and sent to QA. That's a fundamental development-flow issue.

    Every developer needs to understand the impact of their code against production-sized loads, but they don't need to build everything against them, especially early on.

  • Many schools of thought behind that one. I tend to get the logic down first, simply to get something done quickly, then iterate over it for scalability later. Too many times the end user is left waiting for everything to be finished before a working example or prototype is available for feedback; then, once you finally get something done, the feedback comes back late and you're iterating on it again anyway. Thus I tend to start small and grow into larger data sets as part of the natural life cycle of a query, for example.

  • Rubrik, the backup solution we use, allows us to live mount a SQL Server database from backup, which is great for QA testing. Redgate has something similar.

    But the good thing about using mockup data in development is that you can unit test your application against configuration scenarios that may not even exist in production yet. For example, assume you're developing an application that uses a multi-tenant database. Currently there may only be a handful of clients in production, and they only use a subset of the application's functionality. However, you can test the full range of features using mockup data (a rough sketch follows below).

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
