Bad habit #2: Prod data -->> dev, test databases

  • It is concerning that your developers believe they not only require production data, but that they are unable to do their job without it. I'd be looking for new developers that actually knew what they were doing and what a software development lifecycle is all about.

    Production data should only ever sit in Production. Period.

    Developers develop and their task is to meet functional requirements. This means that they can prove through each iteration of their development that functions / procedures / methods perform their function correctly and are regression tested. For this, they obviously require input data. They also need this data to be a known good input set that provides repeatable outputs. Development environments do not require the same performance as Production - it'd be nice if you can afford it, but that is very rare. Development also doesn't require the same storage allocation.

    Using Production data fails this requirement, as by definition, it changes - it's Production.
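
    A minimal sketch of the repeatable testing described above, in Python (the function and figures are hypothetical): a fixed, known-good input set yields the same expected outputs on every run, which live production data cannot guarantee.

```python
# Hypothetical function under test: applies a discount rate to an order total.
def apply_discount(total, rate):
    if not (0 <= rate <= 1):
        raise ValueError("rate must be between 0 and 1")
    return round(total * (1 - rate), 2)

# A known-good, static input set: the same fixture every run, so the
# expected outputs never drift the way live production data would.
FIXTURE = [
    (100.00, 0.10, 90.00),   # (input total, rate, expected output)
    (19.99,  0.00, 19.99),
    (50.00,  0.50, 25.00),
]

def run_regression():
    # Repeatable by construction: identical inputs, identical expectations.
    for total, rate, expected in FIXTURE:
        assert apply_discount(total, rate) == expected
    return "all cases pass"
```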

    Once the developers have reached their milestone and are ready for Test / Acceptance, the project is progressed to the next lifecycle environment. Now, depending on the size of your organisation your mileage will vary. Obviously larger enterprises have greater resources and can provide an ideal path for development. In saying that, being large enterprises, they also require the rigour around the development lifecycle.

    Test / Acceptance contains many sub-environments. You may have a UAT (User Acceptance Test), System Integration Testing (testing that disparate components communicate correctly), Systems Test (varying parts work together on their destined systems), etc...

    Once your product has passed those tests, it is ideal to perform Volume Testing. This environment is classified as Production for reasons revealed next. This is in an environment that mirrors your Production configuration - same hardware, memory, cpu, disk, etc... in terms of capability and performance. You take a representative sample of your Production data and perform load testing to ensure that your Production configuration meets your non-functional requirements (query times, backup windows, downtime restrictions, user responsiveness).

    Now you progress to Production.

    There are many issues hosting Production data in a lower lifecycle environment. Not the least of which is protecting the business - Production data is sensitive. It is what runs your business. In those organisations where auditing, especially legislative, is required - you will get nailed.

    As I mentioned above, Production data is not suitable for a Development environment due to its fluid nature. Developers meet functional requirements and to do that they need a known good base level with static data to provide repeatable tests.

    Visual Studio can provide unit testing and testing data. I'm sure there are other tools out there that will do this too.

    If you are unable to, for whatever reason, provide random non-Production data and must use Production, then at least scrub it. Take a representative sample, randomise records around and break your Production referential links, and mess up the actual values so that they are meaningless. It is important that you do both - merely changing the data and leaving the links in place will allow someone with the time and a little knowledge to work out your business data.
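
    As a rough illustration of that scrubbing approach (hypothetical table shapes, Python standing in for whatever ETL tooling you use): mask the values *and* shuffle the keys, so the original links cannot be reconstructed by matching values.

```python
import random
import uuid

# Hypothetical production extract: two linked tables.
customers = [
    {"customer_id": 101, "name": "Alice Smith", "balance": 2400.00},
    {"customer_id": 102, "name": "Bob Jones",   "balance": 130.50},
    {"customer_id": 103, "name": "Carol White", "balance": 990.00},
]
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 2400.00},
    {"order_id": 2, "customer_id": 102, "amount": 130.50},
    {"order_id": 3, "customer_id": 103, "amount": 990.00},
]

def scrub(customers, orders, seed=1):
    rng = random.Random(seed)
    # Step 1: mess up the actual values so they are meaningless on their own.
    for c in customers:
        c["name"] = "Customer-" + uuid.uuid4().hex[:8]
        c["balance"] = round(rng.uniform(0, 5000), 2)
    # Step 2: break the referential links - reassign orders to shuffled
    # customer ids and re-randomise amounts, so no order can be traced
    # back to its real account by matching values.
    ids = [c["customer_id"] for c in customers]
    rng.shuffle(ids)
    for order, new_id in zip(orders, ids):
        order["customer_id"] = new_id
        order["amount"] = round(rng.uniform(0, 5000), 2)
    return customers, orders
```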

    So, to reiterate - use development data in development, use acceptance data in acceptance, use a representative sample of production data in volume test and production data in production. It'll save you both time and money, lower risk and increase protection. Anything else and you're on a short road to problems.

    --
    Andrew Hatfield

  • Andrew is pretty on here with what I think. The exceptions I'd make for production data are in a break/fix or User Acceptance Test environment where the people verifying things need to work with known quantities for their testing. And again, these aren't developers, they're business people (or clients) that validate the way things work.

    If you must test in a test spot, it should be set up and quickly torn down when tests aren't done. Developers, no matter what they say, don't NEED production data. They want it because it makes testing easier, and they can verify things with a business person, or use info from a business person. If that's the case, let the business person test.

    With VMs, there's no reason not to set up an extra environment for testing.

  • Steve Jones - Editor (5/13/2009)


    Andrew is pretty on here with what I think.

    You're a top bloke

    Steve Jones - Editor (5/13/2009)


    The exceptions I'd make for production data are in a break/fix or User Acceptance Test environment where the people verifying things need to work with known quantities for their testing. And again, these aren't developers, they're business people (or clients) that validate the way things work.

    If you must test in a test spot, it should be set up and quickly torn down when tests aren't done. Developers, no matter what they say, don't NEED production data. They want it because it makes testing easier, and they can verify things with a business person, or use info from a business person. If that's the case, let the business person test.

    With VMs, there's no reason not to set up an extra environment for testing.

    I agree. If you have an explicit, specific problem that you are unable to replicate in a lower environment, then setup a small problem specific test environment and then replicate - and hopefully resolve.

    --
    Andrew Hatfield

  • Hats off to both Andrew and Steve. You supplied plenty of food for thought, especially regarding the need for developer access to production data and what to do in the rare cases where they may need such access.

    Barkingdog

  • Barkingdog (5/13/2009)


    Good point:

    Of course, while it may not match production data, Red Gate's data generator (and others) is a good start in my opinion.

    Barkingdog

    But that's the whole point, isn't it? A tool or a tester will enter data that is expected. It's easy to write bulletproof code when your data is entered as expected by a simulation. End users will always find a way to enter the unexpected. Those are the conditions that need to be caught in testing. Any bozo can write code that will work if the user enters exactly what is expected.

  • jparra (5/14/2009)


    Barkingdog (5/13/2009)


    Good point:

    Of course, while it may not match production data, Red Gate's data generator (and others) is a good start in my opinion.

    Barkingdog

    But that's the whole point, isn't it? A tool or a tester will enter data that is expected. It's easy to write bulletproof code when your data is entered as expected by a simulation. End users will always find a way to enter the unexpected. Those are the conditions that need to be caught in testing. Any bozo can write code that will work if the user enters exactly what is expected.

    One of the points being made here is that there are different kinds of testing with different goals, and consequently with different dataset requirements.

    -- RBarryYoung, (302) 375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • Andrew Hatfield (5/13/2009)


    It is concerning that your developers believe they not only require production data, but that they are unable to do their job without it. I'd be looking for new developers that actually knew what they were doing and what a software development lifecycle is all about.

    Wow. I am rather curious if you've years of development in a production environment.

    Production data should only ever sit in Production. Period.

    How very DBAish of you. There is only one answer to every question. Not every circumstance is the same. This kind of one-size-fits-all approach is what causes large projects to fail on a regular basis.

    Developers develop and their task is to meet functional requirements. This means that they can prove through each iteration of their development that functions / procedures / methods perform their function correctly and are regression tested. For this, they obviously require input data. They also need this data to be a known good input set that provides repeatable outputs.

    Um, no. It's easy to write brittle code that will pass a set of known regression tests. To write robust code, variability is needed. Unless your production users only run scripts from RoboTest.

    Using Production data fails this requirement, as by definition, it changes - it's Production.

    If your code can't handle production data in testing, how is it going to handle it in production? That's kind of the point, isn't it?

    Once the developers have reached their milestone and are ready for Test / Acceptance, the project is progressed to the next lifecycle environment. Now, depending on the size of your organisation your mileage will vary. Obviously larger enterprises have greater resources and can provide an ideal path for development. In saying that, being large enterprises, they also require the rigour around the development lifecycle.

    Test / Acceptance contains many sub-environments. You may have a UAT (User Acceptance Test), System Integration Testing (testing that disparate components communicate correctly), Systems Test (varying parts work together on their destined systems), etc...

    That's all fine and dandy, but an issue caught in Dev costs ten times less money and resources to fix than an issue caught in development.

    Once your product has passed those tests, it is ideal to perform Volume Testing. This environment is classified as Production for reasons revealed next. This is in an environment that mirrors your Production configuration - same hardware, memory, cpu, disk, etc... in terms of capability and performance. You take a representative sample of your Production data and perform load testing to ensure that your Production configuration meets your non-functional requirements (query times, backup windows, downtime restrictions, user responsiveness).

    If you wait until the end of the lifecycle to perform Volume Testing, you are begging for trouble. I've often seen code perform terribly at this stage. Now it's going to cost you twenty to thirty times as much time and resources to fix. Very efficient project management.

    Now you progress to Production.

    There are many issues hosting Production data in a lower lifecycle environment. Not the least of which is protecting the business - Production data is sensitive. It is what runs your business. In those organisations where auditing, especially legislative, is required - you will get nailed.

    Maybe, maybe not. In reality, only a few data elements out of an entire database are sensitive. In most businesses, the books are open to the public because they are a regulated business, or because stockholders want to see them. Only customer data is really private, and that usually isn't difficult to desensitize on the way down to Development.

    As I mentioned above, Production data is not suitable for a Development environment due to its fluid nature. Developers meet functional requirements and to do that they need a known good base level with static data to provide repeatable tests.

    Essentially, you prefer to wait to find issues in production. Not a good way to run a business, IMHO.

    So, to reiterate - use development data in development, use acceptance data in acceptance, use a representative sample of production data in volume test and production data in production. It'll save you both time and money, lower risk and increase protection. Anything else and you're on a short road to problems.

    At least you are consistent. 🙂

  • Steve Jones - Editor (5/13/2009)


    Andrew is pretty on here with what I think. The exceptions I'd make for production data are in a break/fix or User Acceptance Test environment where the people verifying things need to work with known quantities for their testing. And again, these aren't developers, they're business people (or clients) that validate the way things work.

    If you must test in a test spot, it should be set up and quickly torn down when tests aren't done. Developers, no matter what they say, don't NEED production data. They want it because it makes testing easier, and they can verify things with a business person, or use info from a business person. If that's the case, let the business person test.

    Business persons will test, but it is far and away more efficient to find bugs early in the development lifecycle. That means EVERY environment should have robust data to test with.

  • jparra (5/14/2009)


    Developers develop and their task is to meet functional requirements. This means that they can prove through each iteration of their development that functions / procedures / methods perform their function correctly and are regression tested. For this, they obviously require input data. They also need this data to be a known good input set that provides repeatable outputs.

    Um, no. It's easy to write brittle code that will pass a set of known regression tests. To write robust code, variability is needed. Unless your production users only run scripts from RoboTest.

    It is delusional to think that random production data is a better test of code "brittleness" than good regression tests simply because it's "production" or it's random or it's recent. Good regression tests are in fact based on production data and problems that the code has previously had with it.

    In fact, it was the inadequacy of testing against random production data and its likelihood of allowing the re-entry of previously fixed bugs ("brittleness" or worse) that led to the regression testing methodology in the first place. Anyone who is getting "brittle" code past the regression tests either has inadequate regression test sets or is intentionally gaming the system, which is unethical and IMHO really should be grounds for dismissal.

    That all said, there is one real exception to this: initial software releases do not have a sufficient production history to have developed adequate regression tests yet. Of course, you usually do not have adequate production datasets yet either. In this case you must rely on exhaustive functional testing, which is not nearly as good as when it is transitioned to true regression testing. In many cases this is supplemented with extensive Pilot testing.

    -- RBarryYoung, (302) 375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • In fact, it was the inadequacy of testing against random production data and its likelihood of allowing the re-entry of previously fixed bugs ("brittleness" or worse) that led to the regression testing methodology in the first place. Anyone who is getting "brittle" code past the regression tests either has inadequate regression test sets or is intentionally gaming the system, which is unethical and IMHO really should be grounds for dismissal.

    I agree with you in a perfect world scenario. All developers should be of the highest quality, develop perfect test scripts, and all users should use every application precisely as designed. Unfortunately, we all must live in the real world, not in a mythical utopia. There are precious few people that are that good at what they do, and they don't all work for my (or your) company.

    Just as important, IMO, is that performance testing should NEVER be left until the end of the development cycle. That is a recipe for disaster. I do not believe that performance testing against a set of data generated by a machine is effective performance testing.

  • jparra (5/14/2009)


    Barkingdog (5/13/2009)


    Good point:

    Of course, while it may not match production data, Red Gate's data generator (and others) is a good start in my opinion.

    Barkingdog

    But that's the whole point, isn't it? A tool or a tester will enter data that is expected. It's easy to write bulletproof code when your data is entered as expected by a simulation. End users will always find a way to enter the unexpected. Those are the conditions that need to be caught in testing. Any bozo can write code that will work if the user enters exactly what is expected.

    No, if a tester is only inputting what is classified as valid data, then they also are not doing their job. The whole point of testing is not only to prove that your function / procedure / method does its job correctly with valid inputs, but also to ensure that it handles incorrect inputs correctly.
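
    Andrew's point about exercising both valid and invalid inputs might look like this in a unit test (hypothetical validator, Python):

```python
# Hypothetical input validator: the kind of routine a tester must probe
# with both valid and invalid inputs.
def parse_quantity(raw):
    """Accept a positive integer quantity given as a string; reject anything else."""
    if not isinstance(raw, str) or not raw.strip().isdigit():
        raise ValueError(f"invalid quantity: {raw!r}")
    qty = int(raw.strip())
    if qty == 0:
        raise ValueError("quantity must be positive")
    return qty

def rejects(raw):
    # True if the validator correctly raises on a bad input.
    try:
        parse_quantity(raw)
    except ValueError:
        return True
    return False

# Valid inputs must pass...
assert parse_quantity("3") == 3
assert parse_quantity(" 12 ") == 12
# ...and invalid inputs - the unexpected things end users will enter -
# must be rejected, not silently mangled.
assert all(rejects(bad) for bad in ["", "abc", "-1", "1.5", None, "0"])
```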

    --
    Andrew Hatfield

  • jparra (5/14/2009)


    Andrew Hatfield (5/13/2009)


    It is concerning that your developers believe they not only require production data, but that they are unable to do their job without it. I'd be looking for new developers that actually knew what they were doing and what a software development lifecycle is all about.

    Wow. I am rather curious if you've years of development in a production environment.

    I've many years of experience working in software development houses including the banking and finance sector as well as various national, multi-national and government organisations.

    jparra (5/14/2009)


    Production data should only ever sit in Production. Period.

    How very DBAish of you. There is only one answer to every question. Not every circumstance is the same. This kind of one size fits all approach is what causes large projects to fail on a regular basis.

    I hate to use these mnemonics in a forum like this, but I'll make an exception. ROFLCopter. I'm certainly not a DBA - far from it. If the guys I worked with heard that, they'd also be laughing on the floor. I'm a solutions architect with a focus on infrastructure. I work with frameworks, methodologies and rules.

    jparra (5/14/2009)


    Developers develop and their task is to meet functional requirements. This means that they can prove through each iteration of their development that functions / procedures / methods perform their function correctly and are regression tested. For this, they obviously require input data. They also need this data to be a known good input set that provides repeatable outputs.

    Um, no. It's easy to write brittle code that will pass a set of known regression tests. To write robust code, variability is needed. Unless your production users only run scripts from RoboTest.

    The point of the tests is to ensure that the code meets its functional requirements. Input A goes in Process A, Process A computes and results in Output A or Output B. Test == Is Output A or Output B valid for Input A.

    I'd find it hard to believe that developers would work so hard just to pass regression tests, but not write code that actually worked as the requirements stated - and if they did, it's time for them to find another career.

    jparra (5/14/2009)


    Using Production data fails this requirement, as by definition, it changes - it's Production.

    If your code can't handle production data in testing, how is it going to handle it in production? That's kind of the point, isn't it?

    Code should be written to meet a set of functional requirements. One would hope that they are based on reality. If they're not, the problem lies with the requirements and business analysis - not the testing data.

    Production data is exactly that because it's passed validation to enter the system. It met a set of rules. Your test data does exactly the same thing - but without being real world live data.

    jparra (5/14/2009)


    Once the developers have reached their milestone and are ready for Test / Acceptance, the project is progressed to the next lifecycle environment. Now, depending on the size of your organisation your mileage will vary. Obviously larger enterprises have greater resources and can provide an ideal path for development. In saying that, being large enterprises, they also require the rigour around the development lifecycle.

    Test / Acceptance contains many sub-environments. You may have a UAT (User Acceptance Test), System Integration Testing (testing that disparate components communicate correctly), Systems Test (varying parts work together on their destined systems), etc...

    That's all fine and dandy, but an issue caught in Dev costs ten times less money and resources to fix than an issue caught in development.

    I'm assuming you mean caught in production so I'll run with that.

    Of course the earlier something Bad(tm) is caught, the better. But we all know that that isn't always the case - that's why we test. Testing is just as important as, if not more important than, the actual coding. QA shouldn't be something in the back room no one talks about, with the testers treated like second-class citizens.

    Builds should happen regularly - nightly if possible and breakages should be public to the entire development team. That includes testers, developers, writers, team leads, architects, etc...

    The point is test, test and test. If something gets through to production that the team wasn't aware of and hadn't accepted as a known bug, then the failure is in the process - not necessarily just the code. It means that the testers also didn't do their job.

    jparra (5/14/2009)


    Once your product has passed those tests, it is ideal to perform Volume Testing. This environment is classified as Production for reasons revealed next. This is in an environment that mirrors your Production configuration - same hardware, memory, cpu, disk, etc... in terms of capability and performance. You take a representative sample of your Production data and perform load testing to ensure that your Production configuration meets your non-functional requirements (query times, backup windows, downtime restrictions, user responsiveness).

    If you wait until the end of the lifecycle to perform Volume Testing, you are begging for trouble. I've often seen code perform terribly at this stage. Now it's going to cost you twenty to thirty times as much time and resources to fix. Very efficient project management.

    Obviously you test as much as you can along the way. But you need to remember that in large enterprise projects, it's not always possible to bring all the pieces together until near the end. That's why you have unit tests for specific functions / procedures / methods. That's why you have System Integration Testing for testing that the various components work together and meet their contracts. That's why you have User Acceptance Testing for testing that the end users can actually use the product.

    Primary Volume Testing does belong at the end of the lifecycle because you can rarely do it earlier. Sure, in a dev team of 5 for a user base of 50, working on a product with only a few components and dependencies, you have much more flexibility. But when you're working on a banking application that will be rolled out nationwide, or an intelligent traffic analysis and policing application, things are a little more complex.

    jparra (5/14/2009)


    Now you progress to Production.

    There are many issues hosting Production data in a lower lifecycle environment. Not the least of which is protecting the business - Production data is sensitive. It is what runs your business. In those organisations where auditing, especially legislative, is required - you will get nailed.

    Maybe, maybe not. In reality, only a few data elements out of an entire database are sensitive. In most businesses, the books are open to the public because they are a regulated business, or because stockholders want to see them. Only customer data is really private, and that usually isn't difficult to desensitize on the way down to Development.

    While it's true that usually only a few columns contain highly sensitive data, that isn't always the case. And it's not entirely the point. Once data or infrastructure leaves a Production environment, it's no longer subject to strict and rigorous Change Control. The instant it has left Production you lose control over it and its integrity is gone. Now I'm sure you and others will argue that you and the people at your organisation can be trusted, and I hope you can. But it's a simple rule and one that should never be broken, especially because it's "easier." That is the single worst sole argument for any decision. Meeting requirements balanced with cost effectiveness (which includes ease of use) should be the equation. DBAs especially are guilty of not wanting particular controls and are especially resistant to change. Just because it's how you've always done it or you think it is "easier" is no excuse for not implementing security.

    jparra (5/14/2009)


    As I mentioned above, Production data is not suitable for a Development environment due to its fluid nature. Developers meet functional requirements and to do that they need a known good base level with static data to provide repeatable tests.

    Essentially, you prefer to wait to find issues in production. Not a good way to run a business, IMHO.

    You've missed the point entirely. Sure, there are always bugs, and they're a pain, and no one wants them. But if your SDLC is working, then you should have a minimum of bugs once you reach production.

    jparra (5/14/2009)


    So, to reiterate - use development data in development, use acceptance data in acceptance, use a representative sample of production data in volume test and production data in production. It'll save you both time and money, lower risk and increase protection. Anything else and you're on a short road to problems.

    At least you are consistent. 🙂

    I try, thanks 🙂

    --
    Andrew Hatfield

  • While it's true that usually only a few columns contain highly sensitive data, that isn't always the case. And it's not entirely the point. Once data or infrastructure leaves a Production environment, it's no longer subject to strict and rigorous Change Control. The instant it has left Production you lose control over it and its integrity is gone. Now I'm sure you and others will argue that you and the people at your organisation can be trusted, and I hope you can. But it's a simple rule and one that should never be broken, especially because it's "easier." That is the single worst sole argument for any decision. Meeting requirements balanced with cost effectiveness (which includes ease of use) should be the equation. DBAs especially are guilty of not wanting particular controls and are especially resistant to change. Just because it's how you've always done it or you think it is "easier" is no excuse for not implementing security.

    I realize I don't post here, and perhaps I'm missing the point. But how can people reconcile this type of response (which I don't disagree with AT ALL) yet justify and post constantly that refreshing DEV from PROD via a restored backup is okey-dokey?

    Yes, when calculating totals and counts in reports and on screen, you need production numbers. Writing insert scripts (from prod; column specific if need be) isn't all that difficult - especially nowadays.
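
    A rough sketch of that column-specific insert-script idea (Python, with a hypothetical table and columns): pull only the rows and columns you need and emit INSERT statements, instead of restoring a whole backup.

```python
# Hypothetical rows already pulled from production (only the needed columns).
rows = [
    {"id": 1, "total": 42.50, "note": "O'Brien"},
    {"id": 2, "total": None,  "note": "refund"},
]

def insert_script(table, columns, rows):
    """Emit column-specific INSERT statements for a dev/QA refresh."""
    def literal(value):
        if value is None:
            return "NULL"
        if isinstance(value, (int, float)):
            return str(value)
        # Escape embedded single quotes for SQL string literals.
        return "'" + str(value).replace("'", "''") + "'"
    column_list = ", ".join(columns)
    return "\n".join(
        f"INSERT INTO {table} ({column_list}) "
        f"VALUES ({', '.join(literal(r[c]) for c in columns)});"
        for r in rows
    )

print(insert_script("dbo.Orders", ["id", "total", "note"], rows))
```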

    So, how, possibly, do I see post after post, article after article, recommending and promoting restoring a database from a different server - for data refresh purposes only - usually from Prod to Dev or Q/A - never mind the security issues that are caused - and maybe corruption of system databases?

    I just don't get it. And I never will. But thanks for listening.

  • RyderD (5/17/2011)


    I realize I don't post here, and perhaps I'm missing the point. But how can people reconcile this type of response (which I don't disagree with AT ALL) yet justify and post constantly that refreshing DEV from PROD via a restored backup is okey-dokey?

    Yes, when calculating totals and counts in reports and on screen, you need production numbers. Writing insert scripts (from prod; column specific if need be) isn't all that difficult - especially nowadays.

    So, how, possibly, do I see post after post, article after article, recommending and promoting restoring a database from a different server - for data refresh purposes only - usually from Prod to Dev or Q/A - never mind the security issues that are caused - and maybe corruption of system databases?

    I just don't get it. And I never will. But thanks for listening.

    I'll avoid my personal opinion for a moment, and simply discuss experience from the developer's side.

    One of the regular reasons that this is necessary is because of bugs. You *can* regenerate the data, possibly, if you know all the moving parts, or even the initial cause. You can probably even, if you have data_reader access on production, pull the problem rows and the full supporting hierarchy down to your Dev and QA environments, or just type the danged things in. I've worked on many a database where I've never even *seen* the front end, or only on very rare occasions, never mind being able to recreate some annoying piece of data that's barely described by the business other than 'record 4012 is broken on screen x'.

    I have just wasted somewhere between an hour and a day getting the problem data down to development, I haven't even worked on the ticket yet. The queue is now backing up. The DBAs who are trying to create 'specialized environments' for each bug can't restore that many databases at once.

    There used to be a team of five of us clearing an average of 5-10 tickets a day easily, with actual data from production being restored nightly so we could locate and clear bugs caused by a logic or user-entry error, and so we could locate the logic that allowed its entry. A change of this nature to non-production data, where you have to trace and track the particular bug as described by the end user, becomes ten times more difficult, either due to recreation or because you have to guess until you can locate the break-point. If I'm the SME on the database, sure, this probably isn't a problem. Support 10-20 of them as a developer and you're just hanging onto the rails.

    Now, for some opinions. 🙂 (Come on, you knew they were coming...)

    Is this good practice? Well... no. By preference somewhere in the development, regression testing, and user acceptance testing we probably should have had someone try using the ALT-#### characters in this field, or realized that when you turn on security setting A while using product catalog B and searching through catchall query Q you can see a product code belonging to a different catalog. But we didn't.

    And, well, we won't. If I could think of everything while I did development, or my QA folks could before they let it through UAT - where most of those users are busy with their real jobs and aren't given time to actually test beyond "Yeah, that does what I need it to" - then this would be a correct practice. Most places simply don't have the resources, or time, to dedicate to this best practice.

    So, is it wise? No, not really, and it's an auditing nightmare. I've had to be the one to install the data scrubber to clean up confidential information (names, etc) with GUIDs and translate the production to Dev equivalent for their one bug. I've been the one getting that scrubbed data too. It's not the best practice for security, but it increases ROI on corrective measures a few times from a developer cost standpoint. It's easier to locate the problem. It's easier to confirm the fix. You can directly write the data correction scripts that go along with the bug fix.

    This of course is the point that I believe was trying to be made previously... if your testing is good enough, you don't introduce the problems that cause this necessity.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • I'm not actually disagreeing that developers who are supporting production environments need access to a version of production data. Of course they do - there, in fact, is really no other way to help diagnose and fix some problems.

    I just think the data ONLY should be refreshed - not an entire database restore - ESPECIALLY if there are security concerns and if there are separate servers involved.

    This obviously is an organizational decision.

    Talk about potentially introducing other problems in to your development environment...

    🙂

    this would be what is known as a "pet peeve" from a "dinosaur".

Viewing 15 posts - 16 through 30 (of 37 total)
