Data Loss or Downtime

  • I primarily support internal finance systems. Since these are not order-intake or revenue-generating databases, I'm going to assume that management would prefer less data loss with greater downtime, but this is a large assumption. My DR plan covers the mechanics of how to recover the data and applications; what is missing is the decision by the business about what they think is most important. Guess I have some meetings to schedule 🙁

    --Tim

  • It's not what's important to me personally as a DBA or developer, but what's important to the business: users, clients, and management. Technically speaking, a partial restore that's still functional is not even an option unless the database is partitioned in a way that supports it. For example, if your data is partitioned across separate filegroups, with all of the critical operational data kept separately from historical or non-essential reference data, then you can present the option to the business with a high degree of confidence that it will actually work as expected. You don't want to end up spinning your wheels and losing valuable time that could be better spent completing a full recovery. These are good questions, and it would be worthwhile to plan ahead and think about what data would be needed to support a minimal degree of application functionality, test the theory on a QA or staging server, and then have an alternate disaster recovery plan documented.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • It depends strongly on business needs. I work in healthcare; we need systems up 24x7 and people's lives literally depend on them, so Kimberly's statement is a valid one for me, although the point Gail makes about what data is missing is also important. There is no point getting a db online if critical data needed to run the system is missing. Also worth bearing in mind is the cost of downtime vs. loss of data. It costs money to maintain backups (disk space, off-site storage, technology to do the backup or compression...). It even costs money if you consider that I have to run a reindexing job in simple recovery mode, and during that time I will not have backups. How much is all that compared to the cost of losing data? A good DR strategy is worked out with due consideration of all these factors, putting business needs in the right order.
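
The cost question above can be framed for the business with a very rough model. The sketch below is only an illustration: the option names, the hourly cost figures, and the assumption that costs scale linearly with hours are all hypothetical and would have to come from the business, not from IT.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    downtime_hours: float    # how long the system stays offline
    data_loss_hours: float   # how many hours of recent work are lost


def total_cost(option, downtime_cost_per_hour, data_loss_cost_per_hour):
    """Linear cost estimate (an assumption): hours times hourly cost, summed."""
    return (option.downtime_hours * downtime_cost_per_hour
            + option.data_loss_hours * data_loss_cost_per_hour)


def cheapest(options, downtime_cost_per_hour, data_loss_cost_per_hour):
    """Return the option with the lowest estimated total cost."""
    return min(
        options,
        key=lambda o: total_cost(o, downtime_cost_per_hour, data_loss_cost_per_hour),
    )


# Hypothetical options for one outage.
options = [
    RecoveryOption("partial restore, online fast", downtime_hours=1, data_loss_hours=8),
    RecoveryOption("full restore, stay offline", downtime_hours=8, data_loss_hours=0),
]

# A system where being down is expensive and lost data is cheap to re-enter.
print(cheapest(options, downtime_cost_per_hour=1000, data_loss_cost_per_hour=100).name)
# A system where lost data is far more expensive than waiting.
print(cheapest(options, downtime_cost_per_hour=100, data_loss_cost_per_hour=5000).name)
```

The same two options come out in opposite order depending on the hourly figures, which is really the whole "it depends" argument in numbers. Backup storage and tooling costs are left out here and would shift the totals further.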

  • bsheets 73864 (1/7/2011)


    SQLArcher (1/7/2011)


    Hi,

    It depends on the nature of the system. I work in a financial institute with a lot of trading; in a downtime vs. partial data loss situation business has to weigh the cost of being down until the data is recovered, risk of reputational loss and increased revenue loss against a smaller risk of reputational loss and less loss of revenue with partial data.

    In most cases (in this scenario) it would be better to accept a partial data loss until it can be recovered, and get business up and running to mitigate the additional reputational and financial loss.

    I also work in financial services, and I would disagree with this assessment. Having some customers log in and have missing transactions would be worse than keeping everyone offline until all the data is restored - imagine the number of support calls from customers with missing transaction data.

    Also, trying to merge the missing data with a database that has had numerous changes since returning to an online state could be problematic, and could result in duplicate key issues.

    Unless it was static historical data that is missing, staying offline until all is recovered would be better.

    I think it would depend on whether or not the system is customer facing and revenue generating. If it is, I can't imagine any business-minded executive choosing to forego future revenue so the system can come up clean. Sure, it would be nice if it did, but if it didn't, that's fine too because it can be fixed. That's what we as DBAs get paid to do. I don't think anyone will be concerned about the number of support calls either. That's what CS gets paid to do, and it's not revenue impacting, so no biggie. In the end, like most business decisions, it's about the almighty dollar: maximizing revenue and minimizing loss. The amount of work it causes you, me, or customer service is irrelevant. For some service-oriented companies there are SLAs stating that for any and all downtime the service provider will share in the revenue loss with their customers. I worked for a financial company that had such a clause. As expected, uptime was their highest priority :-).

  • Oh well, I feel I am getting old.

    As I have several times before, I need to point out that the decision about what is more important (to be up ASAP or to be exactly up to date) is not up to the IT staff, DBAs included. It depends on the industry and the legal responsibility of the company or division, so ultimately (and unfortunately) it is the lawyers who have the last word, like it or not. (I don't.)

    In my case, the picture is fortunately simple: I am on a BI project, and for my team data off by a day or two is no biggie, it is not accounting or CRM, so as long as we present a reliable picture of the business, we are OK.

  • "It depends."

    But for our company, I would guess the execs would choose to wait for a full restore from backup. It wouldn't take that long and it would just mean that paperwork would pile up.

  • Although we have numerous data files and several filegroups in our main prod database, I don't think we could effectively restore a filegroup that would actually allow users to do work. So currently we'd have to wait for the full backup and subsequent differentials/incrementals to restore, which would be at least 6-8 hours assuming there was hardware available to restore onto.

    Obviously we're long overdue for a "hot/warm standby" type solution. We used to have log shipping to a standby server, but the decision was made to abandon that and move to Commvault for backups. Last night during a high-CPU period Commvault somehow decided the log backups were not happening (SQL Server says they were), so it launched a full backup this morning during the business day. That is expected to run for 15 hours or more, during which time log backups won't run since Commvault thinks the log chain is broken. Fun, eh?
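
To make the "full backup plus subsequent differentials and log backups" sequence concrete, here is a minimal sketch of picking which backups to restore, and in what order, for a given target time. It is a simplification under stated assumptions: backups are matched by finish time only, the `(kind, finished_at)` tuples are a hypothetical stand-in for a real backup catalog, and a real restore sequence is validated by LSNs rather than timestamps.

```python
from datetime import datetime


def restore_chain(backups, target):
    """Pick backups to restore, in order, to get as close to `target` as possible.

    backups: list of (kind, finished_at) tuples, kind in {"full", "diff", "log"}.
    Simplified rule: latest full, then the latest differential taken after it
    (if any), then every log backup taken after that base.
    """
    usable = sorted((b for b in backups if b[1] <= target), key=lambda b: b[1])
    fulls = [b for b in usable if b[0] == "full"]
    if not fulls:
        raise ValueError("no full backup available before the target time")
    full = fulls[-1]
    diffs = [b for b in usable if b[0] == "diff" and b[1] > full[1]]
    base = diffs[-1] if diffs else full
    logs = [b for b in usable if b[0] == "log" and b[1] > base[1]]
    return [full] + ([base] if diffs else []) + logs


# Hypothetical backup history for one day.
history = [
    ("full", datetime(2011, 1, 7, 0, 0)),
    ("log", datetime(2011, 1, 7, 6, 0)),
    ("diff", datetime(2011, 1, 7, 12, 0)),
    ("log", datetime(2011, 1, 7, 13, 0)),
    ("log", datetime(2011, 1, 7, 14, 0)),
]

for kind, finished_at in restore_chain(history, datetime(2011, 1, 7, 14, 30)):
    print(kind, finished_at)
```

Note that the 06:00 log backup is skipped because the noon differential already covers it. Every gap in the log backups after the chosen base shortens how far forward the restore can go, which is why a long window without log backups matters.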

  • I think my customers and my users would always rather wait a little longer and start working with all the data, so they don't need to check which data is lost and they don't have to enter data again 😉

  • I have a customer where the data is stored in a retrieval system other than SQL and only the metadata is in SQL. For the other retrieval system, uptime is more important; but for the metadata, completeness is more important.

  • I think the answer "it depends" is inescapable, but it raises the question of how much downtime you can afford (lost production, lost sales, customer inconvenience) versus the use of incorrect or incomplete data (incorrect pricing, erroneous decisions, production errors) and the consequential liability. Lost production for a day may be more acceptable than the consequences of missing or incorrect data; at least that has been my experience. Others may have different experience or have applications more forgiving of errors. I see the real problem as not disaster recovery, but disaster avoidance.

    The fact that we can pack more information into smaller spaces is not necessarily a good thing. The fact that we can run multiple virtual machines on a single piece of hardware simply means more machines go down when the hardware goes down. There is a false economy at work in reducing infrastructure costs when the probability and cost of catastrophic failure are not considered in the cost and benefit analysis. Similarly, that huge monolithic database may be a wonderful way to save the space required by duplication and physical partitioning, and the time required for transaction processing, but the potential downtime cost should be considered as well.

    These are common issues in the protection of property (cost versus benefit) and the same principles should apply to disaster avoidance for information as well. Unfortunately, too many IT managers do not consider disaster avoidance but leap to lowest cost.

  • GilaMonster (1/7/2011)


    It depends. Among other things it depends on what data is going to be missing.

    Thinking back to the bank there were some tables that we could do without during business hours but were critical for the overnight processes. There were other tables that we could do without for 3 weeks, but they had to be there (and complete) during the last week of the month. There were other tables where if the information in there was incomplete it was worse than if the system was completely offline.

    I have run into very similar needs at several locations. Where possible, we created multiple filegroups and placed tables into appropriate filegroups and created a recovery plan based on that. The purpose was to get back online in the event of failure as quickly as possible so as to not lose further revenue. At the same time we knew we could get the remaining data back online with minimal loss to the business.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
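
The filegroup-based recovery plan described above can be roughed out on paper before any restore is ever tested. The sketch below is an illustration only: the filegroup names, priorities, and restore times are hypothetical, and it assumes filegroups are restored one after another in priority order, with priority 1 meaning "required before users can work".

```python
def restore_plan(filegroups):
    """Order filegroups by priority and estimate when the system is usable.

    filegroups: list of (name, priority, restore_hours) tuples.
    Returns (schedule, online_at, complete_at), where schedule is a list of
    (name, hours_elapsed_when_done) and online_at is the elapsed time once
    the last priority-1 filegroup has been restored.
    """
    ordered = sorted(filegroups, key=lambda fg: fg[1])
    elapsed = 0.0
    online_at = None
    schedule = []
    for name, priority, hours in ordered:
        elapsed += hours
        schedule.append((name, elapsed))
        if priority == 1:
            online_at = elapsed
    return schedule, online_at, elapsed


# Hypothetical layout: critical operational data separated from history.
filegroups = [
    ("HISTORY", 3, 5.0),
    ("PRIMARY", 1, 0.5),
    ("CURRENT_TXN", 1, 1.5),
    ("REFERENCE", 2, 1.0),
]

schedule, online_at, complete_at = restore_plan(filegroups)
print("usable after", online_at, "hours; complete after", complete_at, "hours")
```

With these made-up numbers the system would be usable after 2 hours instead of 8. The estimate is only as good as the measured restore times, so it is worth replacing the guesses with timings from an actual test restore on a QA or staging server.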

  • It would be helpful in the documentation of the disaster recovery and other processes to actually practice partial data recovery. Knowing how much time it takes to merge the data back in based on similar experience would be helpful for all involved.

  • It depends. We effectively have two systems, one for managing the business and the other for reporting. For the management system, downtime is more important. For the reporting system data loss is more important.

  • What is more important to you: downtime or data loss?

    For a typical OLAP system, reference data may be more important than transactional data in the short term, so long as you have a plan in place to recover the transactional records. For something like an eCommerce website, it's critical that the Product and ProductPrice tables be as recent as possible. Also, if the Customer table isn't entirely recovered, then many users will be unable to log in and place orders, even if the website itself is up. For partial recovery, it helps to keep reference tables, recent transactional records, and historical transactional records for prior accounting periods in separate filegroups. It also helps if your website caches daily orders to an intermediate data store, because that too can be leveraged for restoration.

    However, it's not what's important to me, but what's important to the business. The appropriate approach is for the DBA to educate themselves on all available options and then present them to executive management. In the aftermath of a disaster, you don't want management to say "While you were in the process of doing ABC the system was down for three days, so why didn't you instead do XYZ first and then ABC?"

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Xavon (7/10/2015)


    It depends. We effectively have two systems, one for managing the business and the other for reporting. For the management system, downtime is more important. For the reporting system data loss is more important.

    Unless we're switching meanings for downtime and data loss, I would disagree with that. For your transaction system, avoiding data loss is critical even if it means more downtime. Do you think a customer is going to be happy if the $X order they placed and were billed for is lost because you decided that recovering quickly was more important than data loss?

    And inversely, as long as the source data is intact, the data warehouse can always be rebuilt; even if it means more downtime, people can wait a while for their reports.

    But yes, the answer is that it depends on the system and what impact downtime vs. data loss has.

