Upgrade Woes

  • How'd you like your next SQL Server application upgrade to go like this BART system upgrade? Talk about a high profile upgrade.

    I'm sure there are some competent people who missed something. It happens all the time, and there doesn't appear to be any mention of how many other upgrades were done successfully. Apparently they have been working on the systems for 14 months without issues. At least without issues that caught my attention.

    Still, it makes me think of two problems with many systems like this. The first is that we never really have a full testing environment. In this case they used a virtual environment to stress test the software and thought they were prepared. But when it came online, the network switch was overloaded, something that probably wasn't simulated but apparently needed to be.

    The other thing is this was pushed out on a Wednesday when most of their previous work was on the weekends. The change occurred because of other issues earlier in the week. A change in process like this, trying to fix things and instead making things worse, is probably something most of us have experienced.

    Setting up a true testing environment that really simulates production is hard and expensive. Even for web apps, the companies that provide testing just make it too hard and too expensive to use. I've had quotes of $10,000 or more for a few hours of test time with a lot of clients: 500, 1000, or more. If this were a one-time thing, no problem. But when we'll likely hit some issue and need to test a few times, we can't justify this cost. Testing companies could really grow their revenue if they offered more reasonable prices. They'd have more business than they know what to do with.

    We also never seem to learn about process. Whether it was the IT guys or management trying to get this upgrade in quicker, it was a mistake. Unless there are major issues, it just doesn't make sense to try to get something done too quickly. Time and time again this has proven to be a bad idea in IT, and yet it continues to happen.

    I feel bad for the BART IT guys. They probably got a lot of flak for this, and hopefully no one lost their job. And I hope they've learned to just not make changes during working hours unless something is truly not working.

    Like the trains.

    Steve Jones

  • Reminds me of what happened when World of Warcraft was first released for online multiplayer. There was a huge outcry from users about servers lagging, network problems, etc., because the game proved so popular that Battle.net suddenly had to accommodate upwards of 100 000 simultaneous gamers. Not a situation which could be simulated very easily in a testing environment.

    In the end, the Blizzard guys fixed the issues pretty quickly and all the fuss was forgotten because the product was so great.

    -----------------

    C8H10N4O2

  • I find it humorous to see a MMORPG brought up here - I am a World of Warcraft junkie when I'm not at work. 

    However, they still have trouble handling their customer base from a technical standpoint. They can keep opening new servers, but until they truly stop people from creating new characters on overpopulated servers, they'll continue to have problems. Even as they roll out new patches, they still have issues, as last Tuesday proved: servers didn't come up until much later than expected. They honestly don't have a good way of simulating the reality of their customer load, and that creates a problem.

    As it is, they have database issues on top of bandwidth issues, latency issues, and customer issues. Ever see mail errors? My husband has lost items when mailing them to himself (and no, they didn't show up 30 days later), and their support team has been useless at finding out what happened. There have been times when I tried to read just a simple text message from them (like an in-game support request that they answered after I had logged off) and couldn't: mail database error. I ended up sending a new request and telling them to email me, since I know that their mail database isn't reliable.

    Again, they can't simulate their true player base and how hammered their servers will get - and that leads to problems and makes it hard to tell if their problems really have been remedied.

    Blizzard overall is great with customer support when dealing with the customer personally, but on the technical side, IMO, they seem to be lacking in resources, or maybe technical estimating, or something. I still see plenty of technical issues...

    As for BART, as an end user, I'd be pretty upset about the whole delay.  But being in the programming field, I understand the phrase that stuff happens.  (Okay, I'm being clean about it.)  You can't catch everything in a virtual simulation.  However, until there really is a way to duplicate an environment in a reasonably affordable fashion, companies can't always get a significant testing environment.  Like Steve said, it's hard and expensive to get it to what you need.

  • Steve,

    In addition to the test environment, you have to have a tested roll-back plan or, I would say, an Exit Strategy. I had 2 cases where the upgrade ran great in the development environment, even better at the next stage, the test environment, and then something failed during the production upgrade. It happens for various reasons.

    In the first case, one of the environment variables was missing on the production server during data migration. The previous server support person had removed it on purpose so that nobody would use the data import utility unless specifically requested. The problem was that I was doing this data migration after he had left the company.
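
    A pre-flight check run on each target server before the migration is one way to catch a missing environment variable early. The following is only a minimal sketch: the variable names are hypothetical and do not come from the original system.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`.

    `environ` defaults to the real process environment; passing a plain
    dict makes the check easy to exercise without touching the server.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def preflight_message(required, environ=None):
    """Build a one-line pass/fail message for a migration runbook."""
    missing = missing_env_vars(required, environ)
    if missing:
        return "FAIL: missing " + ", ".join(missing)
    return "OK"


# Hypothetical variable names, for illustration only.
REQUIRED = ["IMPORT_UTIL_HOME", "MIGRATION_DATA_DIR"]

# Example with a stand-in environment in which one variable is absent.
example_env = {"IMPORT_UTIL_HOME": "/opt/import"}
print(preflight_message(REQUIRED, example_env))
```

    Running the same check on development, test, and production makes a deliberate difference between servers visible before the migration starts rather than halfway through it.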

    The second case was an error by the application vendor. They sent us an update for one of the forms as part of the upgrade, and their form replacement script left 2 forms with the same name and version being loaded. On the development and test servers, the corrected version of the form was loaded into memory first and used. On the production machine, the form with UI errors was loaded first. Since it was the password change form, and we were forcing users to change passwords first thing after the upgrade, it was very relevant to us.
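
    A duplicate like this can be detected mechanically if the list of installed forms can be exported. The sketch below assumes each form is available as a record with `name` and `version` fields; that record layout and the sample form names are assumptions for illustration, not the vendor's actual format.

```python
from collections import Counter


def duplicate_forms(forms):
    """Return the sorted (name, version) pairs that appear more than once."""
    counts = Counter((form["name"], form["version"]) for form in forms)
    return sorted(key for key, count in counts.items() if count > 1)


# Hypothetical export of installed forms after running the vendor script.
installed = [
    {"name": "ChangePassword", "version": "2.1"},
    {"name": "ChangePassword", "version": "2.1"},  # stale copy left behind
    {"name": "Login", "version": "2.1"},
]

for name, version in duplicate_forms(installed):
    print(f"Duplicate form: {name} version {version}")
```

    A check of this kind does not depend on load order, so it would flag the problem on the development and test servers even when the corrected form happens to load first there.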

    Regardless of the real possibility that the upgrade or some part of it may fail anyway, you HAVE to have a good, comprehensive test plan covering every feature that you use, with expected results. And it should be superusers who know the data well, not the IT engineers, who perform all this testing.
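
    A test plan with expected results can be kept as plain data, so that whatever the superusers record is compared against it the same way every time. This is a rough sketch only: the feature names and result strings are made up for illustration.

```python
NOT_RUN = "NOT RUN"


def check_test_plan(plan, recorded):
    """Compare recorded results with the plan.

    Returns a list of (feature, expected, actual) tuples for every feature
    whose recorded result differs from the expected one, or was never run.
    """
    failures = []
    for feature, expected in plan.items():
        actual = recorded.get(feature, NOT_RUN)
        if actual != expected:
            failures.append((feature, expected, actual))
    return failures


# Hypothetical plan: every feature in use, with its expected result.
plan = {
    "change password": "password updated, user logged in",
    "monthly report": "totals match pre-upgrade report",
    "data import": "all rows imported",
}

# Hypothetical results written down by the superusers after the upgrade.
recorded = {
    "change password": "form shows UI errors",
    "monthly report": "totals match pre-upgrade report",
}

for feature, expected, actual in check_test_plan(plan, recorded):
    print(f"{feature}: expected '{expected}', got '{actual}'")
```

    Treating a feature that was never tested as a failure is a deliberate choice here, since an untested feature gives no evidence that the upgrade worked.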

    Regards, Yelena Varsha

  • If the President of the United States doesn't have a good exit strategy, why would you expect BART to have one? 

    There is no "i" in team, but idiot has two.

  • Good point

    Regards, Yelena Varsha
