24x7 availability with SQL Server 2012 Standard Edition

  • Hello,

    I am using SQL Server 2012 Standard Edition SP3 and am trying to avoid a downtime window of almost 5 hours per day, from 12 to 5 AM. During this window I archive data older than x days to an OLAP database, delete it from the OLTP database, and then rebuild the indexes in OLTP. Since this is Standard Edition I cannot rebuild indexes online, and while the deletes are running there are table locks as well.
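
    To make the current process concrete, the nightly window does roughly the following (the table, column, and linked-server names below are just placeholders for illustration):

        -- Rough sketch of the 12 AM - 5 AM window; dbo.Orders, OrderDate,
        -- OLAPSRV and OlapArchive are made-up names.
        DECLARE @cutoff datetime = DATEADD(DAY, -90, GETDATE());  -- "older than x days"

        -- Archive expired rows to the OLAP database
        INSERT INTO OLAPSRV.OlapArchive.dbo.Orders
        SELECT * FROM dbo.Orders WHERE OrderDate < @cutoff;

        -- Delete them from OLTP in one shot; this is the part that escalates to a table lock
        DELETE FROM dbo.Orders WHERE OrderDate < @cutoff;

        -- Rebuild the indexes; ONLINE = ON is not available in Standard Edition, so this is offline
        ALTER INDEX ALL ON dbo.Orders REBUILD;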

    So I wanted to test whether this downtime can be avoided by switching the applications every night from one server to the other. My approach is to set up bidirectional transactional replication between Server 1 and Server 2 and have the apps point to one server and one database at any given point in time.

    By bidirectional replication I don't mean the exact textbook architecture but a slightly customized one. I am setting up transactional replication from Server 1 to Server 2 and also from Server 2 to Server 1 on the same database, but at any given point in time only one set of replication jobs is up and running; the other server's replication jobs are stopped and disabled. Also, when I configure transactional replication from Server 2 to Server 1, I am not using a snapshot, because it would throw errors as the database is already participating in replication. The distribution databases sit on their own servers.
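
    For the Server 2 to Server 1 leg, the subscription would be created without a snapshot, along these lines (publication, server, and database names are placeholders; the subscriber is assumed to already have the schema and data, which is the case here since the two databases are kept in sync):

        -- Run on Server 2 once it is configured as a publisher; all names are made up.
        USE Database1;
        EXEC sp_addsubscription
            @publication       = N'Pub_S2_to_S1',
            @subscriber        = N'SERVER1',
            @destination_db    = N'Database1',
            @subscription_type = N'push',
            -- 'replication support only' skips snapshot initialization entirely
            @sync_type         = N'replication support only';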

    1. On Day 1 at 12:00 AM, Server 1 Database 1 is in sync with Server 2 Database 2 (the data matches).

    2. On Server 2, the replication jobs are stopped and disabled. Apps point to Server 1, and the maintenance jobs run on Server 2, not on Server 1.

    3. On Day 2 at 12:00 AM, I will make sure the data is fully synced between Server 1 and Server 2, then stop the replication jobs from Server 1 to Server 2 and disable them.

    4. I will then enable the replication jobs from Server 2 to Server 1 and reseed the identity columns in all tables to match the current max values.

    5. On this day the maintenance jobs will run on Server 1, and replication flows from Server 2 to Server 1.

    6. I will make sure the data is synced up between the two servers by 12:00 AM and repeat the same process the next day.

    This reduces my downtime from 5 hours to probably less than 5 minutes a day. So far I haven't seen any issue other than reseeding the identity column values, which can be fixed through an agent job or a script (a sketch is below).
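
    The reseed itself is simple; something like this per table, run from the agent job (dbo.Orders and OrderID are placeholder names):

        -- Reseed the identity on the server that is about to become the publisher
        -- so new rows continue above the current maximum. Repeat per table, e.g.
        -- by looping over sys.identity_columns.
        DECLARE @maxid bigint;
        SELECT @maxid = ISNULL(MAX(OrderID), 0) FROM dbo.Orders;
        DBCC CHECKIDENT ('dbo.Orders', RESEED, @maxid);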

    Experts, I need your valuable input on this. I have only tested it on a dummy database with 5 tables. Do you see any issues with this kind of setup?

    Thanks a ton.

    “If your actions inspire others to dream more, learn more, do more and become more, you are a leader.” -- John Quincy Adams

  • It's complex, and I think step #3 can perhaps be more difficult than you imagine. How can you be sure the data is up to date? If you do maintenance, are you sure nothing will get lost?

    Also, there's no magic here. When the work to delete gets done, it needs to replicate, one way or the other.

    I'd focus on making your processes more efficient, and less impactful. I think trying to manage replication in a bi-directional fashion is asking for trouble.

  • Steve Jones - SSC Editor (5/24/2016)


    It's complex, and I think step #3 can perhaps be more difficult than you imagine. How can you be sure the data is up to date? If you do maintenance, are you sure nothing will get lost?

    Yes, the applications are based on Windows services, and just before maintenance kicks off these services are shut down. Also, from 11:00 PM to 12:00 AM there is very minimal or no activity at all.

    Also, there's no magic here. When the work to delete gets done, it needs to replicate, one way or the other.

    I am still replicating from Server 1 while the deletes are running on the other server, let's say Server 2. Usually the archive to OLAP, delete from OLTP, and index rebuild is a 4-5 hour operation, but I am giving it a whole day while I am also replicating. At the end of the day, when the apps are down, I perform the activities mentioned in the steps above.

    “If your actions inspire others to dream more, learn more, do more and become more, you are a leader.” -- John Quincy Adams

  • I will ask the one question I haven't seen anyone else ask. Are you doing this on your own to reduce the 5-hour downtime, or has this been mandated by upper management?

  • Mandated by upper management to reduce downtime.

    “If your actions inspire others to dream more, learn more, do more and become more, you are a leader.” -- John Quincy Adams

  • I still think this is more complex than fixing your process. Replication is brittle, and when it fails, it's a pain to reset. I'm also a bit concerned about you moving data on a subscriber, with the intention of making it a publisher the next day. I suspect this will fail regularly and cause issues.

    An archive process to move data and delete it doesn't need a table lock. If this is happening, you can rearchitect the process to not cause downtime. That's much easier than replication, in my opinion.
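
    As a rough illustration (all object names are made up), a chunked archive-and-delete keeps locks at the row/page level because each batch stays well under the lock-escalation threshold of roughly 5,000 locks:

        -- Move and delete in small batches instead of one big statement.
        -- dbo.Orders, OrderDate and dbo.Orders_ArchiveStaging are placeholders;
        -- a separate step ships the staging table to the OLAP server.
        DECLARE @cutoff datetime = DATEADD(DAY, -90, GETDATE());

        WHILE 1 = 1
        BEGIN
            DELETE TOP (2000)
            FROM dbo.Orders
            OUTPUT DELETED.* INTO dbo.Orders_ArchiveStaging
            WHERE OrderDate < @cutoff;

            IF @@ROWCOUNT = 0 BREAK;   -- nothing left in this pass

            WAITFOR DELAY '00:00:01';  -- let concurrent activity through between batches
        END

    Run that way, the archive can trickle along while the application is live, instead of needing a maintenance window.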

  • Steve Jones - SSC Editor (5/24/2016)


    I still think this is more complex than fixing your process. Replication is brittle, and when it fails, it's a pain to reset. I'm also a bit concerned about you moving data on a subscriber, with the intention of making it a publisher the next day. I suspect this will fail regularly and cause issues.

    An archive process to move data and delete it doesn't need a table lock. If this is happening, you can rearchitect the process to not cause downtime. That's much easier than replication, in my opinion.

    I have to agree with Steve. I would also look at the indexes you have on the tables. If you have good clustered indexes that don't need to be rebuilt (at least daily), you can actually rebuild your nonclustered indexes by building new ones to match the existing ones and then dropping the existing ones when done.
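
    As a sketch with made-up table, column, and index names, that looks like:

        -- Build a new nonclustered index with the same definition as the existing
        -- one, then drop the old one once the new one is in place.
        CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_v2
            ON dbo.Orders (CustomerID)
            INCLUDE (OrderDate, TotalAmount);

        DROP INDEX IX_Orders_CustomerID ON dbo.Orders;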

  • Lynn Pettis (5/24/2016)


    I have to agree with Steve. I would also look at the indexes you have on the tables. If you have good clustered indexes that don't need to be rebuilt

    And I have to agree with Lynn.

    A correctly defined clustered index will make the downtime required for creating it the last downtime you're ever going to need.

    What would be the correct clustered index?

    The one which fits your requirements:

    During this window I archive data older than x days to an OLAP database, delete it from the OLTP database

    So, there is some kind of datetime column which tells you how old the data is.

    This column must be chosen as the first column in your clustered index definition.

    Then new data will always be added to the tail of the index without causing fragmentation of the existing data, and expired data will be deleted from the other end of the index, removing contiguous blocks of data without causing fragmentation either.
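
    A sketch of such an index, with made-up names (the identity column is appended only to make the key unique):

        -- The datetime column leads the key, so inserts land at the tail of the
        -- index and purges remove a contiguous range from the head.
        CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderDate
            ON dbo.Orders (OrderDate, OrderID);

    If the table currently has a clustered primary key, that key can stay, just as a nonclustered primary key.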

    With a correctly defined clustered index you could even remove outdated data continuously, around the clock, in small chunks, say 100 rows at a time. A job running every minute can select the top 100 records that meet the "outdated" definition (there may be fewer than 100 found in any particular minute), copy them to OLAP, and remove them from OLTP.

    Such a background process won't have any noticeable effect on overall system performance.

    Downtime would be brought to zero.
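
    A rough sketch of the body of that every-minute job, with made-up names throughout (dbo.Orders, OrderID, OrderDate, linked server OLAPSRV, database OlapArchive, 90-day retention):

        DECLARE @cutoff datetime = DATEADD(DAY, -90, GETDATE());
        DECLARE @batch  TABLE (OrderID bigint PRIMARY KEY);

        -- Pick up to 100 of the oldest expired rows; with OrderDate leading the
        -- clustered index this is a cheap seek at the head of the index.
        INSERT INTO @batch (OrderID)
        SELECT TOP (100) OrderID
        FROM dbo.Orders
        WHERE OrderDate < @cutoff
        ORDER BY OrderDate;

        -- Copy them to the OLAP side...
        INSERT INTO OLAPSRV.OlapArchive.dbo.Orders
        SELECT o.*
        FROM dbo.Orders AS o
        JOIN @batch AS b ON b.OrderID = o.OrderID;

        -- ...then remove them from OLTP. In a real job you would make the copy
        -- and delete atomic (a distributed transaction) or otherwise idempotent.
        DELETE o
        FROM dbo.Orders AS o
        JOIN @batch AS b ON b.OrderID = o.OrderID;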

    _____________
    Code for TallyGenerator

  • For many years I worked at Totaljobs Group, where we relied on SQL replication to provide scale-out for our web sites. We had more experience with keeping replication going than most people.

    I think what you have planned will cause you data loss. Replication is too brittle to do what you want. In particular, stopping replication from A to B then hoping you can replicate from B to A without problems is a very brave decision, and not one that I would take.

    If there were ever issues with the original A to B replication that resulted in the data on B being down-level and not fixed before you replicated B to A, then you would lose data. Even with our experience with replication, on average about once every 18 months we would hit a problem that needed a fix from Microsoft to resolve. Some of these fixes took weeks to be ready, and we had to keep track of all the data repairs that were needed while the fix was pending and after it had been applied.

    From my viewpoint, preserving the data is far more important than downtime. There are business risks in having too much downtime, but there are probably higher risks in being up but running on down-level data.

    I completely agree with the other postings about reviewing your indexing. Improving that could speed up your overnight work to fit within your desired window. However, if your company is successful in what it does then it will grow and have more data, and eventually you will need to spend capital to upgrade the equipment that services it.

    Many organisations want a DBA to find a solution that does not require a capital spend, but sometimes the right thing is for the DBA to propose a spend. When you have optimised your indexes, it may be a good time to look at server memory - this could reduce disk activity and help speed up processing. Also look at CPU usage, and if you average over 50% busy and often peak at over 90% then a CPU upgrade may help. Talk to your storage people and see if anything can be improved for SQL disk access - your management should encourage this as anything done here that helps is likely to be low risk and without capital expenditure.

    But before you look at hardware, have a good look at indexing. If you are concerned about skill levels then talk to your Microsoft Account Manager - you may have some free consultancy time in your license agreement or they may be able to recommend a partner organisation with the skills that can help. Also, consultancy time is Opex, which may be easier to find than Capex.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
