Use Transactional Replication to build a data warehouse

  • I am wondering if Transactional Replication is a way to build up a large 'historical' database for reporting and archiving data, allowing our 'live' production database to stay 'lean and mean' and contain only current data.

    We have about 400 tables spread across 8 databases. About 100 of those tables contain business transaction data (as opposed to reference data like names/addresses, product SKUs, state codes, country codes, etc.) that we want to be able to periodically purge, keeping only the last 90 days or so in the 'live' database. I was hoping to use transactional replication to maintain a second set of databases from which we could do long-term reporting and from which data would ultimately be drawn to build a full-blown BI data warehouse. This means every new transaction (sales order, login history record, account credit, etc.) should be written to the replicated database within a minute or so of its being written to the 'live' database. Hence the idea of using transactional replication.

    But periodically (nightly), I'll want to purge anything in the 'live' database that is over, say, 90 days old, to keep the volume of data there to a minimum. However, I do NOT want those deletes to replicate to the historical database.
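    For illustration, the nightly purge I have in mind would look something like this (the table and column names are placeholders, and the batching is just to keep each delete transaction short):

        -- Hypothetical nightly purge: trim anything older than 90 days.
        -- Deleting in batches keeps transactions and log growth small.
        DECLARE @cutoff datetime;
        SET @cutoff = DATEADD(DAY, -90, GETDATE());

        WHILE 1 = 1
        BEGIN
            DELETE TOP (5000) FROM dbo.SalesOrder
            WHERE OrderDate < @cutoff;

            IF @@ROWCOUNT = 0 BREAK;  -- nothing left to purge
        END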

    Questions:

    1) Is transactional replication a good way to accomplish this -- or can somebody suggest something better?

    2) Is there a way to globally turn off the replication of DELETEs to the replicated database? I can see how to turn it off at the individual article level -- but I don't look forward to doing that for hundreds of tables.

    3) Can anybody else who has done this point out any 'gotchas' or problems that may be ahead if I go down this path?

    Thanks!

    Rob Schripsema

    Integra Software


  • You can specify that you do not want DELETEs to propagate to the Subscribers. That would mean, though, that even a legitimate delete that had to be done would never get propagated.

    For that reason I am not sure replication is the answer for a warehouse. I would suggest using an ETL process to get the data for warehousing.

    Instead of deleting everything, look into table partitioning:

    http://msdn.microsoft.com/en-us/library/ms345146%28SQL.90%29.aspx
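    For a feel of what that looks like, here is a minimal sketch of monthly partitioning (SQL 2005, Enterprise edition; the names and boundary dates are made up):

        -- Partition function: each month's data lands in its own range.
        CREATE PARTITION FUNCTION pfMonthly (datetime)
        AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');

        -- Partition scheme: map every partition to a filegroup.
        CREATE PARTITION SCHEME psMonthly
        AS PARTITION pfMonthly ALL TO ([PRIMARY]);

        -- A table created on the scheme is automatically split by date.
        CREATE TABLE dbo.SalesOrder
        (
            OrderID   int      NOT NULL,
            OrderDate datetime NOT NULL
            -- ...other columns...
        ) ON psMonthly (OrderDate);

        -- Aging out a month is then a near-instant metadata operation, e.g.:
        -- ALTER TABLE dbo.SalesOrder SWITCH PARTITION 1 TO dbo.SalesOrderArchive;
        -- (the archive table must have an identical structure on the same filegroup)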

    Just my .02 cents

    -Roy

  • If the only deleting done to the published articles is to purge old data, then set the "Do Not Replicate DELETE Statements" option under the Statement Delivery section of the Article Properties.

    That should work fine for a DW.
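    If you don't fancy clicking through hundreds of articles in the GUI, a one-off script along these lines should do the same thing (the publication name is a placeholder; depending on your setup, changing the delivery command may require letting it invalidate the snapshot):

        -- Run in the published database: flip every article to
        -- "Do Not Replicate DELETE Statements" (del_cmd = NONE).
        DECLARE @article sysname;

        DECLARE art_cur CURSOR FOR
            SELECT name FROM dbo.sysarticles;

        OPEN art_cur;
        FETCH NEXT FROM art_cur INTO @article;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            EXEC sp_changearticle
                @publication = N'MyPublication',
                @article     = @article,
                @property    = N'del_cmd',
                @value       = N'NONE',
                @force_invalidate_snapshot = 0,  -- set to 1 if SQL insists
                @force_reinit_subscription = 0;

            FETCH NEXT FROM art_cur INTO @article;
        END
        CLOSE art_cur;
        DEALLOCATE art_cur;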

    Converting oxygen into carbon dioxide, since 1955.

  • That's what I had thought would work. We're proceeding down this path...at least until we run up against an insurmountable problem. So far, so good.

    Thanks for your input.

    Rob Schripsema
    Propack, Inc.

  • Thanks for the reply. In normal processing we never actually delete anything anyway -- it gets flagged as 'inactive' but never leaves the db. That's part of our problem: we NEED to keep the historical data around, but it is growing to the point where it is bogging down production.

    In the rare instance where we actually need to delete, it will probably be a block of data, and we'll do it manually in our copy.

    Rob Schripsema
    Propack, Inc.

  • One caveat to check: if for some reason you need to drop the publication and recreate it, you have to go with 'Keep existing objects' in the Article Properties and have the Distribution Agent job skip error number 2627 (primary key violations). In that case I observed that SQL Server replication copies all the data for the tables in the articles all over again and starts pushing it to the subscriber. This results in (1) a delay in propagating changes to the subscriber and (2) disk I/O and CPU consumption for that period.

    If you are dealing with a large db (hundreds of GB or TB), keep this in mind. I saw this during my trial runs; it remains to be seen whether this is the default behaviour of SQL Server replication when 'keep existing objects' is chosen.

    If I find a solution I will update this page.

  • SQL experts, help please...

    We have a requirement to keep only 6 months of production data and move any data prior to that to the reporting database and maintain it there for reporting purposes.

    Production has 24 months of data:

    1) Create a Reporting Database with 24 months of data.

    2) Delete 12 months of data from the Production database but retain it in the Reporting database.

    3) Apply new additions / changes in Production to the Reporting database.

    My question: is this feasible with transactional replication -- moving daily changes to the reporting database while bypassing only the deletes done by a specific process? I would truly appreciate your thoughts...

    Thanks

    rv

  • You can set up transactional replication so that it does not replicate deletes, and you can also use filters. Using these, you can replicate your inserts and updates to your reporting server while preventing the deletes from being replicated.
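    For example, both behaviors can be set when the article is added (all publication and object names below are made up):

        -- Replicate inserts/updates for dbo.SalesOrder, suppress deletes,
        -- and only send rows matching the filter.
        EXEC sp_addarticle
            @publication   = N'MyPublication',
            @article       = N'SalesOrder',
            @source_owner  = N'dbo',
            @source_object = N'SalesOrder',
            @del_cmd       = N'NONE',              -- do not replicate DELETEs
            @filter_clause = N'IsArchived = 0';    -- horizontal (row) filter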

    Converting oxygen into carbon dioxide, since 1955.

  • I would not suggest transactional replication for a data warehouse. It would be an administrative nightmare, not to mention the difficulty of getting your data back out of the warehouse. If the data arrives in many short writes, you can cause a lot of disk fragmentation, which would eventually force you to rewrite the data. I would suggest you use SSIS to pull the data instead. That also gives you control over the data, so you can filter out unneeded rows and keep your disk cost lower.
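    The kind of incremental pull an SSIS package's source query might run, assuming a hypothetical watermark table that records how far the last load got:

        -- Pull only rows changed since the previous load.
        DECLARE @last datetime, @now datetime;
        SELECT @last = LastLoadedAt
        FROM dw.LoadWatermark
        WHERE TableName = 'SalesOrder';
        SET @now = GETDATE();

        SELECT OrderID, OrderDate, CustomerID, Amount
        FROM dbo.SalesOrder
        WHERE ModifiedDate > @last AND ModifiedDate <= @now;

        -- After a successful load, advance the watermark.
        UPDATE dw.LoadWatermark
        SET LastLoadedAt = @now
        WHERE TableName = 'SalesOrder';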

  • JDixon -- can you explain further? How is using SSIS less of an administrative nightmare than using Transactional Replication -- and how does it cause less fragmentation? If one wants to keep the reporting database reasonably in sync with production (i.e., getting new data within minutes of it being written to production), the writes are just as short, aren't they?

    What am I missing?

    Rob Schripsema
    Propack, Inc.

  • Rob, here is something you might want to think about: let us say something happens to your Distributor and your replication is dead in the water. How are you going to reinitialize the replication when you no longer have the data (you will probably have purged it off by then)?

    I would still recommend partitioning of tables. That way you have all the data on both your production and subscriber, reinitializing is easy, and you work with a small set of data in your day-to-day work. Reporting works off all the historical data on your reporting server or warehouse.

    Just my .02 cents.

    -Roy

  • Thanks Roy, good point. We have SQL Server 2005 64-bit Standard, so partitioning is not an option for us right now (table partitioning requires Enterprise edition).

    One thought was to keep the replicated database the same as production, add triggers to the replicated database, and move the changes to another historical database. But that may be too complex to implement.
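    Roughly what I mean, as a sketch (all names are made up; the trigger would live on the subscriber, where replicated inserts do fire triggers unless they are marked NOT FOR REPLICATION):

        -- On the replicated (subscriber) database: copy every insert
        -- into a separate historical database on the same instance.
        CREATE TRIGGER trg_SalesOrder_History
        ON dbo.SalesOrder
        AFTER INSERT
        AS
        BEGIN
            SET NOCOUNT ON;
            INSERT INTO HistoryDB.dbo.SalesOrder (OrderID, OrderDate, CustomerID, Amount)
            SELECT OrderID, OrderDate, CustomerID, Amount
            FROM inserted;
        END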

    I would appreciate if you have any other thoughts....

    Thanks,

    Ram

  • Actually, table partitioning is available from SQL 2005 onwards; before SQL 2005 there were partitioned views. Take a look at this white paper on table partitioning in SQL 2005:

    http://msdn.microsoft.com/en-us/library/ms345146%28SQL.90%29.aspx

    If that is not possible for you, you could create archive tables in a different filegroup and run a daily job that cleans out the current table after inserting into the archive. That way you keep the historical data and have a lean and mean current table in the primary filegroup.
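    A sketch of that daily job (SQL 2005+; the table and column names are made up). DELETE ... OUTPUT moves the rows into the archive table in a single statement, so a row is never in both tables or in neither:

        -- dbo.SalesOrderArchive sits on the archive filegroup and has
        -- no triggers or foreign keys (a requirement for OUTPUT INTO).
        DECLARE @cutoff datetime;
        SET @cutoff = DATEADD(DAY, -90, GETDATE());

        DELETE FROM dbo.SalesOrder
        OUTPUT deleted.OrderID, deleted.OrderDate, deleted.CustomerID, deleted.Amount
        INTO dbo.SalesOrderArchive (OrderID, OrderDate, CustomerID, Amount)
        WHERE OrderDate < @cutoff;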

    Just my .02 cents

    -Roy

  • Here is how we created a reporting environment, right, wrong or indifferent 🙂

    We had a requirement to keep the production db as small as possible for performance reasons as it interacts with sortation equipment.

    Using Transactional Replication, we replicate to a "Reporting" instance; everything in the db is replicated over to it. We then created a stored procedure on the publisher side to purge all data older than X number of days from the Publisher, and added that stored procedure as a replicated article. The key is setting the stored procedure article to replicate ONLY the execution call of the command. On the subscriber side the same procedure is created as a blank shell -- its commands do nothing. So when the purge is executed at the publisher, replication sends over only the stored proc call (and not the deletes that were done by it) to the subscriber, and nothing happens there.

    We run this purge weekly after the initial purge; the prod database is kept lean and mean, and all the data still exists in the reporting instance, where it is used for Crystal Reports, web reporting, SSAS, etc. There are some other underlying issues you run into, such as how to handle identity columns, FKs, etc., but once you figure out what needs to be done it can be accomplished fairly easily.
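    A sketch of the moving parts (object and publication names are made up; 'proc exec' is the alternative article type if the purge doesn't run inside a serializable transaction):

        -- 1) Publisher: the real purge.
        CREATE PROCEDURE dbo.usp_PurgeOldData
            @days int
        AS
        BEGIN
            SET NOCOUNT ON;
            DELETE FROM dbo.SalesOrder
            WHERE OrderDate < DATEADD(DAY, -@days, GETDATE());
        END
        GO

        -- 2) Publish only the EXECUTION of the proc, not the rows it deletes.
        EXEC sp_addarticle
            @publication   = N'MyPublication',
            @article       = N'usp_PurgeOldData',
            @source_owner  = N'dbo',
            @source_object = N'usp_PurgeOldData',
            @type          = N'serializable proc exec';
        GO

        -- 3) Subscriber: once the snapshot has delivered the proc, replace it
        --    with a do-nothing shell so the replicated call purges nothing.
        ALTER PROCEDURE dbo.usp_PurgeOldData
            @days int
        AS
        BEGIN
            RETURN;
        END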

    I have been using this method since SQL Server 2000, on the highest-transaction db that we have, for about 5 years now. The only real issue I have run into has been performance on the reporting side, with millions of rows to scan through after accumulating so much data. So what we did there was create an archive proc that moved, for example, the data for year 2005 from reporting into an archive database, which we then either leave active or store away. Of course we checked with the business to see what their requirement was for how far back they needed active data. On maybe 2 occasions I have had to attach the archive db files back for the business to get some data out of it. I can see where partitioning may be the way to go, as Roy says, but I have not personally explored that option yet, as what we have seems to work right now.

  • Thanks Roy I will go through the paper.

    Mike, I have bumped into the idea a few times, but your implementation makes it concrete. How do you manage disaster recovery for your reporting database? Have you ever come across a situation where you had to regenerate the snapshot for any of the articles? Our concern is that regenerating the snapshot might result in losing historical data.

    I would truly appreciate your thoughts on this.

    Thanks,

    Ram
