Best way to merge two large client databases

  • titsiros (1/12/2016)


    The reason for the merge is because one client acquired the other and they don't want to maintain separate systems.

    I know it's a grand task because the db schema is fully normalized with plenty of FKs and Parent_ids. Having said that, I know I m not reinventing the wheel here so I was hoping there is a tool that can do this. The steps I need to take seem all programmable to me so in my opinion a toll should be able to do this, based on some configuration.

    Maybe so, but my intuition says there is an extremely small market for a tool and given the number of permutations possible in a database-merge scenario the chances of it doing everything you needed could be low. Even if it met every last requirement the cost would likely be prohibitive for most potential consumers.

    You still haven't said, are you blending the data as you merge the databases or are the datasets in each database mutually exclusive allowing for a simple append?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (1/12/2016)


    titsiros (1/12/2016)


    The reason for the merge is because one client acquired the other and they don't want to maintain separate systems.

    I know it's a grand task because the db schema is fully normalized with plenty of FKs and Parent_ids. Having said that, I know I m not reinventing the wheel here so I was hoping there is a tool that can do this. The steps I need to take seem all programmable to me so in my opinion a toll should be able to do this, based on some configuration.

    Maybe so, but my intuition says there is an extremely small market for a tool and given the number of permutations possible in a database-merge scenario the chances of it doing everything you needed could be low. Even if it met every last requirement the cost would likely be prohibitive for most potential consumers.

    You still haven't said, are you blending the data as you merge the databases or are the datasets in each database mutually exclusive allowing for a simple append?

    I am guessing the OP does not know this as he is privy to at most to one set of data , would be my guess. As has been mentioned, you need to ensure that you are not duplicating clients. What version of the data (if any) you will assign to the new database when a client exists in both sources? You'll need to do some grunge work here to compare by email, residential address, names of course. I dont think its an exact science and will probably require some manual review by subject matter experts to some degree.

    ----------------------------------------------------

  • MMartin1 (1/12/2016)


    Orlando Colamatteo (1/12/2016)


    titsiros (1/12/2016)


    The reason for the merge is because one client acquired the other and they don't want to maintain separate systems.

    I know it's a grand task because the db schema is fully normalized with plenty of FKs and Parent_ids. Having said that, I know I m not reinventing the wheel here so I was hoping there is a tool that can do this. The steps I need to take seem all programmable to me so in my opinion a toll should be able to do this, based on some configuration.

    Maybe so, but my intuition says there is an extremely small market for a tool and given the number of permutations possible in a database-merge scenario the chances of it doing everything you needed could be low. Even if it met every last requirement the cost would likely be prohibitive for most potential consumers.

    You still haven't said, are you blending the data as you merge the databases or are the datasets in each database mutually exclusive allowing for a simple append?

    I am guessing the OP does not know this as he is privy to at most to one set of data , would be my guess. As has been mentioned, you need to ensure that you are not duplicating clients. What version of the data (if any) you will assign to the new database when a client exists in both sources? You'll need to do some grunge work here to compare by email, residential address, names of course. I dont think its an exact science and will probably require some manual review by subject matter experts to some degree.

    From the original post I gather the OP has both DBs.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (1/12/2016)


    MMartin1 (1/12/2016)


    Orlando Colamatteo (1/12/2016)


    titsiros (1/12/2016)


    The reason for the merge is because one client acquired the other and they don't want to maintain separate systems.

    I know it's a grand task because the db schema is fully normalized with plenty of FKs and Parent_ids. Having said that, I know I m not reinventing the wheel here so I was hoping there is a tool that can do this. The steps I need to take seem all programmable to me so in my opinion a toll should be able to do this, based on some configuration.

    Maybe so, but my intuition says there is an extremely small market for a tool and given the number of permutations possible in a database-merge scenario the chances of it doing everything you needed could be low. Even if it met every last requirement the cost would likely be prohibitive for most potential consumers.

    You still haven't said, are you blending the data as you merge the databases or are the datasets in each database mutually exclusive allowing for a simple append?

    I am guessing the OP does not know this as he is privy to at most to one set of data , would be my guess. As has been mentioned, you need to ensure that you are not duplicating clients. What version of the data (if any) you will assign to the new database when a client exists in both sources? You'll need to do some grunge work here to compare by email, residential address, names of course. I dont think its an exact science and will probably require some manual review by subject matter experts to some degree.

    From the original post I gather the OP has both DBs.

    Hi, yes I believe that. I was implying how familiar he is with both. There will be at least one he is just diving into if it did not belong to his company prior. Or both if he was brought in as a consultant.

    ----------------------------------------------------

  • MMartin1 (1/12/2016)


    Orlando Colamatteo (1/12/2016)


    MMartin1 (1/12/2016)


    Orlando Colamatteo (1/12/2016)


    titsiros (1/12/2016)


    The reason for the merge is because one client acquired the other and they don't want to maintain separate systems.

    I know it's a grand task because the db schema is fully normalized with plenty of FKs and Parent_ids. Having said that, I know I m not reinventing the wheel here so I was hoping there is a tool that can do this. The steps I need to take seem all programmable to me so in my opinion a toll should be able to do this, based on some configuration.

    Maybe so, but my intuition says there is an extremely small market for a tool and given the number of permutations possible in a database-merge scenario the chances of it doing everything you needed could be low. Even if it met every last requirement the cost would likely be prohibitive for most potential consumers.

    You still haven't said, are you blending the data as you merge the databases or are the datasets in each database mutually exclusive allowing for a simple append?

    I am guessing the OP does not know this as he is privy to at most to one set of data , would be my guess. As has been mentioned, you need to ensure that you are not duplicating clients. What version of the data (if any) you will assign to the new database when a client exists in both sources? You'll need to do some grunge work here to compare by email, residential address, names of course. I dont think its an exact science and will probably require some manual review by subject matter experts to some degree.

    From the original post I gather the OP has both DBs.

    Hi, yes I believe that. I was implying how familiar he is with both. There will be at least one he is just diving into if it did not belong to his company prior. Or both if he was brought in as a consultant.

    I got the impression the OP wprks for a SaaS provider so is managing these as blackboxes woth respect to the data but knows the product DB well. My hunch is that this is an exercise in appending data while retaining RI more than it is a merge.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I have access and relatively good knowledge of both databases. Unfortunately this is not simply a task of appending the data as the system allows users to maintain lookups, categories and items which are then referenced by secondary tables such as invoices, etc. So to answer your question, the final/secondary tables (not referenced) will be appends but the primary tables will have to be a blend as these need to be shared by both clients.

    I understand that this might not be a common task, but I do see it happening often enough in mergers or even consolidation of different departments, locations, etc where they run standalone installations of a software.

  • Thanks for conforming. It sounds like a pretty routine data migration project but some of the concerns mentioned earlier are going to be a challenge, specifically with the user-provided content if you emd up having to match on free-text descriptions. I have done stuff like this in the past and inevitably ongoing patches are required once users get into the syatem and call out things that should or should not have been matched.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • titsiros (1/13/2016)


    I have access and relatively good knowledge of both databases. Unfortunately this is not simply a task of appending the data as the system allows users to maintain lookups, categories and items which are then referenced by secondary tables such as invoices, etc. So to answer your question, the final/secondary tables (not referenced) will be appends but the primary tables will have to be a blend as these need to be shared by both clients.

    I understand that this might not be a common task, but I do see it happening often enough in mergers or even consolidation of different departments, locations, etc where they run standalone installations of a software.

    We found no magic software to do this.

    Every merge of systems (or data) requires decisions to be made.

    Stand alone with same schema, as you know, means little to the actual data and it's use in each system.

    Part of the complexity is that in most systems, you have an identity key which is likely system generated and the user may never see, a unique value for an item, which may be the same if it is locations for the same company, and a description which could be the same or different. And everything that references that item can come into play.

    Be sure to include the sites involved in the process.

    I have always found no matter how well I knew the data, processes, etc. some details important to getting it right would surface.

  • I have done this and have written a script to do so.

    My situation is when customers of mine leave their merge subscribers disconnected longer than the retention period and I need to merge the data back together.

    I rely on the fact that merge replication has a rowguid for every table and therefore every row has a unique identifier.

    I can therefore insert the row in the other database with the SAME rowguid and then use this to look up the NEW foreign key reference.

    The logic is something like:

    For a foreign key, look up the row in the referenced table in the original database.

    Get the rowguid

    Look up the same rowguid in the referenced table in the NEW database

    This will give you your new foreign key

    My script is:

    1. Very database specific

    2. Relies on the our foreign key naming convention i.e. CustomerID is always the foreign key to the Customer Table

    3. Relies on the fact that every table has a rowguid column (this is a requirement for merge replication)

    4. Has an elaborate "priority" order for the tables. i.e. the Customer table needs to be inserted into the new database before the Customer Transactions

    5. Special handling of some tables in our database that don't quite fit the "standard" definitions.

    6. Took me a long time to write and get correct

    7. Every time I use it (several times now), I seem to find another change that needs to be made

    8. I always have cleanup work to do afterwards - i.e. A customer account was added in both places but there should only be one account. All the transactions etc need to be merged together into one account.

    If you need to do it, you will need to know your database and you will need to allocate a lot of development and testing time to the project.

    Be prepared - it is not an easy process.

  • I agree..that seems to be my strategy also. And I also have GUID so it sounds like its going to be very similar. Having said that, i need to worry about 550 normalized tables and some of them include tiered categories (MAIN/SUB/SUB2) and self-references (PARENT_ID)

    Now when it comes to the actual scripting, I seem to have three options I would appreciate your advise on:

    1) MERGE

    2) INSERT INTO SELECT

    3) individual INSERT statements by running a SELECT first (and manipulating the results into standalone INSERT statements)

    Please note i have a dozen records with over 10 million records so performance will come into play here

  • Option 3, I have found, is the simplest to understand and let's you handle an entire "record" (ie the primary parent and all children) as a unit of work but performance becomes a concern as the data volume grows.

    Set-based-wise I have used the pattern where I do an INSERT-SELECT with OUTPUT to capture system assigned IDs so I can use them in later inserts of the child rows. This approach also solves the whole pre-assigned ID blocks which is where this conversation started be ause you are simply adding the new rows, allowing the system to assign the next ID and reacting accordingly.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I think i prefer the insert into select option since its basically a the same as MERGE but with better performance.

    There are two reasons while i would like to avoid the third option: 1) huge scripts with inserts for millions of records 2) live databases might change after the scripts are created

  • titsiros (1/18/2016)


    I think i prefer the insert into select option since its basically a the same as MERGE but with better performance.

    There are two reasons while i would like to avoid the third option: 1) huge scripts with inserts for millions of records 2) live databases might change after the scripts are created

    Juat to be clear, option 3 is still dynamic, except it's row by row processing. If you have interface stored procs fronting your system, eg you should only create a new "customer profile" using a proc, and the incoming data is from a different system it could be the best option to ensure all rules in the interface procs are honored.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I deal with smaller databases and SQL express so performance may not be as much of an issue, but I use option 3.

    Regardless, you will need to insert all the rows for your primary tables first so you can then insert the data for tables with foreign keys and "remap" the foreign key column to the correct value or your data will not make sense anymore! This is where your performance will really suffer.

    My code (for each table) goes something like this:

    Use NewDatabase

    insert into NewDatabase..Table NewForeignKey, ColumnData, RowGuid

    select

    -- This remaps the old forign key to the current key based on a rowguid and the new / old databases.

    -- It relies on the Foreign Key table's data being inserted first!

    (select NewDatabase..NewForeignKeyTable.ForeignKey from NewDatabase..ForeignKeyTableName Internal_NewForeignKeyTable

    join OriginalDatabase..ForeignKeyTableName Internal_OriginalForeignKeyTable on Internal_OriginalForeignKeyTable.rowguid = Internal_NewForeignKeyTable.RowGuid

    where Internal_OriginalForeignKeyTable.ForeignKey = OriginalTable.ForeignKey

    ) NewForeignKey,

    -- Use all the data from the original database when inserting

    OriginalTable.ColumnData,

    -- make sure you insert the SAME rowguid into the table

    OriginalTable.rowguid,

    -- This selects the data from the original database that is not in the new database based on rowguid

    from OriginalDatabase..OriginalTableName OriginalTable

    where OriginalTable.rowguid not in (select rowguid from OriginalTableName)

    It uses an inner select and has a join on the table and the foreign key table in BOTH the new and the original database for EACH foreign key.

    As I said, my databases are relatively small so I get away with it. You may have issues with performance and may need other options.

    Option 2 may let you build up a Foreign key mapping table (using @@Identity or scope_identity() ) as you insert the primary rows.

  • I use a similar strategy but instead of having a nested select, I do an INNER join between the two databases

    insert into destinationDB.destinationTable (ID, s...)

    select destinationParentTable.ID, s...

    from sourceTable s

    inner join sourceParentTable sP on sP.ID=n.ID

    INNER join destinationParentTable dP on dP.NAME=sP.NAME

    left join destinationTable d on s.NAME=d.NAME

    where d.ID is null

    My only concern with this approach is that i cant find a way to log errors and commit the successes..any ideas?

Viewing 15 posts - 31 through 45 (of 48 total)

You must be logged in to reply to this topic. Login to reply