Generate CLuster IDs based on 2 columns

  • Hello guys,

    I have the following table with the following data:

    create table tbl1

    (

    ClusterID VARCHAR(20) null,

    ID1 VARCHAR(20) null,

    ID2 VARCHAR(20) null

    )

    GO

    insert into tbl1 tbl1 values

    (null, '1', '2');

    insert into tbl1 tbl1 values

    (null, '2', '3');

    insert into tbl1 tbl1 values

    (null, '1', '3');

    insert into tbl1 tbl1 values

    (null, '4', '5');

    insert into tbl1 tbl1 values

    (null, '5', '6');

    insert into tbl1 tbl1 values

    (null, '6', '7');

    GO

    Is there a way to generate some sort of a Cluster ID based on the columns ID1 and ID2?

    The result should be something like this:

    ClusterID ID1 ID2

    -------------------------

    CL1 1 2

    CL1 2 3

    CL1 1 3

    CL2 4 5

    CL2 5 6

    CL2 6 7

    Any help is much appreciated.

    Thank you!

    Andrei

  • What defines a cluster? You need something to indicate what each grouping should be.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Guys, thanks for the interest in solving this problem. Apologies for not giving enough information. I have it in my head and sometimes I have the wrong impression that just a few details are enough. I will try again 🙂

    I have a table in my database with company information. Unfortunately it does contain duplicated records. I was giving the task to find a solution to remove these duplicates. The problem is that the duplicates are not exact and I used some fuzzy algorithms to identify them.

    This is a sample of my table with company info:

    ID CompanyName

    ------------------------

    1 Alfa Inc

    2 Alfa

    3 Alfa Co.

    4 Beta New York

    5 Beta Incorporated

    6 Beta Co

    7 Beta Oy

    8 Omega

    After deduping I got the following table:

    ClusterID ID1 ID2

    -------------------------

    CL1 1 2

    CL1 2 3

    CL1 1 3

    CL2 4 5

    CL2 5 6

    CL2 7 5

    I will explain it a bit:

    The company with ID=1 is the same as the one with ID = 2 (Alfa Inc and Alfa). At the same time, the company with ID = 2 is the same as the company with ID=3. Also, ID=1 is the same with ID=3. That means that the first 3 lines should get the same CLusterID. The same with the next three lines (company name = Beta). They should all get a new Cluster ID => CL2.

    It is not a general rule that the Cluster is given to only 3 records. It can be only 2 or 5, or etc.

    Thanks

  • What cluster Id you would give to company named as "Alfa Beta Omega Ltd"?

    Do you check company names do match by the first word only?

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • I've found your sample inconsistent. I guess you will have all pairs of related Id's

    I have updated your sample to have all ID pairs. However my query should work even for incomplete pairs.

    But there is one important prerequisite: Id's for related items should be allocated sequentially. For example you should not have something like:

    1, 2

    1, 6

    1, 7

    2, 6

    2, 7

    3, 4

    3, 5

    4, 5

    6, 7

    Please also note that clustered index must be created to guarantee results!

    create table tbl1

    (

    ClusterID VARCHAR(20) null,

    ID1 VARCHAR(20) null,

    ID2 VARCHAR(20) null

    )

    GO

    insert into tbl1 values

    (null, '1', '2');

    insert into tbl1 values

    (null, '2', '3');

    insert into tbl1 values

    (null, '1', '3');

    insert into tbl1 values

    (null, '4', '5');

    insert into tbl1 values

    (null, '4', '6');

    insert into tbl1 values

    (null, '4', '7');

    insert into tbl1 values

    (null, '4', '8');

    insert into tbl1 values

    (null, '5', '6');

    insert into tbl1 values

    (null, '5', '7');

    insert into tbl1 values

    (null, '5', '8');

    insert into tbl1 values

    (null, '6', '7');

    insert into tbl1 values

    (null, '6', '8');

    insert into tbl1 values

    (null, '7', '8');

    insert into tbl1 values

    (null, '9', '10');

    GO

    select * from tbl1

    select cast(null as varchar(13)) as CID, ID1, ID2

    into #cid

    from tbl1

    create unique clustered index ixc_#cid on #cid(ID1, ID2)

    declare @cid int

    declare @idlist varchar(max)

    set @cid = 0

    UPDATE #cid

    SET @cid = CASE WHEN @idlist IS NULL OR

    (CHARINDEX('|'+ CAST( ID1 as varchar(11))+'|', @idlist,0) <= 0 AND

    CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) <= 0)

    THEN @cid + 1 ELSE @cid END

    , CID = 'CL' + CAST(@cid as VARCHAR(11))

    , @idlist = CASE WHEN @idlist IS NULL OR

    (CHARINDEX('|'+ CAST( ID1 as varchar(11))+'|', @idlist,0) <= 0 AND

    CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) <= 0)

    THEN '|' + CAST( ID1 as varchar(11)) + '||' + CAST( ID2 as varchar(11)) + '|'

    ELSE @idlist + '|' + CAST( ID2 as varchar(11)) + '|'

    END

    select * FROM #cid

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Thanks, but if I insert into tbl1 the following row in addition to your rows, the approach will not work anymore:

    insert into tbl1 values

    (null, '10', '1');

    This row should get CL1 as well as 1 is already clustered with CL1. The thing is that it is not me who creates this table. I receive it from a data entry department. they are doing manual checks and provide me companies that are duplicates and I have to generate an unique ID for them.

    Not every time the value in the column ID1 < ID2.

    Best regards

    Andrei

  • Sounds like you have a nightmare to deal with. It sounds like your "cluster" is based on the first word of the company name. Is that correct? How do you guarantee that will always work. What if you added Alpha Romeo Motor company? They are no longer part of your cluster because it isn't a dupe. Or you have Alpha Graphics (printing company) and Alpha Company (military supply)? We are willing and able to help but there just isn't much clarity to the problem still.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean, if the Data Entry department makes mistakes in finding matches, this is their problem, I don't care. My job is to create Cluster IDs based on what they provide me with.

    I have written the below code. It works fine and it is what I was searching for. The only small issue I have now is that its performance is low. I have to tell you that I expect a table of ~ 50000 records to process. nice, huh ... :-(.

    The code:

    ------------------------------

    drop table dbo.tbl1

    go

    create table dbo.tbl1

    (

    ClusterID VARCHAR(20) null,

    ID1 VARCHAR(20) null,

    ID2 VARCHAR(20) null

    )

    GO

    delete from tbl1;

    insert into tbl1 values

    (null, '1', '2');

    insert into tbl1 values

    (null, '2', '3');

    insert into tbl1 values

    (null, '1', '3');

    insert into tbl1 values

    (null, '4', '5');

    insert into tbl1 values

    (null, '5', '6');

    insert into tbl1 values

    (null, '6', '7');

    insert into tbl1 values

    (null, '7', '8');

    insert into tbl1 values

    (null, '9', '10');

    insert into tbl1 values

    (null, '11', '12');

    insert into tbl1 values

    (null, '13', '14');

    insert into tbl1 values

    (null, '14', '1');

    insert into tbl1 values

    (null, '14', '13');

    GO

    update dbo.tbl1 set clusterid=null

    go

    DECLARE @id1 VARCHAR(85)

    DECLARE @id2 VARCHAR(85)

    declare @found_cluster int

    declare @max_cluster int

    DECLARE cluster_cursor CURSOR FOR SELECT id1, id2 FROM tbl1

    OPEN cluster_cursor

    FETCH NEXT FROM cluster_cursor INTO @id1, @id2

    WHILE @@FETCH_STATUS = 0

    BEGIN

    set @max_cluster = (select MAX(clusterid) from dbo.tbl1)

    if @max_cluster is null

    set @max_cluster = 0

    set @found_cluster = (select min(clusterID) from dbo.tbl1

    where id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1)

    if @found_cluster is not null

    update dbo.tbl1 set clusterID = @found_cluster where id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1

    else

    update dbo.tbl1 set clusterid = @max_cluster + 1 where id1 = @id1 and id2=@id2

    FETCH NEXT FROM cluster_cursor INTO @id1, @id2

    END

    CLOSE cluster_cursor

    DEALLOCATE cluster_cursor

    GO

    select * from dbo.tbl1

    --------------------------------------

    Any help in improving the completion time is much appreciated.

    Thanks guys!

    Andrei

  • OK, sounds like you have inherited some dodgy data, but what can you do.

    My understanding of your problem is that you want to "cluster" CompanyNames where the first word is the same.

    I'm going to break my approach down into a few steps just for simplicity.

    So, first I would add a column, maybe called ShortName, and that would hold the first word of the CompanyName. You can always delete it afterwards. You can do that thus:

    alter table #ClusterMe add ShortName varchar(10)

    Then populate the shortnames:

    UPDATE #ClusterMe SET ShortName = CASE CHARINDEX(' ', CompanyName, 1)

    WHEN 0 THEN CompanyName

    ELSE SUBSTRING(CompanyName, 1, CHARINDEX(' ', CompanyName, 1) - 1)

    END

    Coolio. Now we have a column with ShortNames in.

    Now to create the clusters, I think you can use dense_rank for this like so:

    select CompanyName, ShortName, 'CL' + cast(dense_rank() over ( order by ShortName)as varchar(5)) as ClusterID from #ClusterMe

    Hope this helps,

    Jamie

  • andrei.burdun (7/26/2011)


    Thanks, but if I insert into tbl1 the following row in addition to your rows, the approach will not work anymore:

    insert into tbl1 values

    (null, '10', '1');

    This row should get CL1 as well as 1 is already clustered with CL1. The thing is that it is not me who creates this table. I receive it from a data entry department. they are doing manual checks and provide me companies that are duplicates and I have to generate an unique ID for them.

    Not every time the value in the column ID1 < ID2.

    Best regards

    Andrei

    I have mentioned it as important prerequisite, haven't I?

    If the decision to group companies are made by the business, I would suggest that business should allocate some kind of "Company Group Name" and set it to whatever they like.

    If companies are groupped by SQL, then you should allocate Cluster ID as part of this SQL, otherwise, I'm afraid, you will need to use cursor as you now do and it will be slow. However 50000 records doesn't sound a lot to me, even my iPad can handle it quite fast 😀

    To make your cursor a bit faster, you can reduce number of unnecessary updates: you should only update records where clusterid is not already set.

    ....

    if @found_cluster is not null

    update dbo.tbl1 set clusterID = @found_cluster where (id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1)

    and clusterID is null

    else

    update dbo.tbl1 set clusterid = @max_cluster + 1 where (id1 = @id1 and id2=@id2) and clusterID is null

    ...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • I am thinking there has to be a set based solution to this but my brain is too full with my own stuff today. 😉

    At any rate I moved a few things around inside you cursor which should help a little bit. Unfortunately **cough** cursors **cough** are slow beasts.

    DECLARE @id1 VARCHAR(85)

    DECLARE @id2 VARCHAR(85)

    declare @found_cluster int

    --declare @max_cluster int --don't really need this

    DECLARE cluster_cursor CURSOR FOR SELECT id1, id2 FROM tbl1

    OPEN cluster_cursor

    FETCH NEXT FROM cluster_cursor INTO @id1, @id2

    WHILE @@FETCH_STATUS = 0

    BEGIN

    set @found_cluster = (select min(clusterID) from dbo.tbl1

    where id1 in (@id1, @id2) or id2 in (@id1, @id2)

    if @found_cluster is not null

    update dbo.tbl1 set clusterID = @found_cluster where id1 in (@id1, @id2) or id2 in (@id1, @id2) and clusterID is null

    else

    begin

    --moved this inside so you don't do this every iteration.

    --also moved the if null and such to a single statement

    update dbo.tbl1 set clusterid = (select ISNULL(MAX(clusterid), 0) + 1 from dbo.tbl1)

    where id1 = @id1 and id2 = @id2 and clusterID is null

    end

    FETCH NEXT FROM cluster_cursor INTO @id1, @id2

    END

    CLOSE cluster_cursor

    DEALLOCATE cluster_cursor

    If I get caught up and have some time later this afternoon I will try to revisit this and see if there isn't a way to make this set based.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (7/26/2011)


    I am thinking there has to be a set based solution to this but my brain is too full with my own stuff today. 😉

    ...

    There is...

    At least one, and hope it will be faster than a cursor :hehe::

    --drop table tbl1

    create table tbl1

    (

    ClusterID VARCHAR(20) null,

    ID1 VARCHAR(20) null,

    ID2 VARCHAR(20) null

    )

    GO

    insert into tbl1 values

    (null, '1', '2');

    insert into tbl1 values

    (null, '1', '6');

    insert into tbl1 values

    (null, '1', '7');

    insert into tbl1 values

    (null, '2', '6');

    insert into tbl1 values

    (null, '2', '7');

    insert into tbl1 values

    (null, '3', '4');

    insert into tbl1 values

    (null, '3', '5');

    insert into tbl1 values

    (null, '4', '5');

    insert into tbl1 values

    (null, '6', '7');

    GO

    --drop table #cid

    select cast(null as varchar(30)) as CID, ID1, ID2

    into #cid

    from tbl1

    create unique clustered index ixc_#cid on #cid(ID1, ID2)

    declare @cid int

    declare @clcid char(13)

    declare @idlist varchar(max)

    set @cid = 0

    set @idlist = ''

    UPDATE #cid

    SET @cid = CASE WHEN @idlist = '' OR

    (CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) <= 0 AND

    CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) <= 0)

    THEN @cid + 1 ELSE @cid END

    , CID = rtrim(CASE WHEN CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) > 0

    THEN SUBSTRING(@idlist,

    CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) + 14,

    13)

    WHEN CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) > 0

    THEN SUBSTRING(@idlist,

    CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) + 14,

    13)

    ELSE 'CL' + CAST(@cid as VARCHAR(11))

    END)

    , @clcid = 'CL' + CAST(@cid as VARCHAR(11))

    , @idlist = CASE WHEN @idlist = '' OR

    (CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) <= 0 AND

    CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) <= 0)

    THEN @idlist + '|' + CAST( ID1 as char(11)) + '|~' + @clcid +'~|' + CAST( ID2 as char(11)) + '|~' +@clcid + '~'

    ELSE @idlist + '|' + CAST( ID2 as char(11)) + '|~' + @clcid + '~'

    END

    select * FROM #cid

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Sweet. I am guessing that is way faster than that old nasty cursor. Nice work! Will have to see if this meets the criteria for the OP but it looks like it does to the best of my understanding anyway.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Thank you to all those involved in this brainstorming. I really appreciate your help. Will go on with the last suggested approach. It seems correct and fast.

    Andrei

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply