Generate CLuster IDs based on 2 columns

Question

Post reply

Generate CLuster IDs based on 2 columns

andrei.burdun

SSC Rookie

Points: 31
More actions
July 25, 2011 at 8:25 am

#240396

Hello guys,

I have the following table with the following data:

create table tbl1

(

ClusterID VARCHAR(20) null,

ID1 VARCHAR(20) null,

ID2 VARCHAR(20) null

)

GO

insert into tbl1 tbl1 values

(null, '1', '2');

insert into tbl1 tbl1 values

(null, '2', '3');

insert into tbl1 tbl1 values

(null, '1', '3');

insert into tbl1 tbl1 values

(null, '4', '5');

insert into tbl1 tbl1 values

(null, '5', '6');

insert into tbl1 tbl1 values

(null, '6', '7');

GO

Is there a way to generate some sort of a Cluster ID based on the columns ID1 and ID2?

The result should be something like this:

ClusterID ID1 ID2

-------------------------

CL1 1 2

CL1 2 3

CL1 1 3

CL2 4 5

CL2 5 6

CL2 6 7

Any help is much appreciated.

Thank you!

Andrei

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

Sean Lange SSC Guru Points: 286562 More actions · Answer 1

What defines a cluster? You need something to indicate what each grouping should be.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

jpomfret7 SSCommitted Points: 1628 More actions · Answer 2

jpomfret7

SSCommitted

Points: 1628

July 25, 2011 at 2:20 pm

#1358141

I gave my input on this thread!

http://www.dbforums.com/microsoft-sql-server/1668584-generate-cluster-id-based-two-other-columns.html

andrei.burdun SSC Rookie Points: 31 More actions · Answer 3

Guys, thanks for the interest in solving this problem. Apologies for not giving enough information. I have it in my head and sometimes I have the wrong impression that just a few details are enough. I will try again 🙂

I have a table in my database with company information. Unfortunately it does contain duplicated records. I was giving the task to find a solution to remove these duplicates. The problem is that the duplicates are not exact and I used some fuzzy algorithms to identify them.

This is a sample of my table with company info:

ID CompanyName

------------------------

1 Alfa Inc

2 Alfa

3 Alfa Co.

4 Beta New York

5 Beta Incorporated

6 Beta Co

7 Beta Oy

8 Omega

After deduping I got the following table:

ClusterID ID1 ID2

-------------------------

CL1 1 2

CL1 2 3

CL1 1 3

CL2 4 5

CL2 5 6

CL2 7 5

I will explain it a bit:

The company with ID=1 is the same as the one with ID = 2 (Alfa Inc and Alfa). At the same time, the company with ID = 2 is the same as the company with ID=3. Also, ID=1 is the same with ID=3. That means that the first 3 lines should get the same CLusterID. The same with the next three lines (company name = Beta). They should all get a new Cluster ID => CL2.

It is not a general rule that the Cluster is given to only 3 records. It can be only 2 or 5, or etc.

Thanks

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 4

What cluster Id you would give to company named as "Alfa Beta Omega Ltd"?

Do you check company names do match by the first word only?

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 5

I've found your sample inconsistent. I guess you will have all pairs of related Id's

I have updated your sample to have all ID pairs. However my query should work even for incomplete pairs.

But there is one important prerequisite: Id's for related items should be allocated sequentially. For example you should not have something like:

1, 2

1, 6

1, 7

2, 6

2, 7

3, 4

3, 5

4, 5

6, 7

Please also note that clustered index must be created to guarantee results!



create table tbl1

(

ClusterID VARCHAR(20) null,

ID1 VARCHAR(20) null,

ID2 VARCHAR(20) null

)

GO

insert into tbl1  values

(null, '1', '2');

insert into tbl1  values

(null, '2', '3');

insert into tbl1  values

(null, '1', '3');

insert into tbl1  values

(null, '4', '5');

insert into tbl1  values

(null, '4', '6');

insert into tbl1  values

(null, '4', '7');

insert into tbl1  values

(null, '4', '8');

insert into tbl1  values

(null, '5', '6');

insert into tbl1  values

(null, '5', '7');

insert into tbl1  values

(null, '5', '8');

insert into tbl1  values

(null, '6', '7');

insert into tbl1  values

(null, '6', '8');

insert into tbl1  values

(null, '7', '8');

insert into tbl1  values

(null, '9', '10');

GO

select * from tbl1

select cast(null as varchar(13)) as CID, ID1, ID2

into #cid

from tbl1 

create unique clustered index ixc_#cid on #cid(ID1, ID2)

declare @cid    int

declare @idlist varchar(max)

set @cid = 0

UPDATE #cid

SET   @cid = CASE WHEN @idlist IS NULL OR

                       (CHARINDEX('|'+ CAST( ID1 as varchar(11))+'|', @idlist,0) <= 0 AND

                        CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) <= 0)

                  THEN @cid + 1  ELSE @cid END

    ,   CID = 'CL' + CAST(@cid as VARCHAR(11))

    ,   @idlist = CASE WHEN @idlist IS NULL OR

                       (CHARINDEX('|'+ CAST( ID1 as varchar(11))+'|', @idlist,0) <= 0 AND

                        CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) <= 0)

                  THEN '|' + CAST( ID1 as varchar(11)) + '||' + CAST( ID2 as varchar(11)) + '|'

                  ELSE @idlist  + '|' + CAST( ID2 as varchar(11)) + '|'

                  END 

select * FROM #cid

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

andrei.burdun SSC Rookie Points: 31 More actions · Answer 6

Thanks, but if I insert into tbl1 the following row in addition to your rows, the approach will not work anymore:

insert into tbl1 values

(null, '10', '1');

This row should get CL1 as well as 1 is already clustered with CL1. The thing is that it is not me who creates this table. I receive it from a data entry department. they are doing manual checks and provide me companies that are duplicates and I have to generate an unique ID for them.

Not every time the value in the column ID1 < ID2.

Best regards

Andrei

Sean Lange SSC Guru Points: 286562 More actions · Answer 7

Sounds like you have a nightmare to deal with. It sounds like your "cluster" is based on the first word of the company name. Is that correct? How do you guarantee that will always work. What if you added Alpha Romeo Motor company? They are no longer part of your cluster because it isn't a dupe. Or you have Alpha Graphics (printing company) and Alpha Company (military supply)? We are willing and able to help but there just isn't much clarity to the problem still.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

andrei.burdun SSC Rookie Points: 31 More actions · Answer 8

Sean, if the Data Entry department makes mistakes in finding matches, this is their problem, I don't care. My job is to create Cluster IDs based on what they provide me with.

I have written the below code. It works fine and it is what I was searching for. The only small issue I have now is that its performance is low. I have to tell you that I expect a table of ~ 50000 records to process. nice, huh ... :-(.

The code:

------------------------------

drop table dbo.tbl1

go

create table dbo.tbl1

(

ClusterID VARCHAR(20) null,

ID1 VARCHAR(20) null,

ID2 VARCHAR(20) null

)

GO

delete from tbl1;

insert into tbl1 values

(null, '1', '2');

insert into tbl1 values

(null, '2', '3');

insert into tbl1 values

(null, '1', '3');

insert into tbl1 values

(null, '4', '5');

insert into tbl1 values

(null, '5', '6');

insert into tbl1 values

(null, '6', '7');

insert into tbl1 values

(null, '7', '8');

insert into tbl1 values

(null, '9', '10');

insert into tbl1 values

(null, '11', '12');

insert into tbl1 values

(null, '13', '14');

insert into tbl1 values

(null, '14', '1');

insert into tbl1 values

(null, '14', '13');

GO

update dbo.tbl1 set clusterid=null

go

DECLARE @id1 VARCHAR(85)

DECLARE @id2 VARCHAR(85)

declare @found_cluster int

declare @max_cluster int

DECLARE cluster_cursor CURSOR FOR SELECT id1, id2 FROM tbl1

OPEN cluster_cursor

FETCH NEXT FROM cluster_cursor INTO @id1, @id2

WHILE @@FETCH_STATUS = 0

BEGIN

set @max_cluster = (select MAX(clusterid) from dbo.tbl1)

if @max_cluster is null

set @max_cluster = 0

set @found_cluster = (select min(clusterID) from dbo.tbl1

where id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1)

if @found_cluster is not null

update dbo.tbl1 set clusterID = @found_cluster where id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1

else

update dbo.tbl1 set clusterid = @max_cluster + 1 where id1 = @id1 and id2=@id2

FETCH NEXT FROM cluster_cursor INTO @id1, @id2

END

CLOSE cluster_cursor

DEALLOCATE cluster_cursor

GO

select * from dbo.tbl1

--------------------------------------

Any help in improving the completion time is much appreciated.

Thanks guys!

Andrei

Jamie Ashton Mr or Mrs. 500 Points: 587 More actions · Answer 9

OK, sounds like you have inherited some dodgy data, but what can you do.

My understanding of your problem is that you want to "cluster" CompanyNames where the first word is the same.

I'm going to break my approach down into a few steps just for simplicity.

So, first I would add a column, maybe called ShortName, and that would hold the first word of the CompanyName. You can always delete it afterwards. You can do that thus:

alter table #ClusterMe add ShortName varchar(10)

Then populate the shortnames:

UPDATE #ClusterMe SET ShortName = CASE CHARINDEX(' ', CompanyName, 1)

WHEN 0 THEN CompanyName

ELSE SUBSTRING(CompanyName, 1, CHARINDEX(' ', CompanyName, 1) - 1)

END

Coolio. Now we have a column with ShortNames in.

Now to create the clusters, I think you can use dense_rank for this like so:

select CompanyName, ShortName, 'CL' + cast(dense_rank() over ( order by ShortName)as varchar(5)) as ClusterID from #ClusterMe

Hope this helps,

Jamie

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 10

andrei.burdun (7/26/2011)

Thanks, but if I insert into tbl1 the following row in addition to your rows, the approach will not work anymore:

insert into tbl1 values

(null, '10', '1');

This row should get CL1 as well as 1 is already clustered with CL1. The thing is that it is not me who creates this table. I receive it from a data entry department. they are doing manual checks and provide me companies that are duplicates and I have to generate an unique ID for them.

Not every time the value in the column ID1 < ID2.

Best regards

Andrei

I have mentioned it as important prerequisite, haven't I?

If the decision to group companies are made by the business, I would suggest that business should allocate some kind of "Company Group Name" and set it to whatever they like.

If companies are groupped by SQL, then you should allocate Cluster ID as part of this SQL, otherwise, I'm afraid, you will need to use cursor as you now do and it will be slow. However 50000 records doesn't sound a lot to me, even my iPad can handle it quite fast 😀

To make your cursor a bit faster, you can reduce number of unnecessary updates: you should only update records where clusterid is not already set.



....

if @found_cluster is not null

update dbo.tbl1 set clusterID = @found_cluster where (id1 = @id1 or id2 = @id2 or id1 = @id2 or id2 = @id1)

and clusterID is null

else

update dbo.tbl1 set clusterid = @max_cluster + 1 where (id1 = @id1 and id2=@id2) and clusterID is null

...

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Sean Lange SSC Guru Points: 286562 More actions · Answer 11

I am thinking there has to be a set based solution to this but my brain is too full with my own stuff today. 😉

At any rate I moved a few things around inside you cursor which should help a little bit. Unfortunately **cough** cursors **cough** are slow beasts.



DECLARE @id1 VARCHAR(85)

DECLARE @id2 VARCHAR(85)

declare @found_cluster int

--declare @max_cluster int --don't really need this

DECLARE cluster_cursor CURSOR FOR SELECT id1, id2 FROM tbl1

OPEN cluster_cursor

FETCH NEXT FROM cluster_cursor INTO @id1, @id2

WHILE @@FETCH_STATUS = 0

BEGIN

	set @found_cluster = (select min(clusterID) from dbo.tbl1

	where id1 in (@id1, @id2) or id2 in (@id1, @id2)

	if @found_cluster is not null

		update dbo.tbl1 set clusterID = @found_cluster where id1 in (@id1, @id2) or id2 in (@id1, @id2) and clusterID is null

	else

		begin

			--moved this inside so you don't do this every iteration.

			--also moved the if null and such to a single statement

			update dbo.tbl1 set clusterid = (select ISNULL(MAX(clusterid), 0) + 1 from dbo.tbl1)

			where id1 = @id1 and id2 = @id2 and clusterID is null

		end

	FETCH NEXT FROM cluster_cursor INTO @id1, @id2

END

CLOSE cluster_cursor

DEALLOCATE cluster_cursor

If I get caught up and have some time later this afternoon I will try to revisit this and see if there isn't a way to make this set based.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 12

Sean Lange (7/26/2011)

I am thinking there has to be a set based solution to this but my brain is too full with my own stuff today. 😉

...

There is...

At least one, and hope it will be faster than a cursor :hehe::



--drop table tbl1

create table tbl1

(

ClusterID VARCHAR(20) null,

ID1 VARCHAR(20) null,

ID2 VARCHAR(20) null

)

GO

insert into tbl1  values

(null, '1', '2');

insert into tbl1  values

(null, '1', '6');

insert into tbl1  values

(null, '1', '7');

insert into tbl1  values

(null, '2', '6');

insert into tbl1  values

(null, '2', '7');

insert into tbl1  values

(null, '3', '4');

insert into tbl1  values

(null, '3', '5');

insert into tbl1  values

(null, '4', '5');

insert into tbl1  values

(null, '6', '7');

GO

--drop table #cid

select cast(null as varchar(30)) as CID, ID1, ID2

into #cid

from tbl1 

create unique clustered index ixc_#cid on #cid(ID1, ID2)

declare @cid    int

declare @clcid  char(13)

declare @idlist varchar(max)

set @cid = 0

set @idlist = ''

UPDATE #cid

SET   @cid = CASE WHEN @idlist = '' OR

                       (CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) <= 0 AND

                        CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) <= 0)

                  THEN @cid + 1  ELSE @cid END

    ,   CID = rtrim(CASE WHEN CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) > 0

                   THEN SUBSTRING(@idlist,

                                  CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) + 14,

                                  13)

                   WHEN CHARINDEX('|'+ CAST( ID2 as varchar(11))+'|', @idlist,0) > 0

                   THEN SUBSTRING(@idlist,

                                  CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) + 14,

                                  13)

                   ELSE 'CL' + CAST(@cid as VARCHAR(11))

               END)

    ,   @clcid = 'CL' + CAST(@cid as VARCHAR(11))

    ,   @idlist = CASE WHEN @idlist = '' OR

                       (CHARINDEX('|'+ CAST( ID1 as char(11))+'|', @idlist,0) <= 0 AND

                        CHARINDEX('|'+ CAST( ID2 as char(11))+'|', @idlist,0) <= 0)

                  THEN @idlist  + '|' + CAST( ID1 as char(11)) + '|~' + @clcid +'~|' + CAST( ID2 as char(11)) + '|~' +@clcid + '~'

                  ELSE @idlist  + '|' + CAST( ID2 as char(11)) + '|~' + @clcid + '~'

                  END 

select * FROM #cid

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Sean Lange SSC Guru Points: 286562 More actions · Answer 13

Sweet. I am guessing that is way faster than that old nasty cursor. Nice work! Will have to see if this meets the criteria for the OP but it looks like it does to the best of my understanding anyway.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

andrei.burdun SSC Rookie Points: 31 More actions · Answer 14

andrei.burdun

SSC Rookie

Points: 31

January 25, 2012 at 1:19 pm

#1438419

Thank you to all those involved in this brainstorming. I really appreciate your help. Will go on with the last suggested approach. It seems correct and fast.

Andrei