Improving string comparisons that use LIKE

  • Hello

    I have a table (policy) with approximately 1.5 million rows in it.

    I have another table (userpolicy) with approximately 100k rows.

    The first table has a column called policyid

    The second table has a column called userpolicyid

    Policyid is a set code such as abc123

    userpolicyid has a code that could contain the policyid, e.g. AAabc1232015

    I want to select all the records from policy that exist in userpolicy along with some other info, so joining.

    I've tried joining with WHERE userpolicyid LIKE '%' + policyid + '%', but it's very slow.

    What are the alternatives?

    Thanks

    - Damian

    If you could create a column in the UserPolicy table that holds just the 6 characters (or whatever) you're matching on, or if the column values started with the values you're matching on, it would be a LOT faster, because the search could use an index on the matching fields.
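
    A minimal sketch of that second idea, assuming (purely for illustration) that the values in userpolicyid started with the policy code; the index name and the Policy table are made up. Without a leading wildcard, the LIKE predicate is sargable and can seek an index:

    create index ix_userpolicyid on UserPolicy ( userpolicyid );

    select p.policyid, up.userpolicyid
    from Policy p
    inner join UserPolicy up
        on up.userpolicyid like p.policyid + '%';  -- no leading %, so an index seek is possible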

    ... userpolicyid has a code that could contain the policyid, e.g. AAabc1232015 ...

    Could contain? One thing you could do is add a non-persisted computed column, index it, and then use it for the join. (SQL Server will let you index a computed column, even a non-persisted one, as long as the expression is deterministic and precise.) However, it would be more reliable if there were some consistent formatting pattern that could be applied to the column userpolicyid.

    create table UserPolicy
    (
        userpolicyid varchar(20) not null,
        policyid as substring(userpolicyid, 3, len(userpolicyid) - 6)
    );

    insert into UserPolicy ( userpolicyid ) values ('AAabc1232015');

    create index ix_policyid on UserPolicy ( policyid );

    select * from UserPolicy where policyid = 'abc123';
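
    The join itself might then look like this (the Policy table and its column names are assumed here, not taken from the original post):

    select p.*
    from Policy p
    inner join UserPolicy up
        on up.policyid = p.policyid;  -- plain equality, supported by ix_policyid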

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • DamianC (1/8/2016)


    Hello

    I have a table (policy) with approximately 1.5 million rows in it.

    I have another table (userpolicy) with approximately 100k rows.

    The first table has a column called policyid

    The second table has a column called userpolicyid

    Policyid is a set code such as abc123

    userpolicyid has a code that could contain the policyid, e.g. AAabc1232015

    I want to select all the records from policy that exist in userpolicy along with some other info, so joining.

    I've tried joining with WHERE userpolicyid LIKE '%' + policyid + '%', but it's very slow.

    What are the alternatives?

    Thanks

    The best alternative is to implement a better database design.

    You are violating one of the rules of First Normal Form, which is to ensure that each column is atomic, i.e. never smash separate pieces of information together in a single column. The AAabc1232015 value is obviously a combination of at least 3 separate elements (AA, abc123, and 2015); these should be stored in three separate columns.

    With your current database design, it will be either impossible to get good performance, or possible only with some very awkward workarounds; the far better choice is to revisit the design.
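
    A minimal sketch of such a design (the column names are my guesses at what the three elements mean, not something stated in the thread):

    create table UserPolicy
    (
        sourcecode char(2)    not null,  -- the 'AA' element
        policyid   varchar(6) not null,  -- the 'abc123' element
        policyyear char(4)    not null   -- the '2015' element
    );

    With the code in its own column, the join to Policy becomes a plain equality comparison that an index on UserPolicy(policyid) can support.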


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

    This smells like healthcare claims data. If so, then "userpolicyid" is something like PayerID + SubscriberID + DependentID, and the format is not coded consistently between external sources. That's the way it is with healthcare IT: data is consolidated from 100 different independently designed source systems, and then some poor guy is left trying to integrate it or perform analytics on it.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (1/11/2016)


    This smells like healthcare claims data. If so, then "userpolicyid" is something like PayerID + SubscriberID + DependentID, and the format is not coded consistently between external sources. That's the way it is with healthcare IT: data is consolidated from 100 different independently designed source systems, and then some poor guy is left trying to integrate it or perform analytics on it.

    And if data is delivered into the system like that (as often happens with data coming from external sources), then that makes it all the more important to clean up and separate out the individual elements in the data. When importing the data, you can apply custom logic for sources whose format differs from the others. Once all the data is in a single table, that becomes much harder.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (1/11/2016)


    Eric M Russell (1/11/2016)


    This smells like healthcare claims data. If so, then "userpolicyid" is something like PayerID + SubscriberID + DependentID, and the format is not coded consistently between external sources. That's the way it is with healthcare IT: data is consolidated from 100 different independently designed source systems, and then some poor guy is left trying to integrate it or perform analytics on it.

    And if data is delivered into the system like that (as often happens with data coming from external sources), then that makes it all the more important to clean up and separate out the individual elements in the data. When importing the data, you can apply custom logic for sources whose format differs from the others. Once all the data is in a single table, that becomes much harder.

    I've been in this spot, and I can tell you that type of rational approach doesn't work with healthcare claims data. In addition to dealing with hundreds of external sources, the policy numbers coming from the same source, within the same source file, are coded inconsistently. There can be multiple variations of vendor systems used in parallel at the same client site, the IDs are entered manually, and the input operators have little sticky notes on their monitors giving them directions about how to code the data. So, the internal consistency of multi-valued fields is mangled beyond hope before the data reaches the downstream datamart or reporting system, and no, you can't ask the client to fix their data entry or ETL process. If you have the resources, predictive analytics may get you to 99.5% reliability, but essentially this is not a downstream data modeling and analytics problem; it's a source data quality problem.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

    Eric, I guess I didn't make my point well enough. What I was trying to stress is the importance of normalizing data as early as possible. If you can normalize and sanitize, do so *before* combining the data with other data; afterwards it only gets harder, never easier.

    If the data is so ill-formatted that you cannot sanitize it, then there's nothing to do except manage your managers' expectations, with regard to both expected correctness and expected speed of information delivery. (And that speed refers not only to the execution speed of a query, but even more to the development effort.)


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Eric M Russell (1/11/2016)


    In addition to dealing with hundreds of external sources, the policy numbers coming from the same source, within the same source file, are coded inconsistently. There can be multiple variations of vendor systems used in parallel at the same client site, the IDs are entered manually, and the input operators have little sticky notes on their monitors giving them directions about how to code the data. So, the internal consistency of multi-valued fields is mangled beyond hope before the data reaches the downstream datamart or reporting system

    I can only support Hugo here: parse the data at the inbound point.

    Never let "unstructured data" into the database.

    Because data with no structure is just white noise.

    Think about it. If you parse on inbound:

    - you do it once per entity, when it's inserted, not hundreds of times per day for every entry ever inserted;

    - you are sure that everything which has been loaded is correctly recognised, and all reports become easy, quick and consistent;

    - all unrecognised records sit in Inbound Staging waiting for someone to check them.

    It's easy to set up an alert for such entries, and probably to add another parsing pattern to the inbound process which will successfully parse the data.
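
    A minimal sketch of that inbound-staging flow. All names here are made up, and it leans on a big assumption: that valid values follow the two-letters + six-character-code + four-digit-year pattern of the AAabc1232015 example.

    create table StagingUserPolicy
    (
        raw_userpolicyid varchar(50) not null,
        policyid         varchar(10) null,              -- filled in by the parse step
        parse_status     varchar(10) not null default 'PENDING'
    );

    -- Parse step: extract the 6-character code where the row matches the expected pattern.
    -- (The [A-Z]/[a-z] ranges only distinguish case under a case-sensitive collation.)
    update StagingUserPolicy
    set policyid     = substring(raw_userpolicyid, 3, 6),
        parse_status = 'OK'
    where raw_userpolicyid like '[A-Z][A-Z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';

    -- Anything still PENDING stays in staging for review; an alert can be driven off this.
    select * from StagingUserPolicy where parse_status = 'PENDING';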

    _____________
    Code for TallyGenerator

    As already stated, atomic values would be ideal, but sometimes you have to play the schema you're dealt. One solution would be full-text indexing; then, instead of a leading-wildcard LIKE, you could use WHERE CONTAINS(userpolicyid, @policyid). You could also go with an N-grams style solution using a tally table and an indexed view; a rough sketch of the tally-table part follows below.
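
    Here is that sketch, under the assumption (mine, not from the thread) that every policyid is exactly 6 characters; the Policy table name is also assumed. Every 6-character substring of userpolicyid is generated with an inline tally, then joined to Policy on plain equality:

    with tally as
    (
        select top (50) row_number() over (order by (select null)) as n
        from sys.all_columns
    )
    select p.policyid, up.userpolicyid
    from UserPolicy up
    inner join tally t
        on t.n <= len(up.userpolicyid) - 5
    inner join Policy p
        on p.policyid = substring(up.userpolicyid, t.n, 6);

    Persisting those substrings (the indexed-view part of the suggestion) would let the final equality join be supported by an index instead of being computed on every query.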

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (1/11/2016)


    As already stated, atomic values would be ideal, but sometimes you have to play the schema you're dealt. One solution would be full-text indexing; then, instead of a leading-wildcard LIKE, you could use WHERE CONTAINS(userpolicyid, @policyid). You could also go with an N-grams style solution using a tally table and an indexed view.

    A full-text index would not help here, as all the parts are in one word: full-text search matches whole tokens, and AAabc1232015 is a single token, so it cannot find abc123 inside it.

    _____________
    Code for TallyGenerator

  • Thanks for the response

    It's not healthcare, but the scenario described fits what I am facing, i.e. in certain circumstances data is inconsistently stored in the ERP.

    I agree with the principles of normalisation, and pretty much everything I do follows them; this is a bit of an exception.

    I really don't know why people design systems in this way (this column is actually a mix of optional free text and an actual policy id), but that's a separate issue.

    I think the only thing I can do is try to work out a pattern in the data and create some staging tables that split the column up.

    I just wondered if 'contains', 'text indexing', a better approach to the join, or something else might have worked.

    Reading your replies, it seems not, and I suspect I am just papering over cracks if I don't try to normalise this.

    Thanks again for the replies

    Damian.

    - Damian

    I don't see how FT indexing would help you. Piling on to what you and others have said: find the pattern in the data, split it out into individual columns, and create a normalized structure. You can reassemble the strings if you really need to, but get the work done up front in the staging area before storing the data. You may encounter some pain at first, but at least the pain will be limited to the staging tables and not to live production data.
