Improving string comparisons that use LIKE

  • Hello

    I have a table (policy) with approximately 1.5 million rows in it.

    I have another table (userpolicy) with approximately 100k rows.

    The first table has a column called policyid

    The second table has a column called userpolicyid

    Policyid is a set code such as abc123

    userpolicyid has a code that could contain the policyid, e.g. AAabc1232015

    I want to select all the records from policy that exist in userpolicy along with some other info, so joining.

    I've tried joining with WHERE userpolicyid LIKE '%' + policyid + '%', but it's very slow.

    What are the alternatives?

    Thanks

    - Damian

    If you could create a column in the UserPolicy table that holds just the 6 characters (or whatever) you're matching on, or if the column values started with the values you're matching on, it would be a LOT faster, because the search could use an index on the matching fields.
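
    A minimal sketch of that second idea, assuming (purely for illustration) that the values in userpolicyid started with the policy code; the index name and the Policy table are made up. Without a leading wildcard, the LIKE predicate is sargable and can seek an index:

    create index ix_userpolicyid on UserPolicy ( userpolicyid );

    select p.policyid, up.userpolicyid
    from Policy p
    inner join UserPolicy up
        on up.userpolicyid like p.policyid + '%';  -- no leading %, so an index seek is possible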

    ... userpolicyid has a code that could contain the policyid, e.g. AAabc1232015 ...

    Could contain? One thing you could do is add a non-persisted computed column, index it, and then use it for the join. (SQL Server will let you index a computed column, even a non-persisted one, as long as the expression is deterministic and precise.) However, it would be more reliable if there were some consistent formatting pattern that could be applied to the column userpolicyid.

    create table UserPolicy
    (
        userpolicyid varchar(20) not null,
        policyid as substring(userpolicyid, 3, len(userpolicyid) - 6)
    );

    insert into UserPolicy ( userpolicyid ) values ('AAabc1232015');

    create index ix_policyid on UserPolicy ( policyid );

    select * from UserPolicy where policyid = 'abc123';
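
    The join itself might then look like this (the Policy table and its column names are assumed here, not taken from the original post):

    select p.*
    from Policy p
    inner join UserPolicy up
        on up.policyid = p.policyid;  -- plain equality, supported by ix_policyid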

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • DamianC (1/8/2016)


    Hello

    I have a table (policy) with approximately 1.5 million rows in it.

    I have another table (userpolicy) with approximately 100k rows.

    The first table has a column called policyid

    The second table has a column called userpolicyid

    Policyid is a set code such as abc123

    userpolicyid has a code that could contain the policyid, e.g. AAabc1232015

    I want to select all the records from policy that exist in userpolicy along with some other info, so joining.

    I've tried joining with WHERE userpolicyid LIKE '%' + policyid + '%', but it's very slow.

    What are the alternatives?

    Thanks

    The best alternative is to implement a better database design.

    You are violating one of the rules of First Normal Form, which is to ensure that each column is atomic, i.e. never smash separate pieces of information together in a single column. The AAabc1232015 value is obviously a combination of at least 3 separate elements (AA, abc123, and 2015); these should be stored in three separate columns.

    With your current database design, it will be either impossible to get good performance, or possible only with some very awkward workarounds; the far better choice is to revisit the design.
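
    A minimal sketch of such a design (the column names are my guesses at what the three elements mean, not something stated in the thread):

    create table UserPolicy
    (
        sourcecode char(2)    not null,  -- the 'AA' element
        policyid   varchar(6) not null,  -- the 'abc123' element
        policyyear char(4)    not null   -- the '2015' element
    );

    With the code in its own column, the join to Policy becomes a plain equality comparison that an index on UserPolicy(policyid) can support.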


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

    This smells like healthcare claims data. If so, then "userpolicyid" is something like PayerID + SubscriberID + DependentID, and the format is not coded consistently between external sources. That's the way it is with healthcare IT: data is consolidated from 100 different independently designed source systems, and then some poor guy is left trying to integrate it or perform analytics on it.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (1/11/2016)


    This smells like healthcare claims data. If so, then "userpolicyid" is something like PayerID + SubscriberID + DependentID, and the format is not coded consistently between external sources. That's the way it is with healthcare IT: data is consolidated from 100 different independently designed source systems, and then some poor guy is left trying to integrate it or perform analytics on it.

    And if data is delivered into the system like that (as often happens with data coming from external sources), then that makes it all the more important to clean up and separate out the individual elements in the data. When importing the data, you can apply custom logic for sources whose format differs from the others. Once all the data is in a single table, that becomes much harder.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (1/11/2016)


    Eric M Russell (1/11/2016)


    This smells like healthcare claims data. If so, then "userpolicyid" is something like PayerID + SubscriberID + DependentID, and the format is not coded consistently between external sources. That's the way it is with healthcare IT: data is consolidated from 100 different independently designed source systems, and then some poor guy is left trying to integrate it or perform analytics on it.

    And if data is delivered into the system like that (as often happens with data coming from external sources), then that makes it all the more important to clean up and separate out the individual elements in the data. When importing the data, you can apply custom logic for sources whose format differs from the others. Once all the data is in a single table, that becomes much harder.

    I've been in this spot, and I can tell you that type of rational approach doesn't work with healthcare claims data. In addition to dealing with hundreds of external sources, the policy numbers coming from the same source, within the same source file, are coded inconsistently. There can be multiple variations of vendor systems used in parallel at the same client site, the IDs are entered manually, and the input operators have little sticky notes on their monitors giving them directions about how to code the data. So, the internal consistency of multi-valued fields is mangled beyond hope before the data reaches the downstream datamart or reporting system, and no, you can't ask the client to fix their data entry or ETL process. If you have the resources, predictive analytics may get you to 99.5% reliability, but essentially this is not a downstream data modeling and analytics problem; it's a source data quality problem.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

    Eric, I guess I didn't make my point well enough. What I was trying to stress is the importance of normalizing data as early as possible. If you can normalize and sanitize, do so *before* combining the data with other data; afterwards it only gets harder, never easier.

    If the data is so ill-formatted that you cannot sanitize it, then there's nothing to do except manage your managers' expectations, with regard to both expected correctness and expected speed of information delivery. (And that speed refers not only to the execution speed of a query, but even more to the development effort.)


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Eric M Russell (1/11/2016)


    In addition to dealing with hundreds of external sources, the policy numbers coming from the same source, within the same source file, are coded inconsistently. There can be multiple variations of vendor systems used in parallel at the same client site, the IDs are entered manually, and the input operators have little sticky notes on their monitors giving them directions about how to code the data. So, the internal consistency of multi-valued fields is mangled beyond hope before the data reaches the downstream datamart or reporting system

    I can only support Hugo here: parse the data at the inbound point.

    Never let "unstructured data" into the database.

    Because data with no structure is just white noise.

    Think about it. If you parse on inbound:

    - you do it once per entity, when it's inserted, not hundreds of times per day for every entry ever inserted;

    - you are sure that everything which has been loaded is correctly recognised, and all reports become easy, quick and consistent;

    - all unrecognised records sit in Inbound Staging waiting for someone to check them.

    It's easy to set up an alert for such entries, and probably to add another parsing pattern to the inbound process which will successfully parse the data.
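
    A minimal sketch of that inbound-staging flow. All names here are made up, and it leans on a big assumption: that valid values follow the two-letters + six-character-code + four-digit-year pattern of the AAabc1232015 example.

    create table StagingUserPolicy
    (
        raw_userpolicyid varchar(50) not null,
        policyid         varchar(10) null,              -- filled in by the parse step
        parse_status     varchar(10) not null default 'PENDING'
    );

    -- Parse step: extract the 6-character code where the row matches the expected pattern.
    -- (The [A-Z]/[a-z] ranges only distinguish case under a case-sensitive collation.)
    update StagingUserPolicy
    set policyid     = substring(raw_userpolicyid, 3, 6),
        parse_status = 'OK'
    where raw_userpolicyid like '[A-Z][A-Z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';

    -- Anything still PENDING stays in staging for review; an alert can be driven off this.
    select * from StagingUserPolicy where parse_status = 'PENDING';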

    _____________
    Code for TallyGenerator

    As already stated, atomic values would be ideal, but sometimes you have to play the schema you're dealt. One solution would be full-text indexing; then, instead of a leading-wildcard LIKE, you could use WHERE CONTAINS(userpolicyid, @policyid). You could also go with an N-grams style solution using a tally table and an indexed view; a rough sketch of the tally-table part follows below.
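
    Here is that sketch, under the assumption (mine, not from the thread) that every policyid is exactly 6 characters; the Policy table name is also assumed. Every 6-character substring of userpolicyid is generated with an inline tally, then joined to Policy on plain equality:

    with tally as
    (
        select top (50) row_number() over (order by (select null)) as n
        from sys.all_columns
    )
    select p.policyid, up.userpolicyid
    from UserPolicy up
    inner join tally t
        on t.n <= len(up.userpolicyid) - 5
    inner join Policy p
        on p.policyid = substring(up.userpolicyid, t.n, 6);

    Persisting those substrings (the indexed-view part of the suggestion) would let the final equality join be supported by an index instead of being computed on every query.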

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (1/11/2016)


    As already stated, atomic values would be ideal, but sometimes you have to play the schema you're dealt. One solution would be full-text indexing; then, instead of a leading-wildcard LIKE, you could use WHERE CONTAINS(userpolicyid, @policyid). You could also go with an N-grams style solution using a tally table and an indexed view.

    A full-text index would not help here, as all the parts are in one word: full-text search matches whole tokens, and AAabc1232015 is a single token, so it cannot find abc123 inside it.

    _____________
    Code for TallyGenerator

  • Thanks for the response

    It's not healthcare, but the scenario described fits what I am facing, i.e. in certain circumstances data is inconsistently stored in the ERP.

    I agree with the principles of normalisation, and pretty much everything I do follows them; this is a bit of an exception.

    I really don't know why people design systems in this way (this column is actually a mix of optional free text and an actual policy id), but that's a separate issue.

    I think the only thing I can do is try to work out a pattern in the data and create some staging tables that split the column up.

    I just wondered if 'contains', 'text indexing', a better approach to the join, or something else might have worked.

    Reading your replies, it seems not, and I suspect I am just papering over cracks if I don't try to normalise this.

    Thanks again for the replies

    Damian.

    - Damian

    I don't see how FT indexing would help you. Piling on to what you and others have said: find the pattern in the data, split it out into individual columns, and create a normalized structure. You can reassemble the strings if you really need to, but get the work done up front in the staging area before storing the data. You may encounter some pain at first, but at least the pain will be limited to the staging tables and not to live production data.
