Tough problem

  • http://www.developmentnow.com/g/113_2005_5_0_0_527840/TRANSPOSE-QUERY-fat-fingers.htm

    i have been trying to figure out a way of doing this but havnt had much luck. Any ideas?

    www.sql-library.com[/url]

  • Here's the text:

    "I would like a query that pulls back pairs of numbers that should be the

    same, but one was mistyped.

    Specifically, I am looking for numbers where two neighboring digits were

    transposed in one of the numbers."

    Hmmm, I think I have some ideas...

     

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • Well, I can tell you very rapidly if any two numbers have a pair of digits transposed. The difference will always be divisible by 9. (Old accounting trick, I can show you why it works if you want.) But, that won't tell you which digits were transposed. It's also not a guaranteed check, as two totally different numbers can also have a difference that's divisible by 9.

    The Levenshtein algorithm mentioned in a reply to that post would seem to work, but, again, you can't guarantee what difference is triggering the count.

    My solution, off the top of my head, would be separate the digits into an array, and run a couple of loops for comparison. It would be extremely clunky, but, properly done, would have the advantage of maximum information (it would be able to tell you which two digits were transposed), and minimum false positives.

  • Yes, I found that mod 9 rule empirically using excel. there had to be something useful about the similarity and 'reverseness' of the numbers.

    That is definitely the way to go.  I think I've nearly(!) got a reasonable solution worked out... It's making my laptop creak a bit though at the moment.

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • Right I think I have it. I was hoping not to use any string functions, but I did in the end rather than using log10() a hundred times. I've used arithmetic to limit the rows as much as I can, then the string functions to do the final check. code follows...

    There is a problem that leading zeros are not ahndled, but we are talking about integer literals, I have assumed.

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • /***find pairs of numbers equal bar one transposition of adjacent digits**/

    /*************************************************************************/
    /*******code copyright (c) tim wilkinson (stax68) 2006********************/
    /*******free for non-commercial use***************************************/
    /*******compatibility: MS SQL Server v9.0 only ***************************/
    /*******other versions can be done...*************************************/
    /*************************************************************************/

    SET

    ANSI_NULLS ON

    SET ANSI_PADDING ON
    SET ANSI_WARNINGS ON
    SET ARITHABORT ON
    SET CONCAT_NULL_YIELDS_NULL ON
    SET NUMERIC_ROUNDABORT OFF
    SET QUOTED_IDENTIFIER ON
    SET NOCOUNT ON
    go

    /********************DDL and population********************/

    if exists(select * from sysobjects where name = 'someintegers' and xtype = 'U')
    drop table someintegers
    go

    create

    table someintegers(

    integer int identity

    ,

    char1 char(1)

    ,hash as power(10,(floor(integer/1000000000.0) % 10)-1)
    + power(10,(floor(integer/100000000.0) % 10)-1)

    + power(10,(floor(integer/10000000.0) % 10)-1)

    + power(10,(floor(integer/1000000.0) % 10)-1)

    + power(10,(floor(integer/100000.0) % 10)-1)

    + power(10,(floor(integer/10000.0) % 10)-1)

    + power(10,(floor(integer/1000.0) % 10)-1)

    + power(10,(floor(integer/100.0) % 10)-1)

    + power(10,(floor(integer/10.0) % 10)-1)

    + power(10,(floor(integer/1.0) % 10)-1)

    persisted

    ,

    magnitude as cast(floor(log10(integer)) + 1 as tinyint) persisted

    )

    go
    create clustered index IXC_someintegers on someintegers(magnitude,hash,integer)
    go
    insert someintegers(char1) select top 50000 null from syscolumns s1 cross join syscolumns s2
    go
    declare @rows_vc varchar(20)

    select

    @rows_vc = cast(power(cast(count(*) as bigint),2) as varchar(20)) from someintegers

    select

    'starting query over ' + @rows_vc + ' permutations: ' + convert(varchar,getdate(),113)

    START__________________________________________________________________

    /********************The query:*************************/

    select

    s1.integer, s2.integer

    from

    someintegers s1

    join

    someintegers s2

    on

    s1.magnitude = s2.magnitude and s1.hash = s2.hash and
    s1.integer < s2.integer

    and

    abs(s1.integer-s2.integer)%9 = 0

    --subsequent correction to fault as reported below

    and

    abs(s1.integer-s2.integer)/9.0/power(10,floor(log10(abs(s1.integer-s2.integer)/9)))

    between 1 and 8

    and

    stuff(s1.integer,cast(floor(log10(s1.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2,'')

    = stuff(s2.integer,cast(floor(log10(s2.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2,'')

    and substring(cast(s1.integer as varchar),cast(floor(log10(s1.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2)

    = reverse(substring(cast(s2.integer as varchar),cast(floor(log10(s2.integer))-floor(log10(abs(s1.integer-s2.integer)/9)) as int),2))

    where

    s1.magnitude > 1

    and

    s2.magnitude > 1

    --and (s1.integer = 54687 and s2.integer = 54867) --test condition as below

    order

    by s1.magnitude, s1.integer

    select

    'query ends: ' + convert(varchar,getdate(),113)

    END____________________________________________________________________
    go

    drop

    table someintegers

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • OMG!

    Thats ill. Ill, yet effective. Am still trying to get my head round how it works but somehow it does.

    Nice one.

    Thanks,

    www.sql-library.com[/url]

  • sorry that it's v9 only. It's only because of the performance enhancements used, which could be removed.

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • Using the hash as a computed column is very nice. That should refine your result set pretty nicely.

    ---

    I'm not sure what the point of the magnitude column is. I can't come up with a pair of numbers which would have the same hash, but different magnitudes. Especially as you then check the same condition again, with:

    and floor(log10(s1.integer)) = floor(log10(s2.integer))

    It would be easy enough to replace the reference to magnitude in the WHERE clause with:

    s1.integer > 10 AND s2.integer > 10

    ---

    With this line:

    and abs(s1.integer-s2.integer)/9/power(10,floor(log10(abs(s1.integer-s2.integer)))) between 1 and 8

    First of all, I think you should add a set of parens. I'm not sure what the order of precedence is for those divisions. But, I do know that (a/b)/c a/(b/c). Better to be explicit.

    I'm also not sure what this line is supposed to accomplish. Suppose:

    s1 = 54867, s2 = 58467

    s2 - s1 = 3600

    (s2 - s1)/9 = 400

    floor(log10(s2 - s1)) = 3

    power(10,floor(log10(s2 - s1))) = 1000

    So, unless I'm calculating it wrong, the value comes out to .4. Which is not between 1 and 8. And, yet, this pair of numbers clearly should be included.

    Even if the problem is being off by an order of magnitude, I'm still not sure what property you are trying to test for here. What would it prove if the value came out to 8.1?

    ---

    All in all, a very tight solution. I like it. How long did it take to run on your laptop?

  • >I'm not sure what the point of the magnitude column is.
    >I can't come up with a pair of numbers which would have the same hash, but different magnitudes.

    zero isn't represented in the hash value (not enough digits in a 32b int), so hash(180)=hash(18)=hash(80100).

    >Especially as you then check the same condition again, with:

    >and floor(log10(s1.integer)) = floor(log10(s2.integer))

     
    yes - that was left over from development, could be deleted, though see below re optimiser.

    >It would be easy enough to replace the reference to magnitude in the WHERE clause with:

    >s1.integer > 10 AND s2.integer > 10

    yes, but I deliberately tried to use magnitude, e.g. where s1.magnitude > 1 and s2.magnitude > 1 as it is the leading key of the clustered index - and I (half) thought it might be useful for the optimiser to use as a coarser hashing key, along with the quite low cardinality hash value. I haven't really had time to tune it so just tried what I thought would work best.

    >With this line:

    >and abs(s1.integer-s2.integer)/9/power(10,floor(log10(abs(s1.integer-s2.integer)))) between 1 and 8

    >First of all, I think you should add a set of parens. I'm not sure what the order of precedence is for >those divisions. But, I do know that (a/b)/c <> a/(b/c). Better to be explicit.

     
    It's left to right for identical operators (and all those of the same precedence?) - I don't like too many parenths, just as I allow SQL to use the default top-to-bottom order of parsing join clauses. Still you're probably right.

    >I'm also not sure what this line is supposed to accomplish. Suppose:

    >s1 = 54867, s2 = 58467

    >s2 - s1 = 3600

    >(s2 - s1)/9 = 400

    >floor(log10(s2 - s1)) = 3

    >power(10,floor(log10(s2 - s1))) = 1000

    >So, unless I'm calculating it wrong, the value comes out to .4.

    >Which is not between 1 and 8.
    >And, yet, this pair of numbers clearly should be included.

    >Even if the problem is being off by an order of magnitude,

    >I'm still not sure what property you are trying to test for here.
    >What would it prove if the value came out to 8.1?
    Yes, the problem is the order of magnitude. I need to use floor(log10(s2 - s1)) - 1.
    This clause was intended to eliminate those pairs where although the difference is a multiple of 9, it represents, even modulo 10, 9 multiplied by more than 9. Our target numbers will have the property that the digits in the appropriate positions never represent a figure > 9*8. There is a bit of redundancy in the join predicate, as you noted - but I like to give the optimiser as many options as possible before getting rid of anything.
    >All in all, a very tight solution. I like it.

    Thanks

    >How long did it take to run on your laptop?
     
    14s overall, 6.5s for the query - that's with 'top 50000'.
    500000 and you're talking a few minutes.

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • Ah. See, when I was working out the hash in Excel, 0 was represented by the first digit to the right of the decimal point (the tenths place). I didn't realize your hash was an integer.

  • I thought the hash was granular enough that it wasn't worth either using a bigint or devising some fiendishly economical but unreadable hash algorithm.

    BTW, I've updated the code correcting the error you spotted (which also occurred elsewhere in the code). I needed to divide the difference by 9 to ensure that there would be only 1 significant digit, so that I could reliably calculate the possible location of a switched digit pair. I must say I'm quite pleased with this one, and I might smugly get down to some idle tinkering with execution plans - cheaper than turtlewaxing a Ferrari!

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply