LIKE Pattern Matching

  • I have Base-62 ID's that we use for some of our data. During data loading, we have a check constraint that uses a function that evaluates the ID to see if it is made up of a-z, A-Z or 0-9.

    The problem that I have is that the function evaluates the ID '40065*0â007d' as correct even thought it contains the character 'â'.

    What is supposed to happen is that this ID is evaluated to be invalid and therefore rejected into an error table for correction.

    The problem appears to be with the use of the LIKE SUBSTRING code in the function :

    DECLARE @ID CHAR(12)

    DECLARE @CharPattern CHAR (70)

    SELECT @ID = '40065*0â007d'

    SELECT @CharPattern = '[-+*][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]'

    BEGIN

    IF (ISNUMERIC (SUBSTRING (@ID, 1, 5)) = 1) AND (SUBSTRING (@ID, 6, 7) LIKE @CharPattern)

    BEGIN

    SELECT @ID AS 'OK'

    END

    ELSE

    BEGIN

    SELECT @ID AS 'Not OK'

    END

    END

    In the case above, I am expecting it to return 'Not OK' as the ID is invalid, but it returns 'OK' and therefore allows the data to be laoded into the database.

    The database is Case Sensitive and the ID contains either a '-', '+' or '*' as the separator between the first and second parts of the ID.

    I have tried various combinations of patterns for the matching but nothing seems to work.

    Any help would be greatly appreciated!

  • Sounds to me it related to the collation your are using in the database. Can you try to use unique code like following?

    DECLARE @ID NCHAR(12)

    DECLARE @CharPattern NCHAR (70)

    SELECT @ID = N'40065*0â007d'

    ....

     

  • If you use binary comparation, it works. Depends on your collation:

    Change the line

    IF (ISNUMERIC (SUBSTRING (@ID, 1, 5)) = 1) AND (SUBSTRING (@I

    D, 6, 7) LIKE @CharPattern)

    to the following if your collation is Latin,

    IF (ISNUMERIC (SUBSTRING (@ID, 1, 5)) = 1) AND (SUBSTRING (@I

    D, 6, 7) LIKE @CharPattern collate Latin1_General_BIN)

    or if it's Romanian:

    IF (ISNUMERIC (SUBSTRING (@ID, 1, 5)) = 1) AND (SUBSTRING (@I

    D, 6, 7) LIKE @CharPattern collate Romanian_BIN)

     

  • I have tried :

    DECLARE @ID NCHAR(12)

    DECLARE @CharPattern NCHAR (70)

    SELECT @ID = N'40065*0â007d'

    It still returns that the value is valid when it is not.

    The collation on the database is Latin1_General_CS_AS. We only have Service Pack 3 at present but there is no listing for an update to LIKE for this particular situation in the bug fixes in Service Pack 4.

    Is it something to do with either of the following:

    1) The 'â' is a latin character based on another latin character 'a' and is therefore seen as 'a' in the SQL Server environment?

    2) There is a problem (bug) using too many ranges with the LIKE clause, i.e., [a-zA-Z0-9] is too complex for use with LIKE?

    Many thanks.

  • I have tried :

    IF (ISNUMERIC (SUBSTRING (@ID, 1, 5)) = 1) AND (SUBSTRING (@ID, 6, 7) LIKE @CharPattern collate Latin1_General_BIN)

    This returns the correct result in that it states the ID '40065*0â007d' is invalid and it will therefore be rejected.

    BUT, if you change the ID to a valid ID, e.g., '40065*00007d' it still says that it's invalid using the 'collate Latin1_General_BIN' which is incorrect as it is a valid ID.

     

  • DECLARE @ID CHAR(12)

    DECLARE @CharPattern CHAR (71)

    SELECT @ID = '40065*01007d'

    SELECT @CharPattern = '[-+*][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]'

    BEGIN

    IF (ISNUMERIC (SUBSTRING (@ID, 1, 5)) = 1) AND (SUBSTRING (@ID, 6, 7) LIKE @CharPattern collate Latin1_General_BIN )

    BEGIN

    SELECT @ID AS 'OK'

    END

    ELSE

    BEGIN

    SELECT @ID AS 'Not OK'

    END

    END

    Use the Collation as specified earlier, but increase the length of @CharPattern by 1 ( the final ] is being truncated )

     

     

  • Let's try an entirely different solution... it looks a bit strange in the beginning, but it works fine. There should be no problems with collation, and it makes rather easy to allow or disallow ANY character (like if you decide not to allow "Q" and "8" for some reason, or allow "é"). Maybe you won't like it, and maybe it won't have good enough performance for you - it's just an idea how to solve this problem, and it gives correct results.

    DECLARE @ID CHAR(12)

    DECLARE @ID_pattern CHAR(12)

    DECLARE @allowed VARCHAR(100)

    DECLARE @transform VARCHAR(100)

    DECLARE @pos INT

    SET @allowed =   'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789*+-'

    SET @transform = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA+++'

    SET @pos = 1

    SET @ID = '40065-00007d' /*'40065*0â007d'*/

    SET @ID_pattern = @ID

    WHILE @Pos <= LEN(@allowed)

     BEGIN

      SELECT @ID_pattern = REPLACE(@ID_pattern, SUBSTRING(@allowed, @pos, 1), SUBSTRING(@transform, @pos, 1))

      SELECT @pos = @pos + 1

     END

    -- SELECT @ID_pattern /* for testing */

    IF @ID_pattern = 'AAAAA+AAAAAA'

     BEGIN

     SELECT @ID AS 'OK'

     END

    ELSE

     BEGIN

     SELECT @ID AS 'Not OK'

     END

    EDIT : Oops, I re-read the original SQL and realized that the first part only allows numbers, not characters. Well, then you'd need to make a cross of both SQL, using ISNUMERIC for the first part and my approach for the second, or come up with yet another solution. BTW, ISNUMERIC is not safe, too - it allows charater 'e' where it can be interpreted as exponent, and a few other...

  • I looked at it once again and found a workaround, more reliable than using ISNUMERIC. Corrected SQL that should do the task:

    DECLARE @ID CHAR(12)

    DECLARE @ID_pattern CHAR(12)

    DECLARE @allowed VARCHAR(100)

    DECLARE @transform VARCHAR(100)

    DECLARE @pos INT

    SET @allowed =   'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789*+-'

    SET @transform = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9999999999+++'

    SET @pos = 1

    SET @ID = '40065-00007d'

    SET @ID_pattern = @ID

    WHILE @Pos <= LEN(@allowed)

     BEGIN

      SELECT @ID_pattern = REPLACE(@ID_pattern, SUBSTRING(@allowed, @pos, 1), SUBSTRING(@transform, @pos, 1))

      SELECT @pos = @pos + 1

     END

    SELECT @ID_pattern = LEFT(@ID_pattern,6)+ REPLACE(RIGHT(@ID_pattern,6),'9','A')/*this converts number placeholder to character placeholder in the last 6 places, effectively allowing both numbers and characters*/

    IF @ID_pattern = '99999+AAAAAA'

     BEGIN

     SELECT @ID AS 'OK'

     END

    ELSE

     BEGIN

     SELECT @ID AS 'Not OK'

     END

  • Many thanks for all your suggestions - the outcome I have decided to take is to add the 'collate Latin1_General_BIN' and increase the size of @CharPattern as I believe that this is the most performant method for performing this check.

    The original developer is about to have a crash course in QA!

    I hope that I can return the favour to you one day 😉

  • You're welcome 🙂

    just want to warn you - if I didn't overlook something, this ID will be reported as correct by the SQL you are using : '+00e5*00007d'.... not sure about what upper/lower case of the "e" does on a case sensitive DB, maybe you need to change it to "E". Anyway, this is the problem with ISNUMERIC ... ISNUMERIC(@text) = 1 does not mean that @text contains only numbers! It means rather "SQL Server would be able to convert @text to a number"... it can contain signs like + or -, scientific notation (characters "e" or "d) in certain places. To check whether all positions in the string are occupied by numbers, use either LIKE, or the abovementioned process of replacing.

  • That's a very good point Vladan - it will allow '400e5' and also '400d5' as I believe ISNUMERIC has limitations in it which are derived from a previous programming language (FORTRAN I think).

    I will therefore have 2 patters to match, one for the numeric portion of the ID and one for the character portion of the ID. That way, should any non-numeric data end up in the numeric portion of the ID, it will become invalid and be moved to an error table leaving the production database in a consistent state.

    Even having two LIKE statements still allows the function to run very efficiently.

    Many thanks for bringing that to my attention!

    The original developer is really going to hate me when I bring all these issues to him!

  • Do you want to explicitly check each portion? If not, you could just use one LIKE pattern to get your results:

    DECLARE @ID CHAR(12)

    DECLARE @CharPattern CHAR (96)

    ---SELECT @ID = '40065*0â007d'

    SELECT @ID = '40065*01007d'

    SELECT @CharPattern = '[0-9][0-9][0-9][0-9][0-9][-+*][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]'

    BEGIN

    IF ( @ID LIKE @CharPattern collate Latin1_General_BIN )

    BEGIN

    SELECT @ID AS 'OK'

    END

    ELSE

    BEGIN

    SELECT @ID AS 'Not OK'

    END

    END

  • Glenn,

    I believe that they do need to do them separately - the original code I posted was the 'fundamentals' of the function which were simplified to highlight the validation error.

    If they decide otherwise, your suggestion will go to the top of the pile 😉

    Many thanks.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply