GROUP BY vs DISTINCT -- SQL SERVER BUG ?

  • SQL SERVER BUG ?

    Is this is a bug or am I missing something...  I'm a fairly experienced SQL user and this one has me (and a few colleagues) puzzled.  I have seen some bug reports involving subqueries and aggregation, but I couldn't find anything like this.  Any help or pointers would be greatly appreciated.

    I am looking for a count of the occurrences of each distinct set of council and oldcouncil values from a larger list of values. 

    I initially used GROUP BY but the results were not what I expected.  In this case it appears that SQL Server returns the oldcouncil column twice rather than oldcouncil and council.  When I used DISTINCT I get the correct column values.    Both queries return the same number of rows.

    I haven't designed a trivial sample that reproduces the behavior, but it seems that the logic of the two queries is identical and they aren't that complex.  The execution plans appear to be identical. 

    I have a small clue to what's going on when I tried to run my very first attempt and used the function calls in the ORDER BY clause.  It gave me the message: Server: Msg 169, Level 15, State 1, Line 1

    A column has been specified more than once in the order by list. Columns in the order by list must be unique.

    (I tried using the same database on Oracle and I got the expected results in all cases.)

    -- Expected results from

    select Distinct

    X.OLDCOUNCIL, X.COUNCIL 

    from (

    SELECT

      sr.eid,

      dbo.getGeoValue( sr.eid, 'COUNCIL')  OLDCOUNCIL,

      dbo.getGeoValue( sr.eid, 'COUNCL04') COUNCIL

     FROM tickets sr

     WHERE

      sr.type_code = 'TCMFOOTW'

      AND sr.status_code NOT LIKE '%DUP%'

    ) X

    --group by X.OLDCOUNCIL, X.COUNCIL

    order by X.OLDCOUNCIL, X.COUNCIL

    -- Unexpected results from

    select --Distinct

    X.OLDCOUNCIL, X.COUNCIL

    from (

    SELECT

      sr.eid,

      dbo.getGeoValue( sr.eid, 'COUNCIL')  OLDCOUNCIL,

      dbo.getGeoValue( sr.eid, 'COUNCL04') COUNCIL

     FROM tickets sr

     WHERE

      sr.type_code = 'TCMFOOTW'

      AND sr.status_code NOT LIKE '%DUP%'

    ) X

    group by X.OLDCOUNCIL, X.COUNCIL

    order by X.OLDCOUNCIL, X.COUNCIL

    -- Very first attempt, seems logical to me

    SELECT

      dbo.getGeoValue( eid, 'COUNCIL') AS OLDCOUNCIL,

      dbo.getGeoValue( eid, 'COUNCL04') AS COUNCIL,

      count(0)

     FROM tickets

     WHERE

      type_code = 'TCMFOOTW'

      AND status_code NOT LIKE '%DUP%'

    group by dbo.getGeoValue( eid, 'COUNCIL'), dbo.getGeoValue( eid, 'COUNCL04')

    order by OLDCOUNCIL, COUNCIL

  • - Which sqlserver version is this being run on ?

    - Can you post ddl with some data ?

    - Can you post some (small) "correct" and "false" result sets ?

    - Does this only occur with large sets of data, or also with small sets ?

    - What's the result if you materialize the nested table expression ?

    - are there nulls involved ?

     

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Hey. Why are you using Group By? The only time you need to use Group by is when you are using Aggregate Functions like SUM() COUNT() MIN() or MAX().

    If your table looks like this:

    oldcouncil,newcouncil

    1,1

    1,2

    1,1

    2,1

    select distinct * would return

    1,1

    1,2

    2,1

    because 1,1 is repeated. Distinct only returns UNIQUE rows.

    if you use group by, without an aggregate function, it's the same result as not using group by. BUT! to do something like:

    SELECT count(*), oldcouncil, newcouncil you will need a group by telling SQL Server what groupings you want a count of.

    Read up more, and follow the examples from the SQL Books Online documentation.


    Julian Kuiters
    juliankuiters.id.au

  • Hi Peter,

    I don't think that Julian's claim is correct. For example

    select left (filename, 1) as drive from master.dbo.sysaltfiles

    group by left (filename, 1)

    order by 1 

    will yield the same result as

    select distinct left (filename, 1) as drive from master.dbo.sysaltfiles

    --group by left (filename, 1)

    order by 1

    If one omits the group by clause in the first example one will get one line for each file on the drive.

    So, if you say, both of your queries provide the same number of records what is the difference anyway? Obviously I do miss the point of your problem. It seems to have to do with the function you are using. SQL Server does not detect any difference when you call them with different string constants as you have tried unsuccesful even though the select will provide different return values of the function. Try to test a different set of function rather than controling their behaviour by string constants. In this case your attempt should work.

    Karsten

  • The advantage of using group-by is that it can use an index if available and suitable. 

    distinct is done by adding a sort-step before returning the data.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Thanks for the feedback so far.  I'll try to clarify the situation.

    We are running SQL Server 2000 Enterprise Edition SP3.  I used Query Analyzer to run these queries.

    The original GROUP BY query did have COUNT(0) to get the count of tickets for each distinct combination of council and old council districts.  I used DISTINCT to check my results with a logically equivalent query, but it doesn't allow using aggregates.

    Even though there are different results, the explain-plan is the same for DISTINCT as for GROUP BY. 

    I think SQL is confused about the column names due to the use of the UDF and a failure to take into account the column alias. That would explain the odd message when using the function call in the ORDER BY clause.  Also it would explain the results...

    GROUP BY returns

    1,1

    1,1

    1,1

    2,2

    2,2

    3,3

    ...

    DISTINCT returns

    1,11

    1,13

    1,2

    2,6

    2,7

    3,9

    ...

  • I get similar funny results using Enterprise Manager to run the query from the Open -> Query menu option.  OSQL too.

    I tried removing the order by clauses and that made no difference.

    So it's not a client presentation issue, though it really looks as if the first column is just being repeated.

  • Multiprocessor box ? try adding with MAXDOP  = 1 (ckeck BOL)

    (there is an issue with count(*) with sql2ksp3a)

     

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • the unexpected results are because you are returning 3 fields in the (from recordset) - these are only grouped by 2 fields, one row for each sr.eid is being returned.

    distinct will include EID in the group and hence the number of rows is less as the duplicate rows are removed.

    MVDBA

  • Let me emphasize that the number of rows returned is the same and the counts match in the GROUP BY example, the DISTINCT example, and in the temp table example.  Its the column values that don't seem correct.  I don't think Mike's explanation is consistent with these results.

    To be specific... The second column, new council district, seems to be a repeat of the first column, council district, value. 

    A colleague reversed the function calls and saw that it copied the council district rather than the new council district into the second column.

    The bug is essentially that the two columns have the same value when using GROUP BY, when all other logically similar queries show the proper values. 

  • DOH!

    just ignore everything i said - let me get more coffee and start again.

    MVDBA

  • Why does this query return an error message.  It is clear to me that I have not repeated any columns !  Really I haven't

    This is a clue to what is going on, but I don't see any reported problems quite like this.  Does it check the first 26 characters or so before assuming the two columns are identical?

    SELECT

      dbo.getGeoValue( eid, 'COUNCIL') AS OLDCOUNCIL,

      dbo.getGeoValue( eid, 'COUNCL04') AS COUNCIL,

      count(0)

     FROM tickets

     WHERE

      type_code = 'TCMFOOTW'

      AND status_code NOT LIKE '%DUP%'

    group by dbo.getGeoValue( eid, 'COUNCIL'), dbo.getGeoValue( eid, 'COUNCL04')

    order by dbo.getGeoValue( eid, 'COUNCIL'), dbo.getGeoValue( eid, 'COUNCL04')

    Return this error message:

    Server: Msg 169, Level 15, State 1, Line 1

    A column has been specified more than once in the order by list. Columns in the order by list must be unique.

  • why are you doing a count(0) ?

    MVDBA

  • This is exactly what I mentioned earlier. SQL Server does not detect differences in function calls when they differ only in string constants though select shows the correct and differing values. Try to use different functions rather than to control function behaviour by string constants.

    A similar thing happens when the output of a stored procedure is used as DTS datapump source. SQL Server does not check the outcoming columns by running the sp but by analyzing the code of the sp in a rather stupid manner. The first select statement it will find marks the set of column it think will be the output. So if you control behaviour of a function or a stored procedure by string constants using a case or if then else structure, SQL Server will not detect any difference in analyzing the code.

    It seems to me, these two issues are not only similar but have exact the same cause. SQL Server should not think it could guess the outcome of stored procedures or functions by analyzing the code in that arbitrary manner but it does.

    Karsten

  • I read a good explanation that COUNT(0) is a bit faster than COUNT(*), at least in Oracle and now I'm in the habit.  I think that it has to do with avoiding the check for nulls.

    It doesn't affect these results.

    It's a bit off-topic but do you know if there is a difference in SQL Server?

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply