trouble with aggegate functions

  • I'm working on a project where clients can require a number of types of employee training. Each type of training has a number of hours associated with it and can be grouped so a client can say they require (type1 OR type2 OR type3) AND (type4 OR type5) AND ...

    What i need is a query that will find the minimum number of hours an employee must complete for a specific client.

    One solution is to run:

    SELECT sum(a) FROM (SELECT min(hrs) a FROM trainreqs GROUP BY grp) list

    but putting a subselect in a from clause like that is bad for efficiency reasons...any other ideas on how to do this? everything else i've can think of to try yields the server throwing the error "Cannot perform an aggregate function on an expression containing an aggregate or a subquery."

    bonus points for helping me out for the next query: employee shifts are saved in another table with an hours field...given an employee and client, figure out how far along their training is

    Thanks in advance for any help!

  • More detail would help.  Can you post the tables DDL along with some sample data and results?

    Thanks

  • sure

    trainreqs table looks like:

    cliet-char(50) type-char(10) grp(int) hrs(int)

    -------------------------------------------------

    CLI1 type1 1 4

    CLI1 type2 1 3

    CLI1 type3 2 5

    CLI1 type4 2 2

    CLI1 type5 2 3

    CLI1 type6 3 4

    the min hrs required in each group for CLI1 are (easy enough to get with SELECT min(hrs) FROM trainreqs GROUP BY grp):

    group 1: 3

    group 2: 2

    group 3: 4

    then the sum of these is 9, which is the final number i need, but can't very well use SELECT sum(min(hrs)) FROM trainreqs GROUP BY grp...

    if this problem has an elegant solution, the other query (dealing with shifts and an employee's training progress) becomes easy

  • Start with this and let me know how it works:

    --declare @trainreqs table (

    create table dbo.trainreqs (

        client      char(50),

        typecode    char(10),

        grp         int,

        hrs         int

    );

    insert into dbo.trainreqs values ('CLI1', 'type1', 1, 4);

    insert into dbo.trainreqs values ('CLI1', 'type2', 1, 3);

    insert into dbo.trainreqs values ('CLI1', 'type3', 2, 5);

    insert into dbo.trainreqs values ('CLI1', 'type4', 2, 2);

    insert into dbo.trainreqs values ('CLI1', 'type5', 2, 3);

    insert into dbo.trainreqs values ('CLI1', 'type6', 3, 4);

    go

    create function dbo.mintrainreqs ()

    returns @minreqs table (

     grp int,

     minhrs int

    ) as begin

     insert into @minreqs

     select

      grp,

      min(hrs) as minhrs

     from

      dbo.trainreqs

     group by

      grp

     return

    end;

    go

    select sum(minhrs) from dbo.mintrainreqs();

    go

    drop table dbo.trainreqs;

    go

    drop function dbo.mintrainreqs;

    go

  • Just checking to see if you have an update?

  • Hi, thanks for your post, sorry about the delay in mine.

    i just have two concerns:

    1) how would i run this type of function from a query? for example, if i wanted all clients and their respective number of required hours?

    2) i'm curious if you have any idea how this solution's speed compares to the query:

    SELECT sum(a) FROM (SELECT min(hrs) a FROM trainreqs GROUP BY grp) list

    they both run two select statements.. but i don't know if ms sql server optimizes for functions better than nested querys?

  • Sounds like you are adding requirments to your original request.  Can you provide more details as to what you are asking?  It also sounds like you need to expand you sample data.

  • I took the table def supplied by Lynn and created a primary key on (client, typecode, grp). That at least turned a table scan into a clustered index scan. Then I created a view from the following query:

    select Client, grp, Min(hrs) as MinHrs, Max(hrs) as MaxHrs, Avg(hrs) as AvgHrs
    from trainreqs
    group by client, grp
    

    This gives you the info (plus more for future needs) and it greatly simplifies your outside query. However, it provides this info for all clients and you will probably only want it for specific clients. The best answer in that case is to use it as a derived table for your outside query with the addition of a "where client = ..." clause so records for other clients aren't processed. This brings us pretty much back to your original query, with the slight modification of turning the subquery into a derived table.

    select client, Sum(MinHrs)
    from (select Client, grp, Min(hrs) as MinHrs, Max(hrs) as MaxHrs, Avg(hrs) as AvgHrs
        from trainreqs
        where client = @Client
        group by client, grp) as AggData
    group by client;
    

    It may look ugly, but only the SELECT that creates the derived table is actually hitting your physical table. The outside query only selects from the result set of the inside query.

    There are various ways to simplify the query (functions or views) but I really don't think you're going to be able to squeeze much more performance out of it.

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

  • Thanks Tomm ...but that's what i was afraid of...

    as i said in my original post, i'd rather not use a subselect in my from clause, but if that's the only way this is possible, then i consider slow performance is much better than no performance.

    more details, as requested by lynn:

    i have a complicated SQL query that will return a list of numbers i want to add up, but can't use "SELECT sum(foo) ..." because the numbers are yielded by aggregate functions and it is illegal to perform an aggregate on another aggregate or on a subquery.

    This needs to be a query which i can plug into an update statement to set the value of a calculated field, IE:

    UPDATE orientation_table SET ...,hours_required=(foodoo),...

    the orientation_table is a temp table created by a stored procedure which will hold a record for every orientation each employee begins...it will include lots of fields, the only two of which giving me a hard time are the min # of hours the client requires, and # of hours the employee has left to complete.

    clients can set up requirements using any combination of the following options:

    a group where all requirements must be filled

    a group that will contain subgroups of requirements, where everything from at least one subgroup must be met.

    a group that will contain subgroups of requirements, where at least one item in each subgroup must be met.

    where requirement types contain information like how many hours must be worked in order to fulfill the requirement.

    confusing enough?

    here's an example:

    GROUP1 [all these are required]

    {type1, type2}

    GROUP2 [every requirement from at least one subgroup is required]

    {

    group2_a:{type3, type4}

    group2_b:{type5, type6}

    }

    GROUP3 [at least one requirement from every subgroup is required]

    {

    group3_a:{type7, type8}

    group3_b:{type9, typeA}

    }

    so to have a full orientation with this setup, an employee must complete the following:

    (

    type1 && type2

    ) &&

    (

    (type3 && type4) ||

    (type5 && type6)

    ) && (

    (type7 || type8) &&

    (type9 || typeA)

    )

    ideally, i'd be able to do the following:

    UPDATE table_of_all_orientations SET [...other fields...],

    hours_remaining=(

    --SUM of hours left for all non-grouped requirements

    SUM(

    SELECT

    MAX((hours_required_by_type-SUM(hours_employee_completed_at_type)),0)

    FROM requirements_table,shift_table

    WHERE [foo linking requirements, shifts, and the table being updated]

    GROUP BY requirements_table.requirement_type

    )

    +

    --the SUM(the min hours left from each group where one requirement from every group is required)

    [foo]

    +

    --the MIN(the sum hours left from each group where every requirment from one group is required)

    [foo]

    )

    obviously this would take a lot more code to get the subselects returning accurate results, but the underlying problem is that i cannot perform an aggregate on a subselect statement or another aggregate.

    another solution would be to use:

    hours_remaining=(

    SELECT sum(hours_to_sum) FROM (

    (SELECT (case (hours_required_by_type-sum(hours_employee_completed))<0) then 0 else (hours_required_by_type-sum(hours_employee_completed))

    FROM requirements_table,shift_table

    WHERE [foo]

    GROUP BY requirements_table.requirement_type

    )

    ) as list

    but i've heard putting subselects inside a FROM clause like that kills efficiency (as the optimizer has difficulty working with things like that)

    if all that made sense, congrats...any ideas how this can be optimized with a more efficient query?

    thanks to all!

  • Actually, this clears up a lot -- if nothing else, it makes me more appreciative of the scope of your problem.

    You say your "underlying problem is that i cannot perform an aggregate on a subselect statement or another aggregate." Well, you can if you turn the subquery into a derived table like in my previous post. However, this is not a big help if you need to use the results in several places.

    This is where a table variable comes in handy. When you need the results of a complicated or resource-consuming query in more than one area of your main query, read the results into a table variable so the subqueries hit it rather the the physical db. The challenge then is to design the result set so it contains all the info in the forms (aggregate subtotals, etc) you will need for each subquery.

    Have fun!

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply