Post reply

Min Max Quartile

Kelvin Phayre

Ten Centuries

Points: 1358
More actions
December 9, 2008 at 5:03 am

#73124

I have a table with thousands of sales history records. Some of the records have skewed data i.e products being sold for £250k or £0.00 (company fudges etc). I want to exclude these from my min max.

Table format

product

qty

value

date

So I want to select by product the min (qty/value) and the max(qty/value) over the last 12 months. I would also like the average and median if possible. The min also must be greater then £0.00.

Please note that I have used the Quartile example here that return LowerRange and UpperRange and it returns the £250k price. I can't see what I'm doing wrong.
Garadin

One Orange Chip

Points: 29613
More actions
December 9, 2008 at 6:19 am

#909585

Kelvin Phayre (12/9/2008)

I have a table with thousands of sales history records. Some of the records have skewed data i.e products being sold for £250k or £0.00 (company fudges etc). I want to exclude these from my min max.

Table format

product

qty

value

date

So I want to select by product the min (qty/value) and the max(qty/value) over the last 12 months. I would also like the average and median if possible. The min also must be greater then £0.00.

Please note that I have used the Quartile example here that return LowerRange and UpperRange and it returns the £250k price. I can't see what I'm doing wrong.

And we can't see what your'e doing at all. Please provide sample data / table schema and if you'd like to know why your query doesn't work, post that too. If you just want a solution to your problem, sample data / table schema and expected results should be enough. Please read the post in my signature for an example of *how* to post proper sample data here.

Seth Phelabaum
Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]
Kelvin Phayre

Ten Centuries

Points: 1358
More actions
December 9, 2008 at 7:45 am

#909646

CREATE TABLE [dbo].[Products] (

[ProductId] char(20) NOT NULL,

[UnitPrice] float NULL

)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.65)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.65)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.65)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.65)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.78)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.55)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.55)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.78)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 1.1)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 1.02)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.9)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.9)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 1.85)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.78)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.78)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 236170.0)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.0)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.0)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.0)

GO

INSERT INTO [dbo].[Products]([ProductId], [UnitPrice])

VALUES('214700 ', 0.0)

GO

Then I copied the Quartile example below and get Lowerrange=0 and UpperRange=236170

SELECT MIN(UnitPrice) AS LowerRange , MAX(UnitPrice) AS UpperRange

FROM (

SELECT D1.ProductId

FROM

( SELECT TOP 25 PERCENT ProductId

FROM Products

ORDER BY UnitPrice ASC

) AS D1

UNION

SELECT D2.ProductId

FROM

( SELECT TOP 25 PERCENT ProductId

FROM Products

ORDER BY UnitPrice DESC

) AS D2

) AS DT RIGHT JOIN Products ON DT.ProductId = Products.ProductId
Garadin

One Orange Chip

Points: 29613
More actions
December 9, 2008 at 8:14 am

#909683

What are you trying to do with the TOP 25 Percent's?

I'm not quite getting what you want returned here, can you provide your expected results based on that sample data?

If all you want is the min/max without the outliers, you can just do something like this:

SELECT MIN(UnitPrice) AS LowerRange , MAX(UnitPrice) AS UpperRange
FROM Products
WHERE UnitPrice BETWEEN 0 AND 10000 -- Hard Coded value, this could be attained based on a top % of the table

Seth Phelabaum
Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]
Kelvin Phayre

Ten Centuries

Points: 1358
More actions
December 9, 2008 at 8:20 am

#909690

I'm trying to achieve what this article says Quartile will do.

http://qa.sqlservercentral.com/articles/Miscellaneous/sqlserverstatisticalfunctions/1255/

Specifically and I quote from the article.

"Another way of looking at extremes is to use something called the inter-quartile range

The inter-quartile range is worked out by ordering the values that you want to evaluate and splitting them into 4 equal parts. The inter-quartile range is the range of values represented by the middle two parts."

My problem is that the example appears to be returning the outer quartile and not the inner.
Kelvin Phayre

Ten Centuries

Points: 1358
More actions
December 9, 2008 at 8:47 am

#909712

Having looked closer at the example I have now realised I have miss read what it is doing. So to make things really simple. If I have the following sales prices for a product.

P1,0

P1,5

P1,10

P1,20

Where P1 is my product code I want min to return 5 and max to return 10.

Is this possible. Obviously there will be more than 4 records for P1, in my case I have thousands.
Garadin

One Orange Chip

Points: 29613
More actions
December 9, 2008 at 8:48 am

#909715

Ok, I thought that might be what you were attempting, but 25% chunks seemed a bit large, so I wanted to verify ;). David's examples come out of the Northwind database. In that database, ProductID refers to an actual IDENTITY primary key. In your sample table, ProductID is not unique, and thus the theory behind this code does not work. This will allow you to use that code:

[font="Courier New"]     ALTER TABLE Products
ADD ProductIdent INT IDENTITY(1,1)      SELECT MIN(UnitPrice) AS LowerRange , MAX(UnitPrice) AS UpperRange FROM (     SELECT D1.ProductIdent     FROM         ( SELECT TOP 25 PERCENT ProductIdent         FROM Products         ORDER BY UnitPrice ASC         ) AS D1     UNION     SELECT D2.ProductIdent     FROM         ( SELECT TOP 25 PERCENT ProductIdent         FROM Products         ORDER BY UnitPrice DESC         ) AS D2     ) AS DT RIGHT JOIN Products ON DT.ProductIdent = Products.ProductIdent
    WHERE DT.ProductIdent IS NULL[/font]

Seth Phelabaum
Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]
Kelvin Phayre

Ten Centuries

Points: 1358
More actions
December 9, 2008 at 9:05 am

#909727

I think this example is returning the min and max price across all products from the Products table. I want the min max for each product but I also want to exclude the extremes.
Garadin

One Orange Chip

Points: 29613
More actions
December 9, 2008 at 9:05 am

#909728

I think this will be more along the lines of what you're looking for to get all your products. Please note, that this may be very slow, but I don't know of a good way to do something like this in SQL2K.

[font="Courier New"]DECLARE @Products TABLE(
ProductIdent INT IDENTITY(1,1), ProductID VARCHAR(10), UnitPrice money) INSERT INTO @Products(ProductID, UnitPrice) SELECT 'P1',0 UNION ALL SELECT 'P1',5 UNION ALL SELECT 'P1',10 UNION ALL SELECT 'P1',20 UNION ALL SELECT 'P2',0 UNION ALL SELECT 'P2',50 UNION ALL SELECT 'P2',100 UNION ALL SELECT 'P2',200 SELECT ProductID, MIN(UnitPrice), MAX(UnitPrice) FROM @Products P WHERE ProductIDent NOT IN ( (SELECT TOP 25 PERCENT ProductIdent FROM @Products WHERE ProductID = P.ProductID ORDER BY UnitPrice)) AND ProductIDent NOT IN ( (SELECT TOP 25 PERCENT ProductIdent FROM @Products WHERE ProductID = P.ProductID ORDER BY UnitPrice DESC))
GROUP BY ProductID[/font]

Seth Phelabaum
Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]
Kelvin Phayre

Ten Centuries

Points: 1358
More actions
December 9, 2008 at 11:01 am

#909814

I put 1.5m rows into a table and then ran your query. I then stopped the query before it crashed the server.

I then took a different tac. I deleted from my SalesProduct table the top 25% and bottom 25% for all products and then just did a min max on what was left. This returned the correct result. Also was quite fast.

If you come up with a better way of discarding records from a sample then please send in the solution.

Thanks for your help.
ChrisM@Work

SSC Guru

Points: 186127
More actions
December 9, 2008 at 11:21 am

#909827

Kelvin Phayre (12/9/2008)

I put 1.5m rows into a table and then ran your query. I then stopped the query before it crashed the server.

I then took a different tac. I deleted from my SalesProduct table the top 25% and bottom 25% for all products and then just did a min max on what was left. This returned the correct result. Also was quite fast.

If you come up with a better way of discarding records from a sample then please send in the solution.

Thanks for your help.

This is a slight variation on Garadin's solution which is probably more performant (Thanks Seth for providing the excellent core code)

SELECT ProductID,
MIN(UnitPrice), MAX(UnitPrice) FROM @Products P LEFT JOIN (SELECT TOP 25 PERCENT ProductIdent, ProductID FROM @Products ORDER BY UnitPrice) h ON h.ProductIdent = P.ProductIdent AND h.ProductID = P.ProductID LEFT JOIN (SELECT TOP 25 PERCENT ProductIdent, ProductID FROM @Products ORDER BY UnitPrice DESC) l ON l.ProductIdent = P.ProductIdent AND l.ProductID = P.ProductID WHERE h.ProductIdent IS NULL AND l.ProductIdent IS NULL
GROUP BY ProductID

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Garadin

One Orange Chip

Points: 29613
More actions
December 9, 2008 at 11:46 am

#909839

Kelvin Phayre (12/9/2008)

I put 1.5m rows into a table and then ran your query. I then stopped the query before it crashed the server.

I then took a different tac. I deleted from my SalesProduct table the top 25% and bottom 25% for all products and then just did a min max on what was left. This returned the correct result. Also was quite fast.

If you come up with a better way of discarding records from a sample then please send in the solution.

Thanks for your help.

Heh, yeah, I was worried about that.

When you say you deleted the top and bottom 25% of your table, was that the top 25% of the records for each ProductID, or just the top 25% of the table. Because they are drastically different things.

Seth Phelabaum
Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]
Garadin

One Orange Chip

Points: 29613
More actions
December 9, 2008 at 11:51 am

#909845

Chris Morris (12/9/2008)

Kelvin Phayre (12/9/2008)

I put 1.5m rows into a table and then ran your query. I then stopped the query before it crashed the server.

I then took a different tac. I deleted from my SalesProduct table the top 25% and bottom 25% for all products and then just did a min max on what was left. This returned the correct result. Also was quite fast.

If you come up with a better way of discarding records from a sample then please send in the solution.

Thanks for your help.

This is a slight variation on Garadin's solution which is probably more performant (Thanks Seth for providing the excellent core code)

SELECT ProductID,
MIN(UnitPrice), MAX(UnitPrice) FROM @Products P LEFT JOIN (SELECT TOP 25 PERCENT ProductIdent, ProductID FROM @Products ORDER BY UnitPrice) h ON h.ProductIdent = P.ProductIdent AND h.ProductID = P.ProductID LEFT JOIN (SELECT TOP 25 PERCENT ProductIdent, ProductID FROM @Products ORDER BY UnitPrice DESC) l ON l.ProductIdent = P.ProductIdent AND l.ProductID = P.ProductID WHERE h.ProductIdent IS NULL AND l.ProductIdent IS NULL
GROUP BY ProductID

Chris,

Although it does seem like it might be more efficient due to the derived tables, the execution plans are virtually identical (once I add the aliases to your outer ProductID's). I actually had it this way and then went the other route. Check out this post from last week, particularly Jeff's post towards the bottom evaluating the 3 methods of doing this. (Left Join/NULL, Exists and IN)

http://qa.sqlservercentral.com/Forums/Topic611416-338-1.aspx

Seth Phelabaum
Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]
Garadin

One Orange Chip

Points: 29613
More actions
December 9, 2008 at 4:21 pm

#910006

Garadin (12/9/2008)

Chris,

Although it does seem like it might be more efficient due to the derived tables, the execution plans are virtually identical (once I add the aliases to your outer ProductID's). I actually had it this way and then went the other route. Check out this post from last week, particularly Jeff's post towards the bottom evaluating the 3 methods of doing this. (Left Join/NULL, Exists and IN)

http://qa.sqlservercentral.com/Forums/Topic611416-338-1.aspx

Which would have probably been true if I didn't have outer references in my IN's. :crazy: After testing, Chris is absolutely right, try his out, with my tests on my Dev box, his version completes in 34 seconds for 1.4M rows.

Seth Phelabaum
Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]
Kelvin Phayre

Ten Centuries

Points: 1358
More actions
December 10, 2008 at 8:37 am

#910423

I've run Chris' code and yes it runs real fast. Problem is that I have 944 unique product codes in the 1.5m records at the start but when I run the code it does not return a min max for 365 ProductID's. Please can you check your data and see if all products are accounted for?

Analyzing one of the missing products UnitPrice I have 731 records in my prices table(SalesProducts).

select UnitPrice, count(*) from SalesProducts where ProductID='123456'

group by UnitPrice

UnitPrice column2

------------ ----------

0.11 29

2.1 16

0.9 686

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply