How to query a Denormalised table

  • sushantkatte (2/20/2016)


@Jeff Moden: this is not the actual table that was given to me. I was given the question in this form and was asked to write queries looking at this structure:

Customer_id  Order_id  Order_line_id  Product_type  Order_Date
C1000        O1000     L100           Textbook      1/1/2016
C1000        O1000     L101           Ebook         1/1/2016
C1000        O1000     L102           Texbook       1/1/2016
C1000        O1001     L103           Subscription  1/2/2016
C1001        O1002     L104           Ebook         1/2/2016
C1001        O1002     L105           Ebook         1/2/2016
C1002        O1003     L106           Textbook      1/3/2016
C1002        O1003     L107           Textbook      1/3/2016

    I created the table from this form

    Understood. Turning this into a learning session...

First, it's never a good idea to store a date in a table as a character-based string.

It's also not a good idea to use NVARCHAR if you don't need to, because it doubles the memory, disk, backup, restore, index, and other space/time costs.

Also, variable-length strings take 2 extra bytes per row (to remember the length), so it's usually better to store short strings of consistent length as NCHAR or CHAR.
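A quick way to see the NVARCHAR doubling for yourself (the 2-byte length overhead is row metadata and won't show up in DATALENGTH, but the per-character doubling will):

--===== Quick demo: NVARCHAR stores 2 bytes per character, VARCHAR stores 1.
 SELECT VarcharBytes  = DATALENGTH(CAST('C1000'  AS VARCHAR(50)))  -- 5 bytes
       ,NVarcharBytes = DATALENGTH(CAST(N'C1000' AS NVARCHAR(50))) -- 10 bytes
;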

    Last but not least, it's almost always better if you don't allow NULLs in your table.

    We also need some other entries before and after Jan 2016 to make sure the queries work as intended.

    With all of that in mind, this is what I used for your table and test data.

--===== If the test table already exists, drop it to make reruns easier.
     IF OBJECT_ID('tempdb..#OrderTable','U') IS NOT NULL
        DROP TABLE #OrderTable
;
GO
--===== Create the test table in a nice, safe place that everyone has (TempDB)
 CREATE TABLE #OrderTable
        ( --Avoid NULLable columns anytime you can
         Customer_ID    CHAR(5)     NOT NULL --No need for NVARCHAR(50)
        ,Order_ID       CHAR(5)     NOT NULL --No need for NVARCHAR(50)
        ,Order_Line_ID  CHAR(5)     NOT NULL --No need for NVARCHAR(50)
        ,Product_Type   VARCHAR(50) NOT NULL --No need for NVARCHAR
        ,Order_Date     DATE        NOT NULL --Dates should never be stored as strings
        )
;
--===== Because the dates in the data are ambiguous, force the correct format.
    SET DATEFORMAT MDY
;
--===== Populate the table with test data
 INSERT #OrderTable
        (Customer_ID, Order_ID, Order_Line_ID, Product_Type, Order_Date)
 SELECT 'C1000','O0999','L096','Textbook'    ,'12/31/2015' UNION ALL --Added rows for Dec 2015
 SELECT 'C1000','O0999','L097','Ebook'       ,'12/31/2015' UNION ALL
 SELECT 'C1000','O0999','L098','Texbook'     ,'12/31/2015' UNION ALL
 SELECT 'C1000','O0999','L099','Subscription','12/31/2015' UNION ALL
 SELECT 'C1000','O1000','L100','Textbook'    ,'1/1/2016'   UNION ALL --Original rows for Jan 2016
 SELECT 'C1000','O1000','L101','Ebook'       ,'1/1/2016'   UNION ALL
 SELECT 'C1000','O1000','L102','Texbook'     ,'1/1/2016'   UNION ALL
 SELECT 'C1000','O1001','L103','Subscription','1/2/2016'   UNION ALL
 SELECT 'C1001','O1002','L104','Ebook'       ,'1/2/2016'   UNION ALL
 SELECT 'C1001','O1002','L105','Ebook'       ,'1/2/2016'   UNION ALL
 SELECT 'C1002','O1003','L106','Textbook'    ,'1/3/2016'   UNION ALL
 SELECT 'C1002','O1003','L107','Textbook'    ,'1/3/2016'   UNION ALL
 SELECT 'C1001','O1004','L108','Ebook'       ,'2/2/2016'   UNION ALL --Added rows for Feb 2016
 SELECT 'C1001','O1004','L109','Ebook'       ,'2/2/2016'
;

Here's my take on possible solutions to the problems you've been given. I strongly recommend that you break out "Books Online" and study how each function works. I also strongly recommend adopting the "bullet proof" code methodologies I've commented on in the code. Note that each problem below redeclares @pMonthDate, so run each problem's code as its own batch.

--========================================================================================
--      Problem #1
--      Total customers who ordered the products in Jan 2016.
--      (Modified to be date-flexible. Can't assume only Jan 2016 will be in the table)
--========================================================================================
--===== This could be a parameter in a stored procedure. That's what the little "p" is for.
DECLARE @pMonthDate DATE;
 SELECT @pMonthDate = '20160115' --ANY date in Jan 2016
;
--===== Convert the input date to the first of whatever month was given.
     -- This is part of "bullet proofing" code for dates >= 1900-01-01.
 SELECT @pMonthDate = DATEADD(mm,DATEDIFF(mm,0,@pMonthDate),0)
;
--===== Select the rows for the given month using the bullet proof >=/< method
     -- to handle future possible changes in datatype.
 SELECT MonthStartDate = @pMonthDate --Important to let them know what month it's for
       ,CustomerCount  = COUNT(DISTINCT Customer_ID)
   FROM #OrderTable
  WHERE Order_Date >= @pMonthDate
    AND Order_Date <  DATEADD(mm,1,@pMonthDate) --This makes your code bullet proof for datatype changes
;
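For a quick sanity check: against the test data above, this should return a CustomerCount of 3 for Jan 2016, since C1000, C1001, and C1002 all have orders in that month.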

--========================================================================================
--      Problem #2
--      Customers who placed more than one order in Jan 2016.
--      (Modified to be date-flexible. Can't assume only Jan 2016 will be in the table)
--========================================================================================
--===== This could be a parameter in a stored procedure. That's what the little "p" is for.
DECLARE @pMonthDate DATE;
 SELECT @pMonthDate = '20160115' --ANY date in Jan 2016
;
--===== Convert the input date to the first of whatever month was given.
     -- This is part of "bullet proofing" code for dates >= 1900-01-01.
 SELECT @pMonthDate = DATEADD(mm,DATEDIFF(mm,0,@pMonthDate),0)
;
--===== Select the rows for the given month using the bullet proof >=/< method
     -- to handle future possible changes in datatype.
 SELECT MonthStartDate = @pMonthDate --Important to let them know what month it's for
       ,Customer_ID
       ,OrderCount     = COUNT(DISTINCT Order_ID)
   FROM #OrderTable
  WHERE Order_Date >= @pMonthDate
    AND Order_Date <  DATEADD(mm,1,@pMonthDate) --This makes your code bullet proof for datatype changes
  GROUP BY Customer_ID
 HAVING COUNT(DISTINCT Order_ID) > 1
  ORDER BY Customer_ID --Optional for human consumption
;
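Sanity check: with the test data, only C1000 should be returned; it's the only customer with two distinct orders (O1000 and O1001) in Jan 2016.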

--========================================================================================
--      Problem #3
--      Number of orders that have Ebook as well as Textbook ordered together as a part
--      of the same order.
--      (No mention of date sensitivity for this problem.)
--========================================================================================
WITH cteOrderHasBoth AS
(
 SELECT Order_ID
   FROM #OrderTable
  WHERE Product_Type IN ('Ebook','Textbook') --This would need to be split if a parameter.
  GROUP BY Order_ID
 HAVING COUNT(DISTINCT Order_ID+Product_Type) = 2
)
 SELECT OrdersHavingBoth = COUNT(*)
   FROM cteOrderHasBoth
;
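As an aside (my own sketch, not part of the original reply), the same test can be written with conditional aggregation instead of the concatenation trick. Against the test data, both versions return 2, because orders O0999 and O1000 each contain both product types.

--===== Alternative sketch: flag each product type with MAX(CASE...) per order.
 SELECT OrdersHavingBoth = COUNT(*)
   FROM (
         SELECT Order_ID
           FROM #OrderTable
          GROUP BY Order_ID
         HAVING MAX(CASE WHEN Product_Type = 'Ebook'    THEN 1 ELSE 0 END) = 1
            AND MAX(CASE WHEN Product_Type = 'Textbook' THEN 1 ELSE 0 END) = 1
        ) both_types
;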

--========================================================================================
--      Problem #4
--      Number of new customers who purchased Chegg product for the first time in Jan 2016.
--      (Assuming that "Chegg" is the company name, not the product type).
--========================================================================================
--===== This could be a parameter in a stored procedure. That's what the little "p" is for.
DECLARE @pMonthDate DATE;
 SELECT @pMonthDate = '20160115' --ANY date in Jan 2016
;
--===== Convert the input date to the first of whatever month was given.
     -- This is part of "bullet proofing" code for dates >= 1900-01-01.
 SELECT @pMonthDate = DATEADD(mm,DATEDIFF(mm,0,@pMonthDate),0)
;
--===== Count customers with an order in the given month and no orders before it.
     -- The month-range predicate keeps customers whose first order is in a LATER
     -- month from being counted.
 SELECT MonthStartDate   = @pMonthDate --Important to let them know what month it's for
       ,NewCustomerCount = COUNT(DISTINCT Customer_ID)
   FROM #OrderTable
  WHERE Order_Date >= @pMonthDate
    AND Order_Date <  DATEADD(mm,1,@pMonthDate)
    AND Customer_ID NOT IN (SELECT Customer_ID FROM #OrderTable WHERE Order_Date < @pMonthDate)
;
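Sanity check: with the test data, this returns a NewCustomerCount of 2 (C1001 and C1002; C1000 is excluded by its Dec 2015 order). As an added illustration (my sketch, not part of the original reply), the same answer can be derived from each customer's first order date:

--===== Alternative sketch: a customer is "new" if their earliest order falls in the month.
 SELECT MonthStartDate   = @pMonthDate
       ,NewCustomerCount = COUNT(*)
   FROM (
         SELECT Customer_ID, FirstOrderDate = MIN(Order_Date)
           FROM #OrderTable
          GROUP BY Customer_ID
        ) f
  WHERE f.FirstOrderDate >= @pMonthDate
    AND f.FirstOrderDate <  DATEADD(mm,1,@pMonthDate)
;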

    Since you're just starting out, I also strongly recommend that you learn how to quickly build a million rows of constrained random data to really sharpen your T-SQL Programming skills. You can find out how to do that in the following articles.

    http://qa.sqlservercentral.com/articles/Data+Generation/87901/

    http://qa.sqlservercentral.com/articles/Test+Data/88964/

    Please post back if you have any questions. This could be the beginning of an exciting career for you because, once you got around to it, you did a pretty good job of posting what we needed to help you.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

• @Jeff Moden. Thank you very much for helping me learn how to write queries. I still have a lot to learn and this gives me a good start. Could you please also explain more about what "bullet proof" code methodologies are and where I can learn more about them? Once again, thank you for your kind replies.

  • sushantkatte (2/22/2016)


@Jeff Moden. Thank you very much for helping me learn how to write queries. I still have a lot to learn and this gives me a good start. Could you please also explain more about what "bullet proof" code methodologies are and where I can learn more about them? Once again, thank you for your kind replies.

    First, he created his date column as a date instead of a string. Next, look at the way he handles his dates. He's using them in a way that's safe when filtering for date ranges. The details are in the comments.

  • sushantkatte (2/22/2016)


@Jeff Moden. Thank you very much for helping me learn how to write queries. I still have a lot to learn and this gives me a good start. Could you please also explain more about what "bullet proof" code methodologies are and where I can learn more about them? Once again, thank you for your kind replies.

Ed is correct. I used a DATE column instead of using VARCHAR to store dates in, and I protected the user(s) by the way I handled the input and the criteria.

For example, the following code snippet from my examples protects the user by taking whatever date they provided and forcing it to the first of the month. It first converts the date to an integer that represents the number of month boundaries crossed since the first of January, 1900 (that's what the "0"s are), and then converts that back to a date, which neatly produces the first of the month for whatever date the user provided.

--===== Convert the input date to the first of whatever month was given.
     -- This is part of "bullet proofing" code for dates >= 1900-01-01.
 SELECT @pMonthDate = DATEADD(mm,DATEDIFF(mm,0,@pMonthDate),0)
;

    This next part is the best way to isolate the dates for a whole month.

  WHERE Order_Date >= @pMonthDate
    AND Order_Date <  DATEADD(mm,1,@pMonthDate) --This makes your code bullet proof for datatype changes
;

It checks order dates from the previously calculated first of the month up to and NOT including the first of the next month. Right now, your Order_Date column doesn't have any times in it, but that could change. If you use BETWEEN, you could miss rows with times on the last day of the month, or end up including rows dated midnight ("no time") on the first of the following month.

Temporally speaking, this is known as a "Closed/Open" range of dates: the start date (and time) is inclusive and the end date and time is exclusive (not included in the reporting). Even if someone uses a DATE column, which can have no times, the method I used is guaranteed NOT to fail and NOT to produce incorrect answers even if someone later changes the datatype to something with more precision.
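To make the pitfall concrete, here's a small sketch of my own (not from the original post) using a hypothetical DATETIME column:

--===== Sketch: why BETWEEN breaks once times creep into the column.
DECLARE @Demo TABLE (Order_Date DATETIME NOT NULL);
 INSERT @Demo VALUES ('2016-01-31T14:30:00'), ('2016-02-01T00:00:00');

--=== BETWEEN '20160101' AND '20160131' misses the 2:30 PM row on Jan 31.
 SELECT MissedByBetween = COUNT(*)
   FROM @Demo
  WHERE Order_Date BETWEEN '20160101' AND '20160131'; -- returns 0

--=== BETWEEN ... '20160201' wrongly includes the midnight Feb 1 row.
 SELECT WronglyIncluded = COUNT(*)
   FROM @Demo
  WHERE Order_Date BETWEEN '20160101' AND '20160201'; -- returns 2

--=== The Closed/Open method gets exactly the Jan rows.
 SELECT CorrectCount = COUNT(*)
   FROM @Demo
  WHERE Order_Date >= '20160101'
    AND Order_Date <  '20160201'; -- returns 1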

Another name for "bullet proofing" is "anticipating currently unknown modifications". A less kind name is "protecting users from their own future mistakes". It takes virtually no extra time to do, and people who are hiring and actually "in-the-know" highly cherish those who take the extra seconds to make their code bullet proof in this and other manners. The value of doing such a thing is intangible but extremely high if the worst should happen. And, no... it's not "pre-optimization" or "added scope" or any of the other negative names that code monkeys give such methods. It's good, rock-solid programming that the best Developers do without even having to think about it.

Of course, if the requirement were a report from any date to any date, I'd do the same trick with the dates but at the day level instead of the month level. I might also make sure that the starting date was less than the ending date, either by automatically swapping them or simply by raising an error if they weren't in the correct order, to protect the user from bad entries that might not report an error on their own. I'll also state that TRY/CATCH isn't always the best way to do things, especially if it does nothing other than rethrow the error (and a lot of people do such rethrows incorrectly).
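For reference, here's a sketch of the same trick at the day level (my own illustration, with hypothetical @pStartDate/@pEndDate parameters rather than anything from the original post):

--===== Sketch: strip any time portion by flooring to the day boundary,
     -- then filter with the same Closed/Open (>=/<) method.
DECLARE @pStartDate DATETIME = '2016-01-05T10:15:00'
       ,@pEndDate   DATETIME = '2016-01-20T23:59:00';

 SELECT @pStartDate = DATEADD(dd,DATEDIFF(dd,0,@pStartDate),0)   -- first instant of start day
       ,@pEndDate   = DATEADD(dd,DATEDIFF(dd,0,@pEndDate)+1,0)   -- first instant of day AFTER end day
;
 SELECT CustomerCount = COUNT(DISTINCT Customer_ID)
   FROM #OrderTable
  WHERE Order_Date >= @pStartDate
    AND Order_Date <  @pEndDate
;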

    --Jeff Moden



• Jeff mentioned checking or swapping dates so that a StartDate is always less than (or equal to) an EndDate. Starting with SQL Server 2008, this is actually easy to achieve with a table value constructor, and the swap can be integrated into the query that uses the dates via a CTE, as shown below:

declare @StartDate datetime = '20160221',
        @EndDate   datetime = '20160201';

select min(Date1) as StartDate, max(Date1) as EndDate
  from (values (@StartDate),(@EndDate)) dt (Date1);

with DateCheck as (
    select min(Date1) as StartDate, max(Date1) as EndDate
      from (values (@StartDate),(@EndDate)) dt (Date1)
)
select StartDate, EndDate from DateCheck;

-- Swap the dates in the variables
select @StartDate, @EndDate;

select @StartDate = min(Date1), @EndDate = max(Date1)
  from (values (@StartDate),(@EndDate)) dt (Date1);

select @StartDate, @EndDate;
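Tying this back to the earlier range queries, here's a sketch (my own, assuming the #OrderTable built earlier in the thread is still around) of feeding the swapped dates straight into a Closed/Open filter:

with DateCheck as (
    select min(Date1) as StartDate, max(Date1) as EndDate
      from (values (@StartDate),(@EndDate)) dt (Date1)
)
select CustomerCount = count(distinct o.Customer_ID)
  from #OrderTable o
 cross join DateCheck d
 where o.Order_Date >= d.StartDate
   and o.Order_Date <  dateadd(dd, 1, d.EndDate); -- makes the end date inclusive at the day level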

  • @sushantkatte

Kudos for seeking out assistance when you don't know the answers, and well done on finding the SQLServerCentral forums. Jeff, Lynn, GilaMonster and Koren are all experts in this field and give a lot of their time and effort to assisting newbies and old hands alike.

The first posts on this thread were a bit harsh! We all had to start somewhere, and in my experience the challenge with SQL is knowing how to ask the right questions, not necessarily knowing the right answers. Experience will hopefully teach you how to test hypotheses and validate that the code you write is doing what you expect (when it doesn't, it is more often than not because the data does not fit the given description of the domain).

    Good luck in your SQL career and please don't let this experience put you off posting queries here.

  • I'm intrigued with Jeff's approach to #3, using COUNT(DISTINCT Order_ID+Product_Type). After reading the initial post, I wrote this:

SELECT COUNT(DISTINCT oe.Order_ID)
  FROM Orders oe
 INNER JOIN Orders ot ON oe.Order_ID = ot.Order_ID
 WHERE oe.Product_type = 'Ebook'
   AND ot.Product_type = 'Textbook'

    Can anyone shed some light on the performance difference between using a CTE and GROUP BY rather than joining on a (hopefully) indexed column? Thanks!
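One general way to answer that empirically (not specific to this thread) is to compare logical reads and CPU time for the two versions, and to compare their actual execution plans:

--===== Compare the two approaches by their I/O and timing statistics.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the GROUP BY/CTE version here...
-- Run the self-join version here...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;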

  • Bullet-Proofing is an attempt to prevent PEBKAC errors: Problem Exists Between Keyboard And Chair. Art Linkletter wrote a book called Kids Say The Darndest Things, I think if he were a DBA he would've written End Users Do The Darndest Things.

As has been suggested, get a copy of SQL Server Express, or pony up $50US or so and get the Developer Edition: Developer has all of the features of Enterprise except that it is not licensed for production use. A lot of DBAs buy the latest DevEd to get experience with the new version, as it's rare that our employers jump onto the next-version bandwagon as soon as it comes out.

    Give a little thought as to what you want to do. Database development is an excellent field, but it will frequently require skill sets in other programming languages or systems, like a Visual Studio language or Reporting Services. You can also look at the track to being a database administrator, and DBAs need a good understanding of the T-SQL programming and development side. So there's lots of overlap. And in small shops, the database developer is frequently also the DBA.

In addition to the Stairway article series on this site, there are also a number of good ebooks at Redgate's site; most are free and available in PDF (or ePub for some). They make an excellent reference library.

    -----
    [font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]

• Normalized tables are usually smaller and have a smaller footprint because the data is divided vertically among many tables. This allows them to perform better, as they are often small enough to fit into the buffer cache.

Updates are very fast because the data to be updated is located in a single place and there are no duplicates.

Similarly, inserts are very fast because the data has to be inserted in a single place and does not have to be duplicated.

Selects are fast in cases where data has to be fetched from a single table, because normalized tables are usually small enough to fit into the buffer cache.

Because the data is not duplicated, there is less need for heavy-duty GROUP BY or DISTINCT queries.

Although there seems to be much in favor of normalized tables, with all the pros outlined above, the main cause of concern with fully normalized tables is that normalized data means joins between tables. That joining means read operations can suffer, because a single index cannot cover filtering and sorting across joined tables.

Now let's have a look at the pros and cons of a denormalized database design.

Pros and cons of denormalized database design

Denormalized databases fare well under a heavy read load, when the application is read-intensive, for the following reasons:

    The data is present in the same table so there is no need for any joins, hence the selects are very fast.

A single table with all the required data allows much more efficient index usage. If the columns are indexed properly, results can be filtered and sorted using the same index. In a normalized schema, since the data is spread out across different tables, this is not possible.

Although selects can be very fast on denormalized tables for the reasons mentioned above, the duplicated data means that updates and inserts become complex and costly.

Having said that, neither approach can be entirely dismissed, because a real-world application is going to have both read loads and write loads. Hence the correct way is to utilize both the normalized and denormalized approaches depending on the situation.

    Using normalized and denormalized approaches together.

    The most common way of mixing denormalized and normalized approaches is to duplicate related columns from one table into another table. Let me show you by example:

    Suppose you have a products table and an orders table.

The normalized approach would be to have only the product_id in the orders table, with all the other product-related information in the products table.

But that would make a query that filters by product_name and sorts by order_date inefficient, because the two columns are stored in different tables.

    In a fully normalized schema, such a query would be performed in the following manner:

SELECT product_name, order_date
  FROM orders
 INNER JOIN products ON products.product_id = orders.product_id
 WHERE product_name LIKE 'A%'
 ORDER BY order_date DESC

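To complete the example, here's a hedged sketch of the denormalized counterpart (my own illustration, using hypothetical column and index names): duplicating product_name into orders lets one index support both the filter and the sort.

-- Duplicate product_name into orders (denormalization), then index it
-- so the filter and the sort can use the same index.
ALTER TABLE orders ADD product_name VARCHAR(50) NOT NULL DEFAULT '';

CREATE INDEX IX_orders_product_name_order_date
    ON orders (product_name, order_date DESC);

SELECT product_name, order_date
  FROM orders              -- no join needed now
 WHERE product_name LIKE 'A%'
 ORDER BY order_date DESC;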

  • richardsonkane75 (2/26/2016)


    The data is present in the same table so there is no need for any joins, hence the selects are very fast.

    A single table with all the required data allows much more efficient index usage. If the columns are indexed properly, then results can be filtered and sorted by utilizing the same index. While in the case of a normalized table, since the data would be spread out in different tables, this would not be possible.

    Just my 2 cents... that's a common misperception for the very reason you stated...

    Although for reasons mentioned above selects can be very fast on denormalized tables, but because the data is duplicated, the updates and inserts become complex and costly.

Denormalized tables contain a relatively huge amount of duplicated data, which also means that rows in a denormalized table are frequently quite wide. That means much more data must be read to achieve the same results, which frequently makes queries on denormalized tables slower than their properly indexed and normalized counterparts.

    Because of the misperception, people frequently materialize normalized data in denormalized tables in what they incorrectly refer to as a "Data Warehouse", which also adds to the nightly jobs, uses more resources, requires additional index maintenance, and frequently adds to backup duration, disk space, and tape space.

Don't get me wrong. I'm all for properly denormalized tables that contain pre-aggregated data instead of recalculating the aggregations every time someone wants to see the data, but that's not normally how people build denormalized tables, because of the notion that they also want to dig into atomic details.

The proper way to do all of this denormalization and Data Warehouse stuff is to store the pre-aggregations in denormalized "report" tables and, when you want to drill down into the details, hit the "real" tables, instead of duplicating the real data (and grossly so) in denormalized tables.
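As a concrete illustration of that approach (my own sketch against the thread's #OrderTable, with a hypothetical summary table name):

--===== Build a small pre-aggregated summary by month and product type.
     -- Reports hit this table; drill-downs go back to the "real" table.
 SELECT MonthStartDate = DATEADD(mm,DATEDIFF(mm,0,Order_Date),0)
       ,Product_Type
       ,OrderCount     = COUNT(DISTINCT Order_ID)
   INTO #MonthlyProductSummary
   FROM #OrderTable
  GROUP BY DATEADD(mm,DATEDIFF(mm,0,Order_Date),0), Product_Type
;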

    --Jeff Moden


