Blog Post

Scripts to Detect Data Warehouse Issues

,

Standards and best practices are like flu shots you take before you're infected; Database best practices protect your databases from bad things. But, we all make mistakes. It could be because we're on a time crunch, or we're lazy (which I'm guilty of by the way), or maybe it's part of being a developer.

Common mistakes include: tables without a primary key, column name problems, missing foreign keys, etc., This is where I love LTD's very own SQLCop. I can quickly go on with my database development and rely on SQLCop to detect the issues. It saves time and ensures that database standards are met.

However, there are some issues explicit to data warehouses that SQLCop doesn't look for. I list those issues below and provide scripts to detect them. I use these scripts in conjunction with SQLCop.

Detect tables in a data warehouse that aren't prefixed with either Dim or Fact:

Tables in a warehouse are generally prefixed with Dim and Fact for dimensions and fact respectively, to easily distinguish them.

  1. SELECT  [schema_name] = s.name ,
  2.         table_name = t.name
  3. FROM    sys.tables t
  4.         INNER JOIN sys.schemas s ON s.schema_id = t.schema_id
  5. WHERE   t.name NOT LIKE 'Dim%'
  6.         AND t.name NOT LIKE 'Fact%'
  7.         AND t.TYPE = 'U';

Find tables in a data warehouse that don't have a primary key:

Like in OLTP databases, all tables in a data warehouse also should have a primary key defined.

  1. SELECT  schema_name = SCHEMA_NAME(schema_id) ,
  2.         table_name = name
  3. FROM    sys.tables
  4. WHERE   OBJECTPROPERTY(OBJECT_ID, 'TableHasPrimaryKey') = 0
  5. ORDER BY SCHEMA_NAME(schema_id) ,
  6.         name;

 

Detect dimension tables with a composite primary key:

A composite primary key on a dimension table causes degraded performance. It is best to create a single column primary key.

  1. SELECT  c.TABLE_NAME ,
  2.         COUNT(*)
  3. FROM    INFORMATION_SCHEMA.TABLE_CONSTRAINTS pk ,
  4.         INFORMATION_SCHEMA.KEY_COLUMN_USAGE c
  5. WHERE   CONSTRAINT_TYPE = 'PRIMARY KEY'
  6.         AND c.TABLE_NAME = pk.TABLE_NAME
  7.         AND c.CONSTRAINT_NAME = pk.CONSTRAINT_NAME
  8.         AND c.TABLE_NAME LIKE 'Dim%'
  9. GROUP BY c.TABLE_NAME
  10. HAVING  COUNT(*) > 1

 

Detect dimension tables that don't have Identity column as a primary key:

Usually, surrogate key is made the primary key of the dimension table. Surrogate key is an auto generated Identity value.

  1. SELECT  dim_table = t.name ,
  2.         primary_key = c.name ,
  3.         c.is_identity
  4. FROM    sys.tables t
  5.         INNER JOIN sys.key_constraints kc ON t.OBJECT_ID = kc.parent_object_id
  6.         INNER JOIN sys.indexes i ON i.OBJECT_ID = kc.parent_object_id
  7.                                     AND i.type_desc = 'CLUSTERED'
  8.         INNER JOIN sys.index_columns ic ON ic.OBJECT_ID = kc.parent_object_id
  9.                                            AND ic.index_id = 1
  10.         INNER JOIN sys.columns c ON c.OBJECT_ID = t.OBJECT_ID
  11.                                     AND c.column_id = ic.column_id
  12. WHERE   t.TYPE = 'U'
  13.         AND t.name LIKE 'Dim%'
  14.         AND kc.type_desc = 'PRIMARY_KEY_CONSTRAINT'
  15.         AND c.is_identity = 0

 

Detect primary keys that don't follow the naming convention:

  1. SELECT  dim_table = t.name ,
  2.         primary_key = c.name
  3. FROM    sys.tables t
  4.         INNER JOIN sys.key_constraints kc ON t.OBJECT_ID = kc.parent_object_id
  5.         INNER JOIN sys.indexes i ON i.OBJECT_ID = kc.parent_object_id
  6.                                     AND i.type_desc = 'CLUSTERED'
  7.         INNER JOIN sys.index_columns ic ON ic.OBJECT_ID = kc.parent_object_id
  8.                                            AND ic.index_id = 1
  9.         INNER JOIN sys.columns c ON c.OBJECT_ID = t.OBJECT_ID
  10.                                     AND c.column_id = ic.column_id
  11. WHERE   t.TYPE = 'U'
  12.         AND t.name LIKE 'Dim%'
  13.         AND kc.type_desc = 'PRIMARY_KEY_CONSTRAINT'
  14.         AND c.name <> REPLACE(t.name, 'Dim', '') + 'Key'

 

Detect fact tables that have no foreign keys:

Without a foreign key, a fact table isn't really a fact table.

  1. SELECT table_name = t.name
  2.         , fk_count = COUNT(*)
  3.     FROM sys.tables t
  4.     INNER JOIN
  5.     sys.foreign_keys fk ON t.OBJECT_ID = fk.parent_object_id
  6.     WHERE  t.name LIKE 'fact%'
  7.     GROUP BY t.name
  8.     HAVING COUNT(*) < 1

 

Detect fact tables that have foreign key(s) to another fact table:

It's unlikely to have a fact table related to another fact table.

  1. SELECT  foreign_key = fk.name ,
  2.         child_table = t.name ,
  3.         parent_name = rt.name
  4. FROM    sys.foreign_keys fk
  5.         INNER JOIN sys.tables rt ON rt.object_id = fk.referenced_object_id
  6.         INNER JOIN sys.tables t ON t.object_id = fk.parent_object_id
  7. WHERE   rt.name LIKE 'Fact%'

 

Detect missing foreign key(s) in fact tables - Columns suffixed with Key, but don't have foreign key constraint:

I stole the following query from here posted by George Mastros, and replaced ID with Key to use it for data warehouse scenario.

  1. SELECT  C.TABLE_SCHEMA,C.TABLE_NAME,C.COLUMN_NAME
  2.     FROM    INFORMATION_SCHEMA.COLUMNS C          
  3.             INNER Join INFORMATION_SCHEMA.TABLES T            
  4.               ON C.TABLE_NAME = T.TABLE_NAME    
  5.               And T.TABLE_TYPE = 'Base Table'
  6.               AND T.TABLE_SCHEMA = C.TABLE_SCHEMA        
  7.             LEFT Join INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE U            
  8.               ON C.TABLE_NAME = U.TABLE_NAME            
  9.               And C.COLUMN_NAME = U.COLUMN_NAME
  10.               And U.TABLE_SCHEMA = C.TABLE_SCHEMA
  11.     WHERE   U.COLUMN_NAME IS Null          
  12.             And C.COLUMN_NAME Like '%Key'
  13.     ORDER BY C.TABLE_SCHEMA, C.TABLE_NAME, C.COLUMN_NAME

 

Results of above queries aren't always issues. They are just rare, you've to look at them closely and make sure there is a reason for each choice. Also, you may use different naming conventions that make these queries void. In that case, I hope you're able to alter them to your needs.

Follow me on Twitter! @SamuelVanga

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating