Building a Data Warehouse: Populating and Using a Junk Dimension


By: Ken Adams
Sr. Business Intelligence Consultant at Key2 Consulting

A junk dimension is seen occasionally inside of data warehouses. This type of dimension can be thought of as a flag table, or a collection of attributes that have low-cardinality. This means that the values seen are not distinctive and are often duplicated.  According to the site 1keydata.com, a junk dimension is defined as follows:

In data warehouse design, frequently we run into a situation where there are yes/no indicator fields in the source system. If we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues. A junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension. 

On a recent project, I ran into a situation where our client wanted to calculate different types of exceptions. These exceptions indicated whether an employee was in compliance and authorized to work under federal law. The actual values were to be calculated as part of the data integration process and the result was either a Yes (“Y”) or a No (“N”). We landed on seven separate flags (or attributes) and a junk dimension seemed like a viable entity to handle the work.

The alternative in this situation would have been to create separate dimensions and have seven foreign keys linking the fact table to each dimension. This not only would have increased the amount of data stored in the fact table, but it also seemed like a maintenance nightmare and performance pitfall in the future. Had I taken this path, the ERD would have looked something like this:

A – ERD with seven individual dimensions

In this particular scenario, each dimension would have had two rows and only a handful of columns. Additionally, a query written against this model would require seven joins to return all of the attributes. The data inside one of the dimensions would have looked something like the following:

Flag1Key IsFlag1Value
1 Y
2 N

B – Sample data from narrow dimension

My solution was to create a single table and incorporate the various combinations of flags. The nature of the data made it difficult, if not impossible, to have this dimension be driven off the source data. As I mentioned, I needed to calculate whether a row met certain conditions in order to be an exception. This logic required that I evaluate multiple columns during the data integration and ultimately use that result to look up a foreign key. Had the data allowed me to do so, I could have cross-joined the source tables in order to produce the product of the attribute combinations. The junk dimension and fact relationship is as follows:

C – ERD with single junk dimension

Now a single join between the tables can produce the same product as the seven joins in the previous example. The data inside the dimension looks like this:

D – Sample data from junk dimension

Populating this dimension might be an issue. In this particular scenario, the underlying data did not present in a manner that would allow this dimension to be derived from the data. That being the case, I could have done the work in Excel or with a lengthy INSERT…UNION ALL. Once you get past a handful of columns, say three or four, the task becomes quite arduous and prone to error. Rather than trying to do the work manually, I used the following query:

E – Script to create row junk dimension; 2^7 =128 rows

This simple CTE contains the possible values and the subsequent CROSS JOIN statements handle the possible combinations of those flags. You could add or remove levels based on the number of columns required. In my case, I took this code and created a stored procedure that truncated and repopulated the dimension table.

Now that you have a method of populating the junk dimension, it is now a matter of linking that to the fact table. This will likely be done during the data transformation phase where the data is moved from the source system into the star schema.  The flags values will be derived from the transaction – or explicitly stated in data – and joined to the junk dimension which was created above. I want to show an example of the end-to-end process using the AdventureWorksDW database and some flag attributes that I fabricated.

DimOrderFlag Example:

From the AdventureWorksDW, I chose the fact table FactInternetSales as my test subject. Then I came up with a series of flags that seemed to make sense in conjunction with the sales data. Those flags are IsPromotionalOrderFlag, IsLargeOrderFlag, IsPastDueFlag, IsPastDueFlag, and IsShippedFlag. The possible values for each of the flags is either Yes (“Y”) or No (“N”). The logic to derive the flags is as follows:

F – This query will become the source query in Integration Services.

Next, I want to create a secondary copy of the existing FactInternetSales table that includes a new foreign key reference to a new junk dimension. The junk dimension will be named DimOrderFlag and it will contain the flag values mentioned above as well as a surrogate key representing the combination of those values. I created the table FactInternetSalesTest – the DDL for that table is below:

G – The same schema as FactInternetSales with an additional column OrderFlagKey.

The related junk dimension will include a surrogate key named OrderFlagKey and the four flag columns.  The column OrderFlagKey is an identity and primary key for the table and as shown earlier is a foreign key in the FactInternetSalesTest table. The DDL for DimOrderFlag is seen below:

H – Note the identity specification and primary key constraint on OrderFlagKey.

Now comes the population of DimOrderFlag. We previously saw the guts of a script for creating the permutations for seven flags.  DimOrderFlag contains four flags so we can actually pare down CTE/CROSS JOIN to four levels and wrap it in the DML required for getting the output into the dimension table. The script for doing just that is as follows:

– set the unknown flag column for all known rows

I – This could easily be made into a stored procedure.

The result of that script looks like this:

J – A screenshot of the table data from inside Management Studio.

With these objects in place, you are ready to run a test load of the fact table FactInternetSalesTest. I put together a simple package in Integration Services that is fairly barebones. In fact, it is a full refresh of the FactInternetSalesTest table which is likely not acceptable in a production environment. I captured a few screenshots of the Integration Services package.

K – Control Flow (Execute SQL and Data Flow Task)


L – Variables Window

Here are the definitions for the variables seen above:

– DestinationTable = dbo.FactInternetSalesTest

– SourceTable = dbo.FactInternetSales

– SourceQuery =

 

M – Execute SQL Task

N – Execute SQL Task Query

O – Execute SQL Task Parameter Mapping

P – Data Flow Task

Q – OLEDB Source

R – Lookup Transformation

S – Lookup Transformation Relationships

T – Derived Column Task (NULL handling)

U – OLEDB Destination

Please note that this Integration Services’ package is a sample for illustration purposes only. It is highly unlikely that you would be pulling data from one fact table and landing that data in another fact table. Instead, your data will be extracted directly from the source or more likely than not from a set of staging tables. Additionally, your data transformation process might be incremental rather than a truncate and reload. This will depend on the size of the data.

My goal was to show you a method of populating a junk dimension as well as how that junk dimension can be incorporated into a data transformation process. Now that the DimOrderFlag and FactInternetSalesTest are populated, let us look at the data. Here is a sample of FactInternetSalesTest:

V – Note the OrderFlagKey

Now a single join between FactInternetSalesTest and DimOrderFlag exposes the flag attributes.

 

W – Report query and result set

I hope that you find this information useful and are able to make use of a junk dimension at some point down the road.  My aim was the to present an option to easily populate a junk dimension – particularly one that has a product of 2^N.  The number two being the possible values, Boolean or Yes/No, and N being the number of attributes. Furthermore, you should see how a junk dimension could be a clean method of storing those miscellaneous type attributes. It can help keep your data model simple and provide easier access to those attributes for the end user.

Leave a Reply

Your email address will not be published.