By: Ken Adams
Sr. Business Intelligence Consultant at Key2 Consulting
A junk dimension is seen occasionally inside of data warehouses. This type of dimension can be thought of as a flag table, or a collection of attributes that have low-cardinality. This means that the values seen are not distinctive and are often duplicated. According to the site 1keydata.com, a junk dimension is defined as follows:
In data warehouse design, frequently we run into a situation where there are yes/no indicator fields in the source system. If we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues. A junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension.
On a recent project, I ran into a situation where our client wanted to calculate different types of exceptions. These exceptions indicated whether an employee was in compliance and authorized to work under federal law. The actual values were to be calculated as part of the data integration process and the result was either a Yes (“Y”) or a No (“N”). We landed on seven separate flags (or attributes) and a junk dimension seemed like a viable entity to handle the work.
The alternative in this situation would have been to create separate dimensions and have seven foreign keys linking the fact table to each dimension. This not only would have increased the amount of data stored in the fact table, but it also seemed like a maintenance nightmare and performance pitfall in the future. Had I taken this path, the ERD would have looked something like this:
A – ERD with seven individual dimensions
In this particular scenario, each dimension would have had two rows and only a handful of columns. Additionally, a query written against this model would require seven joins to return all of the attributes. The data inside one of the dimensions would have looked something like the following:
Flag1Key | IsFlag1Value |
1 | Y |
2 | N |
B – Sample data from narrow dimension
My solution was to create a single table and incorporate the various combinations of flags. The nature of the data made it difficult, if not impossible, to have this dimension be driven off the source data. As I mentioned, I needed to calculate whether a row met certain conditions in order to be an exception. This logic required that I evaluate multiple columns during the data integration and ultimately use that result to look up a foreign key. Had the data allowed me to do so, I could have cross-joined the source tables in order to produce the product of the attribute combinations. The junk dimension and fact relationship is as follows:
C – ERD with single junk dimension
Now a single join between the tables can produce the same product as the seven joins in the previous example. The data inside the dimension looks like this:
D – Sample data from junk dimension
Populating this dimension might be an issue. In this particular scenario, the underlying data did not present in a manner that would allow this dimension to be derived from the data. That being the case, I could have done the work in Excel or with a lengthy INSERT…UNION ALL. Once you get past a handful of columns, say three or four, the task becomes quite arduous and prone to error. Rather than trying to do the work manually, I used the following query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
WITH cteFlags AS ( SELECT 'N' AS Value UNION ALL SELECT 'Y' ) SELECT Flag0.Value, Flag1.Value, Flag2.Value, Flag3.Value, Flag4.Value, Flag5.Value, Flag6.Value FROM cteFlags Flag0 CROSS JOIN cteFlags Flag1 CROSS JOIN cteFlags Flag2 CROSS JOIN cteFlags Flag3 CROSS JOIN cteFlags Flag4 CROSS JOIN cteFlags Flag5 CROSS JOIN cteFlags Flag6 ; |
E – Script to create row junk dimension; 2^7 =128 rows
This simple CTE contains the possible values and the subsequent CROSS JOIN statements handle the possible combinations of those flags. You could add or remove levels based on the number of columns required. In my case, I took this code and created a stored procedure that truncated and repopulated the dimension table.
Now that you have a method of populating the junk dimension, it is now a matter of linking that to the fact table. This will likely be done during the data transformation phase where the data is moved from the source system into the star schema. The flags values will be derived from the transaction – or explicitly stated in data – and joined to the junk dimension which was created above. I want to show an example of the end-to-end process using the AdventureWorksDW database and some flag attributes that I fabricated.
DimOrderFlag Example:
From the AdventureWorksDW, I chose the fact table FactInternetSales as my test subject. Then I came up with a series of flags that seemed to make sense in conjunction with the sales data. Those flags are IsPromotionalOrderFlag, IsLargeOrderFlag, IsPastDueFlag, IsPastDueFlag, and IsShippedFlag. The possible values for each of the flags is either Yes (“Y”) or No (“N”). The logic to derive the flags is as follows:
1 2 3 4 5 6 7 |
SELECT IIF(PromotionKey > 1, 'Y', 'N') AS IsPromotionalOrderFlag ,IIF(ExtendedAmount > 2000, 'Y', 'N') AS IsLargeOrderFlag ,IIF(ShipDateKey > DueDateKey, 'Y', 'N') AS IsPastDueFlag ,IIF(ShipDateKey < 20140101, 'Y', 'N') AS IsShippedFlag ,* FROM [AdventureWorksDW].[dbo].[FactInternetSales] |
F – This query will become the source query in Integration Services.
Next, I want to create a secondary copy of the existing FactInternetSales table that includes a new foreign key reference to a new junk dimension. The junk dimension will be named DimOrderFlag and it will contain the flag values mentioned above as well as a surrogate key representing the combination of those values. I created the table FactInternetSalesTest – the DDL for that table is below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
CREATE TABLE [dbo].[FactInternetSalesTest]( [ProductKey] [int] NOT NULL, [OrderDateKey] [int] NOT NULL, [DueDateKey] [int] NOT NULL, [ShipDateKey] [int] NOT NULL, [CustomerKey] [int] NOT NULL, [PromotionKey] [int] NOT NULL, [CurrencyKey] [int] NOT NULL, [SalesTerritoryKey] [int] NOT NULL, [OrderFlagKey] [int] NOT NULL, [SalesOrderNumber] [nvarchar](20) NOT NULL, [SalesOrderLineNumber] [tinyint] NOT NULL, [RevisionNumber] [tinyint] NOT NULL, [OrderQuantity] [smallint] NOT NULL, [UnitPrice] [money] NOT NULL, [ExtendedAmount] [money] NOT NULL, [UnitPriceDiscountPct] [float] NOT NULL, [DiscountAmount] [float] NOT NULL, [ProductStandardCost] [money] NOT NULL, [TotalProductCost] [money] NOT NULL, [SalesAmount] [money] NOT NULL, [TaxAmt] [money] NOT NULL, [Freight] [money] NOT NULL, [CarrierTrackingNumber] [nvarchar](25) NULL, [CustomerPONumber] [nvarchar](25) NULL, [OrderDate] [datetime] NULL, [DueDate] [datetime] NULL, [ShipDate] [datetime] NULL, CONSTRAINT [PK_FactInternetSalesTest_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED ( [SalesOrderNumber] ASC, [SalesOrderLineNumber] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] |
G – The same schema as FactInternetSales with an additional column OrderFlagKey.
The related junk dimension will include a surrogate key named OrderFlagKey and the four flag columns. The column OrderFlagKey is an identity and primary key for the table and as shown earlier is a foreign key in the FactInternetSalesTest table. The DDL for DimOrderFlag is seen below:
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE TABLE [dbo].[DimOrderFlag]( [OrderFlagKey] [smallint] IDENTITY(1,1) NOT NULL, [IsShippedFlag] [char](1) NULL, [IsPastDueFlag] [char](1) NULL, [IsLargeOrderFlag] [char](1) NULL, [IsPromotionalOrderFlag] [char](1) NULL, [UnknownFlag] [char](1) NULL, CONSTRAINT [PK_DimOrderFlag_OrderFlagKey] PRIMARY KEY CLUSTERED ( [OrderFlagKey] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] |
H – Note the identity specification and primary key constraint on OrderFlagKey.
Now comes the population of DimOrderFlag. We previously saw the guts of a script for creating the permutations for seven flags. DimOrderFlag contains four flags so we can actually pare down CTE/CROSS JOIN to four levels and wrap it in the DML required for getting the output into the dimension table. The script for doing just that is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
--truncate dimension TRUNCATE TABLE dbo.DimOrderFlag; --check for the presence of unknown row; then insert IF (SELECT COUNT(*) FROM dbo.DimOrderFlag WHERE UnknownFlag = 'Y') = 0 BEGIN SET IDENTITY_INSERT dbo.DimOrderFlag ON; INSERT INTO dbo.DimOrderFlag(OrderFlagKey, IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag, UnknownFlag) SELECT -1, 'N', 'N', 'N', 'N', 'Y'; SET IDENTITY_INSERT dbo.DimOrderFlag OFF; END ;WITH cteFlags AS ( SELECT 'N' AS Value UNION ALL SELECT 'Y' ) INSERT INTO dbo.DimOrderFlag(IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag) SELECT Flag0.Value, Flag1.Value, Flag2.Value, Flag3.Value FROM cteFlags Flag0 CROSS JOIN cteFlags Flag1 CROSS JOIN cteFlags Flag2 CROSS JOIN cteFlags Flag3 ; |
– set the unknown flag column for all known rows
1 |
UPDATE dbo.DimOrderFlag SET UnknownFlag = 'N' WHERE OrderFlagKey gt; 0; |
I – This could easily be made into a stored procedure.
The result of that script looks like this:
J – A screenshot of the table data from inside Management Studio.
With these objects in place, you are ready to run a test load of the fact table FactInternetSalesTest. I put together a simple package in Integration Services that is fairly barebones. In fact, it is a full refresh of the FactInternetSalesTest table which is likely not acceptable in a production environment. I captured a few screenshots of the Integration Services package.
K – Control Flow (Execute SQL and Data Flow Task)
L – Variables Window
Here are the definitions for the variables seen above:
– DestinationTable = dbo.FactInternetSalesTest
– SourceTable = dbo.FactInternetSales
– SourceQuery =
1 2 3 4 5 6 |
SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate ,IIF(PromotionKey > 1, 'Y', 'N') AS IsPromotionalOrderFlag ,IIF(ExtendedAmount > 2000, 'Y', 'N') AS IsLargeOrderFlag ,IIF(ShipDateKey > DueDateKey, 'Y', 'N') AS IsPastDueFlag ,IIF(ShipDateKey < 20140101, 'Y', 'N') AS IsShippedFlag FROM " + @[User::SourceTable] |
M – Execute SQL Task
N – Execute SQL Task Query
O – Execute SQL Task Parameter Mapping
P – Data Flow Task
Q – OLEDB Source
R – Lookup Transformation
S – Lookup Transformation Relationships
T – Derived Column Task (NULL handling)
U – OLEDB Destination
Please note that this Integration Services’ package is a sample for illustration purposes only. It is highly unlikely that you would be pulling data from one fact table and landing that data in another fact table. Instead, your data will be extracted directly from the source or more likely than not from a set of staging tables. Additionally, your data transformation process might be incremental rather than a truncate and reload. This will depend on the size of the data.
My goal was to show you a method of populating a junk dimension as well as how that junk dimension can be incorporated into a data transformation process. Now that the DimOrderFlag and FactInternetSalesTest are populated, let us look at the data. Here is a sample of FactInternetSalesTest:
V – Note the OrderFlagKey
Now a single join between FactInternetSalesTest and DimOrderFlag exposes the flag attributes.
1 2 3 4 5 |
SELECT B.IsLargeOrderFlag, B.IsPastDueFlag, B.IsPromotionalOrderFlag, B.IsShippedFlag ,A.* FROM dbo.FactInternetSalesTest A INNER JOIN dbo.DimOrderFlag B ON A.OrderFlagKey = B.OrderFlagKey |
W – Report query and result set
I hope that you find this information useful and are able to make use of a junk dimension at some point down the road. My aim was the to present an option to easily populate a junk dimension – particularly one that has a product of 2^N. The number two being the possible values, Boolean or Yes/No, and N being the number of attributes. Furthermore, you should see how a junk dimension could be a clean method of storing those miscellaneous type attributes. It can help keep your data model simple and provide easier access to those attributes for the end user.
This was exactly the insight I was looking for. You made my week… Thanks!
Only variation I used was that instead of the Lookup and Derived Column transformations, I just handled all of that logic within my staging schema, where all of the necessary data already exists.
Thanks for reading Dave and so glad the post helped you out.
How do you pick from a Junk dimension when browsing the cube? What does the user pick to select a row of the Junk Dimension? Is it the Junk Dimension Key (1, 2, 3, etc.)? That would not make sense to the user.
Hi Ow. If you’re browsing your cube take the example of the Order Flag dimension from above. You should be able to drag the individual Flag columns (IsShipped, IsPromotionalOrder, etc) as filters. The column names in the junk dimension should be meaningful enough so that the user understands those selections when accessing the cube via Excel or other tool. I hope that helps and thanks so much for reading.
Hi Ken, do we need to maintain multiple ‘unknown’ flags where some of the flag attributes are applicable while others aren’t?
Sohi. 1st off thanks for reading and for your comment. This is up to you and the business owners. Typically, for a miscellaneous dimension you want to limit the range of values – bit is a great data type for flag dimensions as your are limited to 0 and 1. A char(1) works well too. You could certainly have Y, N and U as possible combinations and build it out that way. I’d suggest working with the business owners to create the flags as they make sense. If a single flag is unknown for a row of data, or not applicable, you could certainly map that to N. This is one of those “it depends” answers. Hope that helps.
Hi Ken, Thank you so much, was great this published.
I’ve a question, in my fact table I can update multiple times the flagkey in the fact table for the same row.
Cheers!
what do we do when we need to add new column to the junk dimension? how do we update the existing junk dimensions with the new dimension without changing the surrogate key?
Hi Camilla,
First off thanks for reading! Adding a new column really changes the grain of that dimension. If you look at the sample code at the beginning adding another flag doubles the number of rows in the junk dimension –
WITH cteFlags AS
(
SELECT ‘N’ AS Value
UNION ALL
SELECT ‘Y’
)
SELECT
Flag0.Value,
Flag1.Value,
Flag2.Value,
Flag3.Value,
Flag4.Value,
Flag5.Value,
Flag6.Value,
Flag7.Value
FROM cteFlags Flag0
CROSS JOIN cteFlags Flag1
CROSS JOIN cteFlags Flag2
CROSS JOIN cteFlags Flag3
CROSS JOIN cteFlags Flag4
CROSS JOIN cteFlags Flag5
CROSS JOIN cteFlags Flag6
CROSS JOIN cteFlags Flag7
;
–128 combinations (6 flags)
–256 combinations (7 flags)
Junk dimensions can be inflexible in this respect. Ideally, you could identify all of those “yes/no” type values in your incoming data and model the junk dimension to include all the combinations upfront. In this case, I think you’re looking at a repopulation of the junk dimension and rebuilding of the associated fact table(s).
-Ken