By: Ken Adams

Sr. Business Intelligence Consultant at Key2 Consulting

A junk dimension is seen occasionally inside of data warehouses. This type of dimension can be thought of as a flag table, or a collection of attributes that have low-cardinality. This means that the values seen are not distinctive and are often duplicated. According to the site 1keydata.com, a junk dimension is defined as follows:

In data warehouse design, frequently we run into a situation where there are yes/no indicator fields in the source system. If we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues. A junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension.

On a recent project, I ran into a situation where our client wanted to calculate different types of exceptions. These exceptions indicated whether an employee was in compliance and authorized to work under federal law. The actual values were to be calculated as part of the data integration process and the result was either a Yes (“Y”) or a No (“N”). We landed on seven separate flags (or attributes) and a junk dimension seemed like a viable entity to handle the work.

The alternative in this situation would have been to create separate dimensions and have seven foreign keys linking the fact table to each dimension. This not only would have increased the amount of data stored in the fact table, but it also seemed like a maintenance nightmare and performance pitfall in the future. Had I taken this path, the ERD would have looked something like this:

A – ERD with seven individual dimensions

In this particular scenario, each dimension would have had two rows and only a handful of columns. Additionally, a query written against this model would require seven joins to return all of the attributes. The data inside one of the dimensions would have looked something like the following:

Flag1Key	IsFlag1Value
1	Y
2	N

B – Sample data from narrow dimension

My solution was to create a single table and incorporate the various combinations of flags. The nature of the data made it difficult, if not impossible, to have this dimension be driven off the source data. As I mentioned, I needed to calculate whether a row met certain conditions in order to be an exception. This logic required that I evaluate multiple columns during the data integration and ultimately use that result to look up a foreign key. Had the data allowed me to do so, I could have cross-joined the source tables in order to produce the product of the attribute combinations. The junk dimension and fact relationship is as follows:

C – ERD with single junk dimension

Now a single join between the tables can produce the same product as the seven joins in the previous example. The data inside the dimension looks like this:

D – Sample data from junk dimension

Populating this dimension might be an issue. In this particular scenario, the underlying data did not present in a manner that would allow this dimension to be derived from the data. That being the case, I could have done the work in Excel or with a lengthy INSERT…UNION ALL. Once you get past a handful of columns, say three or four, the task becomes quite arduous and prone to error. Rather than trying to do the work manually, I used the following query:

WITH cteFlags AS
(
SELECT 'N' AS Value
UNION ALL
SELECT 'Y'
)
SELECT
Flag0.Value,
Flag1.Value,
Flag2.Value,
Flag3.Value,
Flag4.Value,
Flag5.Value,
Flag6.Value
FROM cteFlags Flag0
CROSS JOIN cteFlags Flag1
CROSS JOIN cteFlags Flag2
CROSS JOIN cteFlags Flag3
CROSS JOIN cteFlags Flag4
CROSS JOIN cteFlags Flag5
CROSS JOIN cteFlags Flag6
;

WITH cteFlags AS

(

SELECT 'N' AS Value

UNION ALL

SELECT 'Y'

)

SELECT

Flag0.Value,

Flag1.Value,

Flag2.Value,

Flag3.Value,

Flag4.Value,

Flag5.Value,

Flag6.Value

FROM cteFlags Flag0

CROSS JOIN cteFlags Flag1

CROSS JOIN cteFlags Flag2

CROSS JOIN cteFlags Flag3

CROSS JOIN cteFlags Flag4

CROSS JOIN cteFlags Flag5

CROSS JOIN cteFlags Flag6

;

E – Script to create row junk dimension; 2^7 =128 rows

This simple CTE contains the possible values and the subsequent CROSS JOIN statements handle the possible combinations of those flags. You could add or remove levels based on the number of columns required. In my case, I took this code and created a stored procedure that truncated and repopulated the dimension table.

Now that you have a method of populating the junk dimension, it is now a matter of linking that to the fact table. This will likely be done during the data transformation phase where the data is moved from the source system into the star schema. The flags values will be derived from the transaction – or explicitly stated in data – and joined to the junk dimension which was created above. I want to show an example of the end-to-end process using the AdventureWorksDW database and some flag attributes that I fabricated.

DimOrderFlag Example:

From the AdventureWorksDW, I chose the fact table FactInternetSales as my test subject. Then I came up with a series of flags that seemed to make sense in conjunction with the sales data. Those flags are IsPromotionalOrderFlag, IsLargeOrderFlag, IsPastDueFlag, IsPastDueFlag, and IsShippedFlag. The possible values for each of the flags is either Yes (“Y”) or No (“N”). The logic to derive the flags is as follows:

SELECT
IIF(PromotionKey &gt; 1, 'Y', 'N') AS IsPromotionalOrderFlag
,IIF(ExtendedAmount &gt; 2000, 'Y', 'N') AS IsLargeOrderFlag
,IIF(ShipDateKey &gt; DueDateKey, 'Y', 'N') AS IsPastDueFlag
,IIF(ShipDateKey &lt; 20140101, 'Y', 'N') AS IsShippedFlag
,*
FROM [AdventureWorksDW].[dbo].[FactInternetSales]

SELECT

IIF(PromotionKey > 1, 'Y', 'N') AS IsPromotionalOrderFlag

,IIF(ExtendedAmount > 2000, 'Y', 'N') AS IsLargeOrderFlag

,IIF(ShipDateKey > DueDateKey, 'Y', 'N') AS IsPastDueFlag

,IIF(ShipDateKey < 20140101, 'Y', 'N') AS IsShippedFlag

FROM [AdventureWorksDW].[dbo].[FactInternetSales]

F – This query will become the source query in Integration Services.

Next, I want to create a secondary copy of the existing FactInternetSales table that includes a new foreign key reference to a new junk dimension. The junk dimension will be named DimOrderFlag and it will contain the flag values mentioned above as well as a surrogate key representing the combination of those values. I created the table FactInternetSalesTest – the DDL for that table is below:

CREATE TABLE [dbo].[FactInternetSalesTest](
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[DueDateKey] [int] NOT NULL,
[ShipDateKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[SalesTerritoryKey] [int] NOT NULL,
[OrderFlagKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [tinyint] NOT NULL,
[RevisionNumber] [tinyint] NOT NULL,
[OrderQuantity] [smallint] NOT NULL,
[UnitPrice] [money] NOT NULL,
[ExtendedAmount] [money] NOT NULL,
[UnitPriceDiscountPct] [float] NOT NULL,
[DiscountAmount] [float] NOT NULL,
[ProductStandardCost] [money] NOT NULL,
[TotalProductCost] [money] NOT NULL,
[SalesAmount] [money] NOT NULL,
[TaxAmt] [money] NOT NULL,
[Freight] [money] NOT NULL,
[CarrierTrackingNumber] [nvarchar](25) NULL,
[CustomerPONumber] [nvarchar](25) NULL,
[OrderDate] [datetime] NULL,
[DueDate] [datetime] NULL,
[ShipDate] [datetime] NULL,
CONSTRAINT [PK_FactInternetSalesTest_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED
(
[SalesOrderNumber] ASC,
[SalesOrderLineNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[FactInternetSalesTest](

[ProductKey] [int] NOT NULL,

[OrderDateKey] [int] NOT NULL,

[DueDateKey] [int] NOT NULL,

[ShipDateKey] [int] NOT NULL,

[CustomerKey] [int] NOT NULL,

[PromotionKey] [int] NOT NULL,

[CurrencyKey] [int] NOT NULL,

[SalesTerritoryKey] [int] NOT NULL,

[OrderFlagKey] [int] NOT NULL,

[SalesOrderNumber] [nvarchar](20) NOT NULL,

[SalesOrderLineNumber] [tinyint] NOT NULL,

[RevisionNumber] [tinyint] NOT NULL,

[OrderQuantity] [smallint] NOT NULL,

[UnitPrice] [money] NOT NULL,

[ExtendedAmount] [money] NOT NULL,

[UnitPriceDiscountPct] [float] NOT NULL,

[DiscountAmount] [float] NOT NULL,

[ProductStandardCost] [money] NOT NULL,

[TotalProductCost] [money] NOT NULL,

[SalesAmount] [money] NOT NULL,

[TaxAmt] [money] NOT NULL,

[Freight] [money] NOT NULL,

[CarrierTrackingNumber] [nvarchar](25) NULL,

[CustomerPONumber] [nvarchar](25) NULL,

[OrderDate] [datetime] NULL,

[DueDate] [datetime] NULL,

[ShipDate] [datetime] NULL,

CONSTRAINT [PK_FactInternetSalesTest_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED

(

[SalesOrderNumber] ASC,

[SalesOrderLineNumber] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

G – The same schema as FactInternetSales with an additional column OrderFlagKey.

The related junk dimension will include a surrogate key named OrderFlagKey and the four flag columns. The column OrderFlagKey is an identity and primary key for the table and as shown earlier is a foreign key in the FactInternetSalesTest table. The DDL for DimOrderFlag is seen below:

CREATE TABLE [dbo].[DimOrderFlag](
[OrderFlagKey] [smallint] IDENTITY(1,1) NOT NULL,
[IsShippedFlag] [char](1) NULL,
[IsPastDueFlag] [char](1) NULL,
[IsLargeOrderFlag] [char](1) NULL,
[IsPromotionalOrderFlag] [char](1) NULL,
[UnknownFlag] [char](1) NULL,
CONSTRAINT [PK_DimOrderFlag_OrderFlagKey] PRIMARY KEY CLUSTERED
(
[OrderFlagKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[DimOrderFlag](

[OrderFlagKey] [smallint] IDENTITY(1,1) NOT NULL,

[IsShippedFlag] [char](1) NULL,

[IsPastDueFlag] [char](1) NULL,

[IsLargeOrderFlag] [char](1) NULL,

[IsPromotionalOrderFlag] [char](1) NULL,

[UnknownFlag] [char](1) NULL,

CONSTRAINT [PK_DimOrderFlag_OrderFlagKey] PRIMARY KEY CLUSTERED

(

[OrderFlagKey] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

H – Note the identity specification and primary key constraint on OrderFlagKey.

Now comes the population of DimOrderFlag. We previously saw the guts of a script for creating the permutations for seven flags. DimOrderFlag contains four flags so we can actually pare down CTE/CROSS JOIN to four levels and wrap it in the DML required for getting the output into the dimension table. The script for doing just that is as follows:

--truncate dimension
TRUNCATE TABLE dbo.DimOrderFlag;
--check for the presence of unknown row; then insert
IF (SELECT COUNT(*) FROM dbo.DimOrderFlag WHERE UnknownFlag = 'Y') = 0
BEGIN
SET IDENTITY_INSERT dbo.DimOrderFlag ON;
INSERT INTO dbo.DimOrderFlag(OrderFlagKey, IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag, UnknownFlag)
SELECT -1, 'N', 'N', 'N', 'N', 'Y';
SET IDENTITY_INSERT dbo.DimOrderFlag OFF;
END

;WITH cteFlags AS
(
SELECT 'N' AS Value
UNION ALL
SELECT 'Y'
)
INSERT INTO dbo.DimOrderFlag(IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag)
SELECT
Flag0.Value,
Flag1.Value,
Flag2.Value,
Flag3.Value
FROM
cteFlags Flag0
CROSS JOIN
cteFlags Flag1
CROSS JOIN
cteFlags Flag2
CROSS JOIN
cteFlags Flag3
;

--truncate dimension

TRUNCATE TABLE dbo.DimOrderFlag;

--check for the presence of unknown row; then insert

IF (SELECT COUNT(*) FROM dbo.DimOrderFlag WHERE UnknownFlag = 'Y') = 0

BEGIN

SET IDENTITY_INSERT dbo.DimOrderFlag ON;

INSERT INTO dbo.DimOrderFlag(OrderFlagKey, IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag, UnknownFlag)

SELECT -1, 'N', 'N', 'N', 'N', 'Y';

SET IDENTITY_INSERT dbo.DimOrderFlag OFF;

END

;WITH cteFlags AS

(

SELECT 'N' AS Value

UNION ALL

SELECT 'Y'

)

INSERT INTO dbo.DimOrderFlag(IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag)

SELECT

Flag0.Value,

Flag1.Value,

Flag2.Value,

Flag3.Value

FROM

cteFlags Flag0

CROSS JOIN

cteFlags Flag1

CROSS JOIN

cteFlags Flag2

CROSS JOIN

cteFlags Flag3

;

– set the unknown flag column for all known rows

UPDATE dbo.DimOrderFlag SET UnknownFlag = 'N' WHERE OrderFlagKey gt; 0;

1	UPDATE dbo.DimOrderFlag SET UnknownFlag = 'N' WHERE OrderFlagKey gt; 0;

I – This could easily be made into a stored procedure.

The result of that script looks like this:

J – A screenshot of the table data from inside Management Studio.

With these objects in place, you are ready to run a test load of the fact table FactInternetSalesTest. I put together a simple package in Integration Services that is fairly barebones. In fact, it is a full refresh of the FactInternetSalesTest table which is likely not acceptable in a production environment. I captured a few screenshots of the Integration Services package.

K – Control Flow (Execute SQL and Data Flow Task)

L – Variables Window

Here are the definitions for the variables seen above:

– DestinationTable = dbo.FactInternetSalesTest

– SourceTable = dbo.FactInternetSales

– SourceQuery =

SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
,IIF(PromotionKey &gt; 1, 'Y', 'N') AS IsPromotionalOrderFlag
,IIF(ExtendedAmount &gt; 2000, 'Y', 'N') AS IsLargeOrderFlag
,IIF(ShipDateKey &gt; DueDateKey, 'Y', 'N') AS IsPastDueFlag
,IIF(ShipDateKey &lt; 20140101, 'Y', 'N') AS IsShippedFlag
FROM " + @[User::SourceTable]

SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate

,IIF(PromotionKey > 1, 'Y', 'N') AS IsPromotionalOrderFlag

,IIF(ExtendedAmount > 2000, 'Y', 'N') AS IsLargeOrderFlag

,IIF(ShipDateKey > DueDateKey, 'Y', 'N') AS IsPastDueFlag

,IIF(ShipDateKey < 20140101, 'Y', 'N') AS IsShippedFlag

FROM " + @[User::SourceTable]

M – Execute SQL Task

N – Execute SQL Task Query

O – Execute SQL Task Parameter Mapping

P – Data Flow Task

Q – OLEDB Source

R – Lookup Transformation

S – Lookup Transformation Relationships

T – Derived Column Task (NULL handling)

U – OLEDB Destination

Please note that this Integration Services’ package is a sample for illustration purposes only. It is highly unlikely that you would be pulling data from one fact table and landing that data in another fact table. Instead, your data will be extracted directly from the source or more likely than not from a set of staging tables. Additionally, your data transformation process might be incremental rather than a truncate and reload. This will depend on the size of the data.

My goal was to show you a method of populating a junk dimension as well as how that junk dimension can be incorporated into a data transformation process. Now that the DimOrderFlag and FactInternetSalesTest are populated, let us look at the data. Here is a sample of FactInternetSalesTest:

V – Note the OrderFlagKey

Now a single join between FactInternetSalesTest and DimOrderFlag exposes the flag attributes.

SELECT B.IsLargeOrderFlag, B.IsPastDueFlag, B.IsPromotionalOrderFlag, B.IsShippedFlag
,A.*
FROM dbo.FactInternetSalesTest A
INNER JOIN dbo.DimOrderFlag B ON
A.OrderFlagKey = B.OrderFlagKey

SELECT B.IsLargeOrderFlag, B.IsPastDueFlag, B.IsPromotionalOrderFlag, B.IsShippedFlag

,A.*

FROM dbo.FactInternetSalesTest A

INNER JOIN dbo.DimOrderFlag B ON

A.OrderFlagKey = B.OrderFlagKey

W – Report query and result set

I hope that you find this information useful and are able to make use of a junk dimension at some point down the road. My aim was the to present an option to easily populate a junk dimension – particularly one that has a product of 2^N. The number two being the possible values, Boolean or Yes/No, and N being the number of attributes. Furthermore, you should see how a junk dimension could be a clean method of storing those miscellaneous type attributes. It can help keep your data model simple and provide easier access to those attributes for the end user.

9 Comments

Dave on July 18, 2019 at 7:52 pm

This was exactly the insight I was looking for. You made my week… Thanks!
Only variation I used was that instead of the Lookup and Derived Column transformations, I just handled all of that logic within my staging schema, where all of the necessary data already exists.
- Ken Adams, Key2 Consulting on July 23, 2019 at 1:48 pm
  
  Thanks for reading Dave and so glad the post helped you out.
ow on March 6, 2020 at 11:04 pm

How do you pick from a Junk dimension when browsing the cube? What does the user pick to select a row of the Junk Dimension? Is it the Junk Dimension Key (1, 2, 3, etc.)? That would not make sense to the user.
Ken Adams on March 10, 2020 at 1:56 pm

Hi Ow. If you’re browsing your cube take the example of the Order Flag dimension from above. You should be able to drag the individual Flag columns (IsShipped, IsPromotionalOrder, etc) as filters. The column names in the junk dimension should be meaningful enough so that the user understands those selections when accessing the cube via Excel or other tool. I hope that helps and thanks so much for reading.
sohi on May 26, 2020 at 4:42 am

Hi Ken, do we need to maintain multiple ‘unknown’ flags where some of the flag attributes are applicable while others aren’t?
- Ken Adams on June 15, 2020 at 1:14 pm
  
  Sohi. 1st off thanks for reading and for your comment. This is up to you and the business owners. Typically, for a miscellaneous dimension you want to limit the range of values – bit is a great data type for flag dimensions as your are limited to 0 and 1. A char(1) works well too. You could certainly have Y, N and U as possible combinations and build it out that way. I’d suggest working with the business owners to create the flags as they make sense. If a single flag is unknown for a row of data, or not applicable, you could certainly map that to N. This is one of those “it depends” answers. Hope that helps.
Ivan Alexander Urbina Madriz on December 16, 2020 at 5:14 pm

Hi Ken, Thank you so much, was great this published.

I’ve a question, in my fact table I can update multiple times the flagkey in the fact table for the same row.

Cheers!
camilla on March 23, 2022 at 12:23 am

what do we do when we need to add new column to the junk dimension? how do we update the existing junk dimensions with the new dimension without changing the surrogate key?
- Ken on April 13, 2022 at 3:08 pm
  
  Hi Camilla,
  First off thanks for reading! Adding a new column really changes the grain of that dimension. If you look at the sample code at the beginning adding another flag doubles the number of rows in the junk dimension –
  WITH cteFlags AS
  (
  SELECT ‘N’ AS Value
  UNION ALL
  SELECT ‘Y’
  )
  SELECT
  Flag0.Value,
  Flag1.Value,
  Flag2.Value,
  Flag3.Value,
  Flag4.Value,
  Flag5.Value,
  Flag6.Value,
  Flag7.Value
  FROM cteFlags Flag0
  CROSS JOIN cteFlags Flag1
  CROSS JOIN cteFlags Flag2
  CROSS JOIN cteFlags Flag3
  CROSS JOIN cteFlags Flag4
  CROSS JOIN cteFlags Flag5
  CROSS JOIN cteFlags Flag6
  CROSS JOIN cteFlags Flag7
  ;
  –128 combinations (6 flags)
  –256 combinations (7 flags)
  
  Junk dimensions can be inflexible in this respect. Ideally, you could identify all of those “yes/no” type values in your incoming data and model the junk dimension to include all the combinations upfront. In this case, I think you’re looking at a repopulation of the junk dimension and rebuilding of the associated fact table(s).
  -Ken

Trackbacks/Pingbacks

Why Data Warehouse Projects Fail | Key2 Consulting - […] people like us here at Key2 Consulting, as it can be easy to succumb to the excitement of creating…

Building a Data Warehouse: Populating and Using a Junk Dimension

A – ERD with seven individual dimensions

B – Sample data from narrow dimension

C – ERD with single junk dimension

D – Sample data from junk dimension

E – Script to create row junk dimension; 2^7 =128 rows

F – This query will become the source query in Integration Services.

G – The same schema as FactInternetSales with an additional column OrderFlagKey.

H – Note the identity specification and primary key constraint on OrderFlagKey.

I – This could easily be made into a stored procedure.

J – A screenshot of the table data from inside Management Studio.

K – Control Flow (Execute SQL and Data Flow Task)

L – Variables Window

M – Execute SQL Task

N – Execute SQL Task Query

O – Execute SQL Task Parameter Mapping

P – Data Flow Task

Q – OLEDB Source

R – Lookup Transformation

S – Lookup Transformation Relationships

T – Derived Column Task (NULL handling)

U – OLEDB Destination

V – Note the OrderFlagKey

W – Report query and result set

9 Comments

Trackbacks/Pingbacks

Recent Posts

Categories