Data Warehousing | Featured | SQL Server | Technical Information

Building a Data Warehouse: Populating and Using a Junk Dimension

February 27, 2017

A junk dimension is seen occasionally inside of data warehouses. This type of dimension can be thought of as a flag table, or a collection of attributes that have low-cardinality. In this post, we show you how to build a data warehouse by populating a junk dimension.

By: Ken Adams

Sr. Business Intelligence Consultant at Key2 Consulting

A junk dimension is seen occasionally inside of data warehouses. This type of dimension can be thought of as a flag table, or a collection of attributes that have low-cardinality. This means that the values seen are not distinctive and are often duplicated. According to the site 1keydata.com, a junk dimension is defined as follows:

In data warehouse design, frequently we run into a situation where there are yes/no indicator fields in the source system. If we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues. A junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension.

On a recent project, I ran into a situation where our client wanted to calculate different types of exceptions. These exceptions indicated whether an employee was in compliance and authorized to work under federal law. The actual values were to be calculated as part of the data integration process and the result was either a Yes (“Y”) or a No (“N”). We landed on seven separate flags (or attributes) and a junk dimension seemed like a viable entity to handle the work.

The alternative in this situation would have been to create separate dimensions and have seven foreign keys linking the fact table to each dimension. This not only would have increased the amount of data stored in the fact table, but it also seemed like a maintenance nightmare and performance pitfall in the future. Had I taken this path, the ERD would have looked something like this:

A – ERD with seven individual dimensions

In this particular scenario, each dimension would have had two rows and only a handful of columns. Additionally, a query written against this model would require seven joins to return all of the attributes. The data inside one of the dimensions would have looked something like the following:

Flag1Key	IsFlag1Value
1	Y
2	N

B – Sample data from narrow dimension

My solution was to create a single table and incorporate the various combinations of flags. The nature of the data made it difficult, if not impossible, to have this dimension be driven off the source data. As I mentioned, I needed to calculate whether a row met certain conditions in order to be an exception. This logic required that I evaluate multiple columns during the data integration and ultimately use that result to look up a foreign key. Had the data allowed me to do so, I could have cross-joined the source tables in order to produce the product of the attribute combinations. The junk dimension and fact relationship is as follows:

C – ERD with single junk dimension

Now a single join between the tables can produce the same product as the seven joins in the previous example. The data inside the dimension looks like this:

D – Sample data from junk dimension

Populating this dimension might be an issue. In this particular scenario, the underlying data did not present in a manner that would allow this dimension to be derived from the data. That being the case, I could have done the work in Excel or with a lengthy INSERT…UNION ALL. Once you get past a handful of columns, say three or four, the task becomes quite arduous and prone to error. Rather than trying to do the work manually, I used the following query:

WITH cteFlags AS
(
SELECT 'N' AS Value
UNION ALL
SELECT 'Y'
)
SELECT
Flag0.Value,
Flag1.Value,
Flag2.Value,
Flag3.Value,
Flag4.Value,
Flag5.Value,
Flag6.Value
FROM cteFlags Flag0
CROSS JOIN cteFlags Flag1
CROSS JOIN cteFlags Flag2
CROSS JOIN cteFlags Flag3
CROSS JOIN cteFlags Flag4
CROSS JOIN cteFlags Flag5
CROSS JOIN cteFlags Flag6
;

WITH cteFlags AS

(

SELECT 'N' AS Value

UNION ALL

SELECT 'Y'

)

SELECT

Flag0.Value,

Flag1.Value,

Flag2.Value,

Flag3.Value,

Flag4.Value,

Flag5.Value,

Flag6.Value

FROM cteFlags Flag0

CROSS JOIN cteFlags Flag1

CROSS JOIN cteFlags Flag2

CROSS JOIN cteFlags Flag3

CROSS JOIN cteFlags Flag4

CROSS JOIN cteFlags Flag5

CROSS JOIN cteFlags Flag6

;

E – Script to create row junk dimension; 2^7 =128 rows

This simple CTE contains the possible values and the subsequent CROSS JOIN statements handle the possible combinations of those flags. You could add or remove levels based on the number of columns required. In my case, I took this code and created a stored procedure that truncated and repopulated the dimension table.

Now that you have a method of populating the junk dimension, it is now a matter of linking that to the fact table. This will likely be done during the data transformation phase where the data is moved from the source system into the star schema. The flags values will be derived from the transaction – or explicitly stated in data – and joined to the junk dimension which was created above. I want to show an example of the end-to-end process using the AdventureWorksDW database and some flag attributes that I fabricated.

DimOrderFlag Example:

From the AdventureWorksDW, I chose the fact table FactInternetSales as my test subject. Then I came up with a series of flags that seemed to make sense in conjunction with the sales data. Those flags are IsPromotionalOrderFlag, IsLargeOrderFlag, IsPastDueFlag, IsPastDueFlag, and IsShippedFlag. The possible values for each of the flags is either Yes (“Y”) or No (“N”). The logic to derive the flags is as follows:

SELECT
IIF(PromotionKey &gt; 1, 'Y', 'N') AS IsPromotionalOrderFlag
,IIF(ExtendedAmount &gt; 2000, 'Y', 'N') AS IsLargeOrderFlag
,IIF(ShipDateKey &gt; DueDateKey, 'Y', 'N') AS IsPastDueFlag
,IIF(ShipDateKey &lt; 20140101, 'Y', 'N') AS IsShippedFlag
,*
FROM [AdventureWorksDW].[dbo].[FactInternetSales]

SELECT

IIF(PromotionKey > 1, 'Y', 'N') AS IsPromotionalOrderFlag

,IIF(ExtendedAmount > 2000, 'Y', 'N') AS IsLargeOrderFlag

,IIF(ShipDateKey > DueDateKey, 'Y', 'N') AS IsPastDueFlag

,IIF(ShipDateKey < 20140101, 'Y', 'N') AS IsShippedFlag

FROM [AdventureWorksDW].[dbo].[FactInternetSales]

F – This query will become the source query in Integration Services.

Next, I want to create a secondary copy of the existing FactInternetSales table that includes a new foreign key reference to a new junk dimension. The junk dimension will be named DimOrderFlag and it will contain the flag values mentioned above as well as a surrogate key representing the combination of those values. I created the table FactInternetSalesTest – the DDL for that table is below:

CREATE TABLE [dbo].[FactInternetSalesTest](
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[DueDateKey] [int] NOT NULL,
[ShipDateKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[SalesTerritoryKey] [int] NOT NULL,
[OrderFlagKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [tinyint] NOT NULL,
[RevisionNumber] [tinyint] NOT NULL,
[OrderQuantity] [smallint] NOT NULL,
[UnitPrice] [money] NOT NULL,
[ExtendedAmount] [money] NOT NULL,
[UnitPriceDiscountPct] [float] NOT NULL,
[DiscountAmount] [float] NOT NULL,
[ProductStandardCost] [money] NOT NULL,
[TotalProductCost] [money] NOT NULL,
[SalesAmount] [money] NOT NULL,
[TaxAmt] [money] NOT NULL,
[Freight] [money] NOT NULL,
[CarrierTrackingNumber] [nvarchar](25) NULL,
[CustomerPONumber] [nvarchar](25) NULL,
[OrderDate] [datetime] NULL,
[DueDate] [datetime] NULL,
[ShipDate] [datetime] NULL,
CONSTRAINT [PK_FactInternetSalesTest_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED
(
[SalesOrderNumber] ASC,
[SalesOrderLineNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[FactInternetSalesTest](

[ProductKey] [int] NOT NULL,

[OrderDateKey] [int] NOT NULL,

[DueDateKey] [int] NOT NULL,

[ShipDateKey] [int] NOT NULL,

[CustomerKey] [int] NOT NULL,

[PromotionKey] [int] NOT NULL,

[CurrencyKey] [int] NOT NULL,

[SalesTerritoryKey] [int] NOT NULL,

[OrderFlagKey] [int] NOT NULL,

[SalesOrderNumber] [nvarchar](20) NOT NULL,

[SalesOrderLineNumber] [tinyint] NOT NULL,

[RevisionNumber] [tinyint] NOT NULL,

[OrderQuantity] [smallint] NOT NULL,

[UnitPrice] [money] NOT NULL,

[ExtendedAmount] [money] NOT NULL,

[UnitPriceDiscountPct] [float] NOT NULL,

[DiscountAmount] [float] NOT NULL,

[ProductStandardCost] [money] NOT NULL,

[TotalProductCost] [money] NOT NULL,

[SalesAmount] [money] NOT NULL,

[TaxAmt] [money] NOT NULL,

[Freight] [money] NOT NULL,

[CarrierTrackingNumber] [nvarchar](25) NULL,

[CustomerPONumber] [nvarchar](25) NULL,

[OrderDate] [datetime] NULL,

[DueDate] [datetime] NULL,

[ShipDate] [datetime] NULL,

CONSTRAINT [PK_FactInternetSalesTest_SalesOrderNumber_SalesOrderLineNumber] PRIMARY KEY CLUSTERED

(

[SalesOrderNumber] ASC,

[SalesOrderLineNumber] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

G – The same schema as FactInternetSales with an additional column OrderFlagKey.

The related junk dimension will include a surrogate key named OrderFlagKey and the four flag columns. The column OrderFlagKey is an identity and primary key for the table and as shown earlier is a foreign key in the FactInternetSalesTest table. The DDL for DimOrderFlag is seen below:

CREATE TABLE [dbo].[DimOrderFlag](
[OrderFlagKey] [smallint] IDENTITY(1,1) NOT NULL,
[IsShippedFlag] [char](1) NULL,
[IsPastDueFlag] [char](1) NULL,
[IsLargeOrderFlag] [char](1) NULL,
[IsPromotionalOrderFlag] [char](1) NULL,
[UnknownFlag] [char](1) NULL,
CONSTRAINT [PK_DimOrderFlag_OrderFlagKey] PRIMARY KEY CLUSTERED
(
[OrderFlagKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[DimOrderFlag](

[OrderFlagKey] [smallint] IDENTITY(1,1) NOT NULL,

[IsShippedFlag] [char](1) NULL,

[IsPastDueFlag] [char](1) NULL,

[IsLargeOrderFlag] [char](1) NULL,

[IsPromotionalOrderFlag] [char](1) NULL,

[UnknownFlag] [char](1) NULL,

CONSTRAINT [PK_DimOrderFlag_OrderFlagKey] PRIMARY KEY CLUSTERED

(

[OrderFlagKey] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

H – Note the identity specification and primary key constraint on OrderFlagKey.

Now comes the population of DimOrderFlag. We previously saw the guts of a script for creating the permutations for seven flags. DimOrderFlag contains four flags so we can actually pare down CTE/CROSS JOIN to four levels and wrap it in the DML required for getting the output into the dimension table. The script for doing just that is as follows:

--truncate dimension
TRUNCATE TABLE dbo.DimOrderFlag;
--check for the presence of unknown row; then insert
IF (SELECT COUNT(*) FROM dbo.DimOrderFlag WHERE UnknownFlag = 'Y') = 0
BEGIN
SET IDENTITY_INSERT dbo.DimOrderFlag ON;
INSERT INTO dbo.DimOrderFlag(OrderFlagKey, IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag, UnknownFlag)
SELECT -1, 'N', 'N', 'N', 'N', 'Y';
SET IDENTITY_INSERT dbo.DimOrderFlag OFF;
END

;WITH cteFlags AS
(
SELECT 'N' AS Value
UNION ALL
SELECT 'Y'
)
INSERT INTO dbo.DimOrderFlag(IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag)
SELECT
Flag0.Value,
Flag1.Value,
Flag2.Value,
Flag3.Value
FROM
cteFlags Flag0
CROSS JOIN
cteFlags Flag1
CROSS JOIN
cteFlags Flag2
CROSS JOIN
cteFlags Flag3
;

--truncate dimension

TRUNCATE TABLE dbo.DimOrderFlag;

--check for the presence of unknown row; then insert

IF (SELECT COUNT(*) FROM dbo.DimOrderFlag WHERE UnknownFlag = 'Y') = 0

BEGIN

SET IDENTITY_INSERT dbo.DimOrderFlag ON;

INSERT INTO dbo.DimOrderFlag(OrderFlagKey, IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag, UnknownFlag)

SELECT -1, 'N', 'N', 'N', 'N', 'Y';

SET IDENTITY_INSERT dbo.DimOrderFlag OFF;

END

;WITH cteFlags AS

(

SELECT 'N' AS Value

UNION ALL

SELECT 'Y'

)

INSERT INTO dbo.DimOrderFlag(IsShippedFlag, IsPastDueFlag, IsLargeOrderFlag, IsPromotionalOrderFlag)

SELECT

Flag0.Value,

Flag1.Value,

Flag2.Value,

Flag3.Value

FROM

cteFlags Flag0

CROSS JOIN

cteFlags Flag1

CROSS JOIN

cteFlags Flag2

CROSS JOIN

cteFlags Flag3

;

– set the unknown flag column for all known rows

UPDATE dbo.DimOrderFlag SET UnknownFlag = 'N' WHERE OrderFlagKey gt; 0;

1	UPDATE dbo.DimOrderFlag SET UnknownFlag = 'N' WHERE OrderFlagKey gt; 0;

I – This could easily be made into a stored procedure.

The result of that script looks like this:

J – A screenshot of the table data from inside Management Studio.

With these objects in place, you are ready to run a test load of the fact table FactInternetSalesTest. I put together a simple package in Integration Services that is fairly barebones. In fact, it is a full refresh of the FactInternetSalesTest table which is likely not acceptable in a production environment. I captured a few screenshots of the Integration Services package.

K – Control Flow (Execute SQL and Data Flow Task)

L – Variables Window

Here are the definitions for the variables seen above:

– DestinationTable = dbo.FactInternetSalesTest

– SourceTable = dbo.FactInternetSales

– SourceQuery =

SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
,IIF(PromotionKey &gt; 1, 'Y', 'N') AS IsPromotionalOrderFlag
,IIF(ExtendedAmount &gt; 2000, 'Y', 'N') AS IsLargeOrderFlag
,IIF(ShipDateKey &gt; DueDateKey, 'Y', 'N') AS IsPastDueFlag
,IIF(ShipDateKey &lt; 20140101, 'Y', 'N') AS IsShippedFlag
FROM " + @[User::SourceTable]

SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate

,IIF(PromotionKey > 1, 'Y', 'N') AS IsPromotionalOrderFlag

,IIF(ExtendedAmount > 2000, 'Y', 'N') AS IsLargeOrderFlag

,IIF(ShipDateKey > DueDateKey, 'Y', 'N') AS IsPastDueFlag

,IIF(ShipDateKey < 20140101, 'Y', 'N') AS IsShippedFlag

FROM " + @[User::SourceTable]

M – Execute SQL Task

N – Execute SQL Task Query

O – Execute SQL Task Parameter Mapping

P – Data Flow Task

Q – OLEDB Source

R – Lookup Transformation

S – Lookup Transformation Relationships

T – Derived Column Task (NULL handling)

U – OLEDB Destination

Please note that this Integration Services’ package is a sample for illustration purposes only. It is highly unlikely that you would be pulling data from one fact table and landing that data in another fact table. Instead, your data will be extracted directly from the source or more likely than not from a set of staging tables. Additionally, your data transformation process might be incremental rather than a truncate and reload. This will depend on the size of the data.

My goal was to show you a method of populating a junk dimension as well as how that junk dimension can be incorporated into a data transformation process. Now that the DimOrderFlag and FactInternetSalesTest are populated, let us look at the data. Here is a sample of FactInternetSalesTest:

populated junk dimension

V – Note the OrderFlagKey

Now a single join between FactInternetSalesTest and DimOrderFlag exposes the flag attributes.

SELECT B.IsLargeOrderFlag, B.IsPastDueFlag, B.IsPromotionalOrderFlag, B.IsShippedFlag
,A.*
FROM dbo.FactInternetSalesTest A
INNER JOIN dbo.DimOrderFlag B ON
A.OrderFlagKey = B.OrderFlagKey

SELECT B.IsLargeOrderFlag, B.IsPastDueFlag, B.IsPromotionalOrderFlag, B.IsShippedFlag

,A.*

FROM dbo.FactInternetSalesTest A

INNER JOIN dbo.DimOrderFlag B ON

A.OrderFlagKey = B.OrderFlagKey

SQL report query and result set

W – Report query and result set

I hope that you find this information useful and are able to make use of a junk dimension at some point down the road. My aim was the to present an option to easily populate a junk dimension – particularly one that has a product of 2^N. The number two being the possible values, Boolean or Yes/No, and N being the number of attributes. Furthermore, you should see how a junk dimension could be a clean method of storing those miscellaneous type attributes. It can help keep your data model simple and provide easier access to those attributes for the end user.

Related Content

Rebuilding Indexes Based on Partition

Author: Austin Dolezal Many businesses are wasting time and money cleaning up their data when small fixes and tweaks could optimize their processing time and storage space. Database indexes and partitions are often created and...

5 Key Takeaways From PASS Data Community Summit 2022

The PASS Data Community Summit 2022 was a blast! A handful of people from our company attend the conference every year and this year was no exception. The knowledge gained, connections made, and fun enjoyed in Seattle is always worth the investment. Here are our five...

How to Delete Data From a Large Partitioned Table

By: Ken Adams The Problem I recently ran across an issue on a project that required a large amount of data to be removed from a table, specifically an issue with the loading of the table causing duplication of data. All non-clustered indexes had been removed...