By: Ken Adams
The Problem
I recently ran across an issue on a project that required removing a large amount of data from a table. Specifically, a problem with the table's load process had caused duplicate data.
All non-clustered indexes had been removed before the data load to speed it up. That included a unique index, so we ended up with duplicates. The data loading portion of this could be a blog post for another day.
The table was too large (more than 10 billion rows) to flush and reload within the available timeframe, so the alternative was to remove the duplicate records. A similar scenario is data archival, where you want to physically delete older data, possibly for a specific customer or group of customers.
In our situation, the table was partitioned. We were in a production environment and had to be very cognizant of competing resources. The primary concern was making sure the transaction log didn’t blow out the storage drive.
This can happen when you perform large delete operations in a single transaction. The answer to this problem was removing the data in smaller chunks, or batches, to lessen the impact on the transaction log.
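Conceptually, the pattern is a DELETE TOP (n) inside a loop that exits once a pass deletes nothing. Here is a minimal sketch with a placeholder table name and batch size (the real scripts for this post come later):

--minimal sketch of the batched delete pattern
--dbo.SomeLargeTable, SomeFilterColumn, and the 100000 batch size are placeholders
DECLARE @Rows INT = 1;

WHILE @Rows > 0
BEGIN
    DELETE TOP (100000)
    FROM dbo.SomeLargeTable
    WHERE SomeFilterColumn = 'some value';

    SET @Rows = @@ROWCOUNT; --loop ends once a batch deletes nothing
END

Because each pass is its own transaction, the log space used by one batch can be reused (under the simple recovery model) before the next batch runs, instead of accumulating in one giant transaction.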
Setup
For the purposes of this post, I used a copy of the AdventureWorksDW database which you can find here. I’m running the scripts on a fairly small Oracle VirtualBox machine with SQL Server 2019 installed.
The AdventureWorksDW copy I had was from an older backup, but I did change the compatibility level to 150 before running anything. This really shouldn't matter, but I thought I'd point it out. The database was also set to the simple recovery model.
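If your copy needs the same adjustments, both can be made with a couple of ALTER DATABASE statements (substitute your own database name):

--adjust the database name to match your copy of AdventureWorksDW
ALTER DATABASE AdventureWorksDW SET COMPATIBILITY_LEVEL = 150;
ALTER DATABASE AdventureWorksDW SET RECOVERY SIMPLE;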
I chose one of the larger fact tables, FactResellerSales, as my subject. The copy that came with the database had around 60K rows, which was not enough, so I set up a new table, FactResellerSalesPartitioned, partitioned by year on OrderDate.
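I haven't reproduced my exact table definition here, but a yearly partitioning setup on OrderDate can be sketched roughly like this (the boundary dates, the single PRIMARY filegroup, and the object names are illustrative, not necessarily what I used):

--illustrative sketch of partitioning by year on OrderDate
CREATE PARTITION FUNCTION pf_OrderDateYear (DATETIME)
AS RANGE RIGHT FOR VALUES ('2011-01-01', '2012-01-01', '2013-01-01', '2014-01-01');

CREATE PARTITION SCHEME ps_OrderDateYear
AS PARTITION pf_OrderDateYear ALL TO ([PRIMARY]);

--copy the FactResellerSales structure without any rows
SELECT *
INTO dbo.FactResellerSalesPartitioned
FROM dbo.FactResellerSales
WHERE 1 = 0;

--building the clustered index on the partition scheme moves the table onto it
CREATE CLUSTERED INDEX cix_FactResellerSalesPartitioned_OrderDate
    ON dbo.FactResellerSalesPartitioned (OrderDate)
    ON ps_OrderDateYear (OrderDate);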
To grow the data, I ran the script below. It loads FactResellerSales into the new partitioned table and then executes 20 more inserts of the same data with randomized dates:
TRUNCATE TABLE dbo.FactResellerSalesPartitioned;

--initial insert
INSERT INTO dbo.FactResellerSalesPartitioned
(
    ProductKey, OrderDateKey, DueDateKey, ShipDateKey, ResellerKey, EmployeeKey, PromotionKey,
    CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber,
    OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount,
    ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber,
    CustomerPONumber, OrderDate, DueDate, ShipDate
)
SELECT
    ProductKey, OrderDateKey, DueDateKey, ShipDateKey, ResellerKey, EmployeeKey, PromotionKey,
    CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber,
    OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount,
    ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber,
    CustomerPONumber, OrderDate, DueDate, ShipDate
FROM dbo.FactResellerSales WITH (HOLDLOCK TABLOCKX);
GO

--now multiple inserts of randomized rows
--(based on work by Jonathan M. Kehayias - https://github.com/ktaranov/sqlserver-kit/blob/master/Sample_Databases/Create_Enlarged_Adventureworks_Tables.sql)
INSERT INTO dbo.FactResellerSalesPartitioned
(
    ProductKey, OrderDateKey, DueDateKey, ShipDateKey, ResellerKey, EmployeeKey, PromotionKey,
    CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber,
    OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount,
    ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber,
    CustomerPONumber, OrderDate, DueDate, ShipDate
)
SELECT
    ProductKey, OrderDateKey, DueDateKey, ShipDateKey, ResellerKey, EmployeeKey, PromotionKey,
    CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber,
    OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount,
    ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber,
    CustomerPONumber,
    DATEADD(dd, number, OrderDate) AS OrderDate,
    DATEADD(dd, number, DueDate) AS DueDate,
    DATEADD(dd, number, ShipDate) AS ShipDate
FROM dbo.FactResellerSales AS soh WITH (HOLDLOCK TABLOCKX)
CROSS JOIN
(
    SELECT number
    FROM
    (
        SELECT TOP 10 number FROM master.dbo.spt_values WHERE type = N'P' ORDER BY NEWID() DESC
        UNION
        SELECT TOP 10 number FROM master.dbo.spt_values WHERE type = N'P' ORDER BY NEWID() DESC
        UNION
        SELECT TOP 10 number FROM master.dbo.spt_values WHERE type = N'P' ORDER BY NEWID() DESC
        UNION
        SELECT TOP 10 number FROM master.dbo.spt_values WHERE type = N'P' ORDER BY NEWID() DESC
    ) AS tab
) AS Randomizer
ORDER BY OrderDate, number
GO 20
Then I ran the delete script, which is made up of 3 primary parts: the identification of the partitions, creation of an outer cursor to loop through those partitions, and an inner delete filtered on the partition date range.
The inner delete filtered on a handful of ProductKeys, which I chose at random. Adding the ProductKey filter more closely mimicked the scenario I encountered at the client site.
Identify the Partitions:
--get partitions and load into temp table
DROP TABLE IF EXISTS #PartitionBoundaries;

SELECT
      CAST(p.partition_number AS BIGINT) AS [Partition]
    , CASE p.partition_number WHEN 1 THEN '1900-01-01 00:00:00.000' ELSE NULL END AS LowerBound
    , CASE WHEN prv.value IS NULL THEN '2079-06-06 00:00:00.000' ELSE CAST(prv.value AS DATETIME) END AS UpperBound
    , prv.value AS PartitionBoundary
    , stat.row_count AS RowCnt
INTO #PartitionBoundaries
FROM sys.tables t
INNER JOIN sys.schemas s
    ON s.schema_id = t.schema_id
INNER JOIN sys.partitions p
    ON t.object_id = p.object_id
    AND p.index_id = 1
INNER JOIN sys.indexes i
    ON p.object_id = i.object_id
    AND p.index_id = i.index_id
INNER JOIN sys.data_spaces ds
    ON i.data_space_id = ds.data_space_id
INNER JOIN sys.partition_schemes ps
    ON ds.data_space_id = ps.data_space_id
INNER JOIN sys.partition_functions pf
    ON ps.function_id = pf.function_id
LEFT JOIN sys.partition_range_values prv
    ON pf.function_id = prv.function_id
    AND p.partition_number = prv.boundary_id
LEFT JOIN sys.dm_db_partition_stats stat
    ON p.partition_id = stat.partition_id
WHERE 1 = 1
    AND is_ms_shipped = 0
    AND EXISTS
    (
        SELECT 1
        FROM sys.partitions p2
        WHERE t.object_id = p2.object_id
        AND partition_number > 1
    )
    AND t.name = 'FactResellerSalesPartitioned'
    AND s.name = 'dbo';
Declare a Partition Cursor:
--declare outer cursor for partitions
DECLARE PartitionCursor CURSOR STATIC FOR
    --put them in an outer cursor
    SELECT
          t1.[Partition] AS PartitionNumber
        , ISNULL(CAST(t1.LowerBound AS DATETIME), CAST(t2.PartitionBoundary AS DATETIME)) AS LowerBound
        , CAST(t1.UpperBound AS DATETIME) AS UpperBound
    FROM #PartitionBoundaries t1
    LEFT JOIN #PartitionBoundaries t2
        ON t1.[Partition] = t2.[Partition] + 1
    WHERE t1.RowCnt > 0 --narrowed down to partitions with data
    ORDER BY 1;

OPEN PartitionCursor;

FETCH NEXT FROM PartitionCursor
INTO @PartitionNumber, @LowerBound, @UpperBound;

WHILE @@FETCH_STATUS = 0
BEGIN
    --inner DELETE is here

    --grab the next partition for processing
    FETCH NEXT FROM PartitionCursor
    INTO @PartitionNumber, @LowerBound, @UpperBound;
END

--close and deallocate the cursor once all batches for a given table have been processed
CLOSE PartitionCursor;
DEALLOCATE PartitionCursor;
Inner DELETE With Very Simple Logging and Error Handling:
--use an inner loop to batch DELETE the rows within a partition
SET @Rows = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRY
        --start a log record
        INSERT INTO dbo.Delete_Log (PartitionNumber, LowerBound, UpperBound, StartDateTime)
        SELECT @PartitionNumber, @LowerBound, @UpperBound, GETDATE();

        SET @RowID = SCOPE_IDENTITY();

        DELETE TOP (500000) --this number can be changed
        FROM dbo.FactResellerSalesPartitioned
        WHERE OrderDate >= @LowerBound
            AND OrderDate < @UpperBound
            AND ProductKey IN --selected list of products
            (
                471
                , 458
                , 221
                , 234
                , 222
            );

        SET @Rows = @@ROWCOUNT;

        --after deleting the batch, update the Delete_Log record
        UPDATE dbo.Delete_Log
        SET EndDateTime = GETDATE()
            , RowCnt = @Rows
        WHERE PartitionNumber = @PartitionNumber
            AND RowID = @RowID;
    END TRY
    BEGIN CATCH
        DECLARE @Message VARCHAR(4000)
            , @Severity INT
            , @State INT;

        SELECT @Message = ERROR_MESSAGE()
            , @Severity = ERROR_SEVERITY()
            , @State = ERROR_STATE();

        RAISERROR(@Message, @Severity, @State);

        RETURN;
    END CATCH;
END
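The script writes to a simple dbo.Delete_Log table. I haven't included its definition, but something along these lines covers the columns the script touches (the data types are illustrative; RowID just needs to be an identity column so SCOPE_IDENTITY() can pick it up):

--assumed definition for the logging table used above
CREATE TABLE dbo.Delete_Log
(
    RowID INT IDENTITY(1, 1) NOT NULL PRIMARY KEY
    , PartitionNumber BIGINT NOT NULL
    , LowerBound DATETIME NULL
    , UpperBound DATETIME NULL
    , StartDateTime DATETIME NULL
    , EndDateTime DATETIME NULL
    , RowCnt INT NULL
);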
Execution
The size of FactResellerSalesPartitioned after the INSERT script was around 60 million records. With the table in place, I also shrank the transaction log to reset its size. The goal was to measure the impact of a delete in a single transaction versus a batched delete, and I also captured the run time of each. The results are below:
So what does it all mean? Well, the delete in a single transaction took half the time of the 500K batched delete and about one-sixth the time of the 100K batched delete. That is not insignificant, although the goal here was not raw performance.
In our situation, we were seeking the option that minimized the impact on the transaction log in the most performant manner possible. You'll note that the transaction log impact of the single-transaction delete was fairly significant, about four times greater than either of the batched deletes. While this might not scale linearly, the difference would be very significant on a table with billions of rows.
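If you want to reproduce the measurement, log size and usage can be checked before and after each run, and the log shrunk back down between runs, with something like this (the logical log file name here is an assumption; check sys.database_files for yours):

--check current log size and space used
DBCC SQLPERF(LOGSPACE);

SELECT total_log_size_in_bytes / 1048576.0 AS LogSizeMB
    , used_log_space_in_bytes / 1048576.0 AS UsedLogSpaceMB
FROM sys.dm_db_log_space_usage;

--reset the log between test runs (logical file name is an assumption)
DBCC SHRINKFILE (N'AdventureWorksDW_log', 1);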
Because the table is partitioned, each delete scans less data, which should help the performance of the batched delete. You do need to make sure the data types of the filter variables and the partitioning column match so partition elimination can occur; see below (OrderDate is a DATETIME):
DECLARE @Rows AS INT
    , @PartitionNumber INT
    , @LowerBound DATETIME
    , @UpperBound DATETIME
    , @RowID SMALLINT;

WHERE OrderDate >= @LowerBound
    AND OrderDate < @UpperBound
Conclusion
While you might look at the performance numbers in the example above and jump all over the single-transaction delete, be sure to consider the impact on your SQL Server environment. As mentioned, this was key in our case.
We could not risk a huge impact on storage from a growing transaction log on a server with ongoing ETL processes. We chose to batch, but we also used the existing partitions to help minimize the amount of data being scanned. As for the appropriate batch size, you will need to do some experimenting.
Is this a scenario you have run across at your job? How did you handle it? If it is something you have struggled with or have questions about, let us know. Feel free to speak up in the comments.
Thanks for Reading! Questions?
Thanks for reading! We hope you found this blog post useful. Feel free to let us know if you have any questions about this article by simply leaving a comment below. We will reply as quickly as we can.