Ken Adams, Senior Business Intelligence Consultant at Key2 Consulting

By: Ken Adams

 

The Problem

I recently ran across an issue on a project that required a large amount of data to be removed from a table, specifically an issue with the loading of the table causing duplication of data.

All non-clustered indexes had been removed before the data load to speed loading. This included a unique index, so therefore we ended up with duplicates. The data loading portion of this could be a blog post for another day.

This table was too large ( > 10 billion rows) to flush the data and refill given the timeframe. The alternative was to try and remove any duplicate records. An example of this might be data archival. In this instance, you’d want to physically delete data that was older and possibly for a specific customer or group of customers.

In our situation, the table was partitioned. We were in a production environment and had to be very cognizant of competing resources. The primary concern was making sure the transaction log didn’t blow out the storage drive.

This can happen when you perform large delete operations in a single transaction. The answer to this problem was removing the data in smaller chunks, or batches, to lessen the impact on the transaction log.
 

Setup

For the purposes of this post, I used a copy of the AdventureWorksDW database which you can find here. I’m running the scripts on a fairly small Oracle VirtualBox machine with SQL Server 2019 installed.

The AdventureWorksDW that I had was an older backup, but I did change the compatibility level to 150 before running anything. This really shouldn’t matter but thought I’d point it out. The database was also set to simple recovery mode.

I chose one of the larger fact tables to use as my subject. The table is FactResellerSales. The table that came with the database had around 60K rows and that was not enough. I set up a new table and partitioned that table by year on the OrderDate.

To grow the data, I ran the following script below. This script loads FactResellerSales into the new partitioned and then executes 25 inserts of the same data randomized:

Then I ran the delete script, which is made up of 3 primary parts: the identification of the partitions, creation of an outer cursor to loop through those partitions, and an inner delete filtered on the partition date range.

The inner delete filtered on a handful of ProductKeys, which I chose at random. Adding the ProductKey filter more closely mimicked the scenario I encountered at the client site.
 
 

Identify the Partitions:

 
 

Declare a Partition Cursor:

 
 

Inner DELETE With Very Simple Logging and Error Handling:

 

Execution

The size of the FactResellerSales after the INSERT script was around 60 million records. With the table in place, I also shrank the transaction log to reset the size. The goal is to measure the impact of a single transaction versus a batched delete. I also captured the run time of each. The results are below:
 

 
So what does it all mean? Well, the delete in a single transaction took half the time of the 500K batched delete and 1/6 of the time 100K batched delete. That is not insignificant, although the goal was not geared toward performance.

In our situation, we were seeking the option that minimized impact on the transaction log in the most performant manner possible. You’ll note that the transaction log impact was fairly significant for the single transaction delete – it was about 4 times greater versus both batched deletes. While this might not scale linearly, it would be very significant when you’re talking about a table that is billions of rows.

The impact of the table being partitioned means that less data is scanned for each delete, which should help the performance of the batched delete. You do need to make sure that the data type of the filter and partition column match – see below (OrderDate is a DATETIME):

 

Conclusion

While you might look at the performance in the example above and jump all over that, be sure to consider the impact on your SQL Server environment. As mentioned, this was key in our case.

We could not risk a huge impact on storage due to a growing transaction log on a server where there were ongoing ETL processes. We chose to batch but also used the existing partitions to help minimize the amount of data being scanned. In regards to the appropriate batch size, you will need to do some experimenting.

Is this a scenario you have run across at your job? How did you handle it? If it is something you have struggled with or have questions about let us know. Also, feel free to speak up in the comments.
 

Thanks for Reading! Questions?

Thanks for reading! We hope you found this blog post useful. Feel free to let us know if you have any questions about this article by simply leaving a comment below. We will reply as quickly as we can.
 

Keep Your Data Analytics Knowledge Sharp

Get fresh Key2 content and more delivered right to your inbox!
 

 
 

About Us

Key2 Consulting is a boutique data analytics consultancy that helps business leaders make better business decisions. We are a Microsoft Gold-Certified Partner and are located in Atlanta, Georgia. Learn more here.

Our Most Recent Blog Content
 
How to Group Power BI Measures

How to Group Power BI Measures

Learn how to group Power BI measures to better organize what you’re working with in Power BI and enhance your productivity!