Using Polybase for Improved Data Loading

February 23, 2022

Learn about Polybase, a feature in SQL Server 2016 that can help improve data loading processes. We share a client example as well.

By: Brad Harris

Are you looking for an effective data loading strategy from on-premise to Azure Data Lake services? Well we might have the solution for you.

In a recent project that I was working on, our team undertook discovery to validate a couple of different options for loading data from on-premise to a data lake. Out of the three solutions we implemented, one of those solutions involved using SQL PolyBase on-premise to load data to the data lake.

What is Polybase?

For those that have been around the SQL world for a while, Polybase has been around since SQL server version 2016. It allowed the setup of external data sources that could communicate with SQL Server without the need to install complicated third party adapters or linked servers.

Additionally, Polybase allowed the communication to happen in pure transact SQL language. So in essence, if you wanted SQL to talk to Hadoop or Oracle or anything else, you would just install Polybase on SQL Server and set up via scripts and external data sources like Hadoop, Oracle, and so on.

After that, you would run a couple of other scripts to set up your external file formats and wa lah! You would be able to start writing Transact SQL statements to create your external tables and insert data into those tables.

The data contained in the external tables is not written to SQL. In actuality, SQL Server becomes the metadata layer and makes the translation to write the data to your external data source in the format that you specified. Easy right?

An Example – Our Polybase Solution

Just to lay out the solution in a nutshell: in our environment, we had a large on-premise SQL server instance for which we had many databases that we wanted to transfer data to Azure Cloud (specifically to an Azure Data Lake Services storage account). The Polybase solution would involve setting up external tables within each database and then inserting the data from the non-external tables to the external tables.

Inserting this data into the external tables would effectively write the data to Azure Data Lake Services storage account in whatever format we chose. In our case, we chose the parquet format. All was good and well until we really started digging into the features and functionality of Polybase.

Although Polybase has its advantages in certain situations, it doesn’t come without its pitfalls. So, continue reading, as I have listed out what I like to call The Good, The Bad and the Ugly of Polybase ( queue the Good The Bad and the Ugly theme song)…

The Good

Allows for the creation of external data sources that point to the data lake

Allows for multiple file formats like Parquet, Avro, and CSV just to name a few. Those file formats allow support for different compression algorithms i.e. snappy, gzip etc…

Creation and insertion of data can be achieved with pure Transact SQL scripts

Insertion of data doesn’t require the creation of new external tables for each insert, meaning you can run the same insert statement against the same external table and you will get an additional file for that insert but not an additional table

External tables are stored per database, which in our case each database would have its own set of external tables to manage (which in my thinking would make it easier for DBAs to manage for deployments as they are not managing those external tables server-wide and some databases might not have external tables)

As previously stated, each insert statement into the external table creates a new file in the external source (in our case a new parquet file)

A very important consideration in our solution was that Polybase would eliminate the need to copy files to the file system and then design an orchestration piece that captures those files and copies them to the data lake. The write process takes place solely within the construct of inserting the data.

You would have instant access to data in the external source for any kind of on-premise reporting. If you were to use this external data for on-premise reporting , Polybase does allow support for querying data from external tables and joining to non-external data using Transact SQL.

Allows for scalability with installation of scale out nodes, allowing you to scale compute to optimize query performance.

The Bad

Does not allow for partitioning of the external tables. When creating an external table, the table is written to a particular location in the external source identified by the location clause. I have seen a couple of workarounds for this but it might be more advantageous just to load data to a landing zone and then create a pipeline that would further partition the data based on chosen partitioning scheme.

When an external table is dropped, the data remains within the external source. Not a big issue but does orphan the data in the external data source and would in our case turn a data lake into a data swamp.

Security for Polybase was another concern of ours. The Polybase setup requires a database-scoped-credential, which is effectively the authenticator to the external source. Within our testing, the database-scoped-credential was identity-based and tied to the secret key within the storage account. This credential would therefore have full access to the storage account. Ideally you would have a service account group that would allow some kind of pass through authentication from SQL Server to the data lake that would allow you to lock down the data lake to certain actions.

Polybase requires TCP/IP protocol and shared memory protocol to be enabled on the SQL server instance.

The Ugly

Installation requires the breakdown of the cluster to install. Cluster would have to be re-established and instance restarted. Installation of scale out groups might have further effects on the installation. This is not really a problem for a standalone instance as the installation just requires an additional add-on from the SQL Server installation media.
- Also to note: if Polybase is installed from the installation media to a SQL Server instance that has been patched to a certain level, the SQL Server instance will have to be re-patched in order to bring the SQL Polybase services up to the same level.
Certain data types are not allowed within Polybase external tables. I am not going to provide a full list of limitations but just to name a few:

XML is not supported and would have to be converted to nvarchar(xxxx).

Nvarchar(max) and Varchar(max) are not supported in Polybase and would have to be converted to nvarchar(4000) and varchar(8000) respectively.

This point became a real showstopper for our team. But for many of you reading, you could probably consider splitting your large fields into multiple fields and then putting them back together once they reach your external source (your milage may vary). For those of you that are curious about the fields that don’t meet the Polybase standards, I have copied a simple SQL statement here that will list out those fields:

Declare
 
@Counter INT
,@DatabaseName VarChar(100)
,@SQL nvarchar(max)
 
Set @Counter = 1
 
If Exists(Select [name] from tempdb.sys.tables where [name] like '#tmpDatabaseList%') BEGIN 
Drop table #tmpDatabaseList
END;
 
Select
ROW_NUMBER() OVER(Order by [name] ASC) as ID
,[Name]        
Into  #tmpDatabaseList
from sys.databases as a
Where
([Name] NOT IN (<Database Names to Ignore>) 
)
 
/*******************************************************
Loop through database to get tables with those datatypes
********************************************************/
Set @Counter = (SELECT Max(ID) from #tmpDatabaseList )
 
If Exists(Select [name] from tempdb.sys.tables where [name] like '#tmpDataTypeTable%') BEGIN 
Drop table #tmpDataTypeTable
END;
Create Table #tmpDataTypeTable 
([DatabaseName] varchar(300)
,SchemaName varchar(300)
,[TableName] varchar(300)
,[ColumnName] varchar(500)
,[ColumnSystemTypeID] int
,[ColumnUserTypeID] int
,[ColumnMaxLength] int
,[Is_User_Defined] bit
,[DataType] varchar(200)
)
 
WHILE @counter is not null
BEGIN
 
SET @DatabaseName = (Select [Name] from #tmpDatabaseList where ID = @Counter)
Set @SQL = 'Insert INTO #tmpDataTypeTable
SELECT  DatabaseName = ''' + @DatabaseName + ''' 
,s.[name],t.[name]
, c.[name]
, c.[system_type_id]
, c.[user_type_id]
,c.max_length
, y.[is_user_defined]
, Case When c.max_length = -1 then y.[name] + ''(max)'' Else y.[name] END
FROM [' + @DatabaseName + '].sys.tables  t
JOIN [' + @DatabaseName + '].sys.Schemas s on t.[Schema_ID] = s.[Schema_ID]
JOIN  [' + @DatabaseName + '].sys.columns c on t.[object_id]    = c.[object_id]
JOIN  [' + @DatabaseName + '].sys.types   y on c.[user_type_id] = y.[user_type_id]
WHERE 
(y.[name] IN (''geography'',''geometry'',''hierarchyid'',''image'',''text'',''ntext'',''sql_variant'',''timestamp'',''xml''))
 OR
 (y.[is_user_defined] = 1)
 OR
(y.name in (''varchar'',''nvarchar'') 
 AND c.max_length =-1
 );'
 
exec sp_executesql @SQL
 
SET @counter = (Select Max(ID) from #tmpDatabaseList where ID < @counter)
END
Select * from #tmpDataTypeTable

Declare

@Counter INT

,@DatabaseName VarChar(100)

,@SQL nvarchar(max)

Set @Counter = 1

If Exists(Select [name] from tempdb.sys.tables where [name] like '#tmpDatabaseList%') BEGIN

Drop table #tmpDatabaseList

END;

Select

ROW_NUMBER() OVER(Order by [name] ASC) as ID

,[Name]

Into #tmpDatabaseList

from sys.databases as a

Where

([Name] NOT IN (<Database Names to Ignore>)

)

/*******************************************************

Loop through database to get tables with those datatypes

********************************************************/

Set @Counter = (SELECT Max(ID) from #tmpDatabaseList )

If Exists(Select [name] from tempdb.sys.tables where [name] like '#tmpDataTypeTable%') BEGIN

Drop table #tmpDataTypeTable

END;

Create Table #tmpDataTypeTable

([DatabaseName] varchar(300)

,SchemaName varchar(300)

,[TableName] varchar(300)

,[ColumnName] varchar(500)

,[ColumnSystemTypeID] int

,[ColumnUserTypeID] int

,[ColumnMaxLength] int

,[Is_User_Defined] bit

,[DataType] varchar(200)

)

WHILE @counter is not null

BEGIN

SET @DatabaseName = (Select [Name] from #tmpDatabaseList where ID = @Counter)

Set @SQL = 'Insert INTO #tmpDataTypeTable

SELECT DatabaseName = ''' + @DatabaseName + '''

,s.[name],t.[name]

, c.[name]

, c.[system_type_id]

, c.[user_type_id]

,c.max_length

, y.[is_user_defined]

, Case When c.max_length = -1 then y.[name] + ''(max)'' Else y.[name] END

FROM [' + @DatabaseName + '].sys.tables t

JOIN [' + @DatabaseName + '].sys.Schemas s on t.[Schema_ID] = s.[Schema_ID]

JOIN [' + @DatabaseName + '].sys.columns c on t.[object_id] = c.[object_id]

JOIN [' + @DatabaseName + '].sys.types y on c.[user_type_id] = y.[user_type_id]

WHERE

(y.[name] IN (''geography'',''geometry'',''hierarchyid'',''image'',''text'',''ntext'',''sql_variant'',''timestamp'',''xml''))

(y.[is_user_defined] = 1)

(y.name in (''varchar'',''nvarchar'')

AND c.max_length =-1

);'

exec sp_executesql @SQL

SET @counter = (Select Max(ID) from #tmpDatabaseList where ID < @counter)

END

Select * from #tmpDataTypeTable

In Closing

As you can see, PolyBase does have its advantages and on the forefront looked like it was the solution for us. In the end however, because of the data type issue, we had to go a different route.

If you find yourself looking for a solution to effectively load on-premise data to the cloud, we here at Key2 have the professionals in-house that can guide you in making the best decisions to achieve the results you are looking for.

Thanks for Reading! Questions?

Thanks for reading! We hope you found this blog post useful. Feel free to let us know if you have any questions about this article by simply leaving a comment below. We will reply as quickly as we can.

Keep Your Data Analytics Knowledge Sharp

Get fresh Key2 content and more delivered right to your inbox!

About Us

Key2 Consulting is a boutique data analytics consultancy that helps business leaders make better business decisions. We are a Microsoft Gold-Certified Partner and are located in Atlanta, Georgia. Learn more here.

Related Content

Delta Lake – What it is and 7 Key Functionalities

By: Ken Adams Introduction Delta Lake is an open source storage layer that sits on top of cloud storage technology such as Azure Data Lake Storage or Amazon S3. The technology was introduced by Databricks in 2019, and all of the code is available here....

COVID-19’s Impact on the Data Industry and the Best BI Tools for 2020

By: Brad Lathrop How the COVID-19 Pandemic Is Impacting Data Industries We wrote an article earlier this year that discusses three of the most important business intelligence trends for 2020, which were derived from an excellent annual research...

What is Azure Databricks?

By: Phillip Sharpless Introduction to Azure Databricks Finding the right tools to manage your big data ecosystem can be a daunting task, as there seem to be a myriad of options, all advertising impressive-sounding features. One analytics platform that is...

1 Comment

Jeff Levy on February 25, 2022 at 4:59 pm

Interesting Article Brad! I had no idea Polybase could be used with an on-premise Solution….And also using it to extract data out of SQL Server…I have only used it to load data from ADLS into Azure Synapse.

How does the speed compare between what you’re doing here and loading Azure Synapse?