By: Brad Harris
Are you looking for an effective strategy for loading data from on-premises systems to Azure Data Lake Storage? Well, we might have the solution for you.
In a recent project I was working on, our team ran a discovery effort to validate several options for loading data from on-premises SQL Server to a data lake. One of the three solutions we implemented used SQL Server PolyBase on-premises to load data into the data lake.
What is PolyBase?
For those who have been around the SQL world for a while, PolyBase has been part of the product since SQL Server 2016. It allows you to set up external data sources that SQL Server can communicate with, without installing complicated third-party adapters or configuring linked servers.
Additionally, PolyBase lets that communication happen in pure Transact-SQL. In essence, if you want SQL Server to talk to Hadoop, Oracle, or anything else, you install PolyBase on the instance and then create the external data sources (Hadoop, Oracle, and so on) via scripts.
After that, you run a couple of other scripts to set up your external file formats and voilà! You can start writing Transact-SQL statements to create your external tables and insert data into them.
The data contained in the external tables is not written to SQL Server itself. In actuality, SQL Server becomes the metadata layer and translates your statements into writes against the external data source, in the format you specified. Easy, right? The sketch below shows roughly what those setup scripts look like.
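To make that concrete, here is a minimal sketch of the external data source and file format pieces for an Azure Data Lake Storage target. All object names and the container URL are hypothetical, the LOCATION prefix (abfss:// vs. wasbs://) depends on your SQL Server version and storage configuration, and a database scoped credential named AdlsCredential is assumed to exist already (more on that under The Bad).

```sql
-- Minimal sketch: external data source and file format for a data lake target.
-- Names and the URL are hypothetical; AdlsCredential is assumed to exist.
CREATE EXTERNAL DATA SOURCE AdlsDataSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://datalake@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

-- Parquet files compressed with Snappy.
CREATE EXTERNAL FILE FORMAT ParquetSnappyFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
```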
An Example – Our PolyBase Solution
To lay out the solution in a nutshell: in our environment, we had a large on-premises SQL Server instance with many databases whose data we wanted to transfer to Azure (specifically, to an Azure Data Lake Storage account). The PolyBase solution involved setting up external tables within each database and then inserting the data from the regular (non-external) tables into the external tables.
Inserting data into the external tables effectively writes it to the Azure Data Lake Storage account in whatever format we chose; in our case, we chose Parquet. All was well and good until we really started digging into the features and functionality of PolyBase. A simplified version of the external table and insert is sketched below.
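Here is roughly what that pattern looks like, using a hypothetical dbo.SalesOrder table and the external data source and file format from the earlier sketch. Note that inserting into external tables requires the allow polybase export setting to be enabled.

```sql
-- One-time instance setting that permits INSERT into external tables.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- Hypothetical external table; LOCATION is the data lake folder where the
-- Parquet files will land.
CREATE EXTERNAL TABLE dbo.SalesOrder_Ext (
    SalesOrderID INT,
    CustomerID   INT,
    OrderDate    DATETIME2(3),
    TotalDue     DECIMAL(19, 4)
)
WITH (
    LOCATION    = '/landing/sales/salesorder/',
    DATA_SOURCE = AdlsDataSource,
    FILE_FORMAT = ParquetSnappyFormat
);

-- Each insert writes a new set of Parquet files under the LOCATION folder.
INSERT INTO dbo.SalesOrder_Ext (SalesOrderID, CustomerID, OrderDate, TotalDue)
SELECT SalesOrderID, CustomerID, OrderDate, TotalDue
FROM dbo.SalesOrder;
```

Re-running the same insert simply adds more files under the same folder, which is the behavior called out in the list below.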
Although PolyBase has its advantages in certain situations, it doesn't come without its pitfalls. So continue reading, as I have listed out what I like to call The Good, The Bad, and The Ugly of PolyBase (cue The Good, the Bad and the Ugly theme song)…
The Good
- Allows for the creation of external data sources that point to the data lake
- Supports multiple file formats such as Parquet, ORC, and delimited text (CSV), just to name a few. Those file formats also support different compression algorithms (e.g. Snappy, Gzip).
- Creation of external tables and insertion of data can be achieved with pure Transact-SQL scripts.
- Inserting data doesn't require creating a new external table for each insert. You can run the same insert statement against the same external table repeatedly; each insert produces an additional file, not an additional table.
- External tables are stored per database, so in our case each database would have its own set of external tables to manage. In my thinking this makes deployments easier for DBAs, since they are not managing external tables server-wide and some databases might not have external tables at all.
- As previously stated, each insert statement against an external table creates a new file in the external source (in our case, a new Parquet file).
- A very important consideration in our solution was that PolyBase would eliminate the need to copy files to the file system and then build an orchestration piece to capture those files and copy them to the data lake. The write happens entirely within the insert itself.
- You have instant access to the data in the external source for any kind of on-premises reporting: PolyBase supports querying external tables and joining them to non-external data using Transact-SQL (see the sketch after this list).
- Scales out via the installation of scale-out nodes, letting you add compute to optimize query performance.
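As a quick illustration of that reporting point, here is a hedged sketch that joins the hypothetical external table from earlier to a hypothetical local dbo.Customer table:

```sql
-- Sketch: joining external (data lake) data to a local table for reporting.
SELECT c.CustomerName,
       COUNT(*)        AS OrderCount,
       SUM(e.TotalDue) AS TotalSales
FROM dbo.SalesOrder_Ext AS e      -- external table, reads from the data lake
JOIN dbo.Customer       AS c      -- regular on-premises table
    ON c.CustomerID = e.CustomerID
GROUP BY c.CustomerName
ORDER BY TotalSales DESC;
```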
The Bad
- Does not allow for partitioning of external tables. When an external table is created, it is written to a single location in the external source identified by the LOCATION clause. I have seen a couple of workarounds for this, but it might be more advantageous just to load data to a landing zone and then create a pipeline that further partitions the data based on your chosen partitioning scheme.
- When an external table is dropped, the data remains in the external source. Not a big issue, but it does orphan the data in the external data source and would, in our case, turn a data lake into a data swamp.
- Security for PolyBase was another concern of ours. The PolyBase setup requires a database scoped credential, which is effectively the authenticator to the external source. In our testing, the database scoped credential was identity-based and tied to the secret key of the storage account, so the credential had full access to the storage account (a minimal example follows this list). Ideally, you would have a service account or group that allows some kind of pass-through authentication from SQL Server to the data lake, letting you lock the data lake down to certain actions.
- PolyBase requires the TCP/IP and Shared Memory protocols to be enabled on the SQL Server instance.
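For reference, here is a minimal sketch of the credential setup described above; the password, identity, and key values are placeholders.

```sql
-- A database master key is required before creating a database scoped credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- The credential referenced by the external data source. Because the secret is
-- the storage account access key, it effectively grants full access to the account.
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'any-identity-string',
     SECRET   = '<storage account access key>';
```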
The Ugly
- On a clustered instance, installation requires breaking down the cluster; the cluster then has to be re-established and the instance restarted. Installing scale-out groups might complicate the installation further. This is not really a problem for a standalone instance, where the installation just requires an additional feature from the SQL Server installation media.
- Also note: if PolyBase is installed from the installation media onto a SQL Server instance that has already been patched to a certain level, the instance will have to be re-patched in order to bring the PolyBase services up to the same level.
- Certain data types are not allowed in PolyBase external tables. I am not going to provide a full list of limitations, but to name a few:
- XML is not supported and would have to be converted to nvarchar(xxxx).
- nvarchar(max) and varchar(max) are not supported in PolyBase and would have to be converted to nvarchar(4000) and varchar(8000), respectively.
This point became a real showstopper for our team. But many of you reading could probably consider splitting your large fields into multiple fields and then putting them back together once they reach your external source (your mileage may vary; a rough sketch of that approach follows the script below). For those of you who are curious about which fields don't meet the PolyBase requirements, I have copied a simple SQL statement here that will list out those fields:
```sql
Declare @Counter INT
       ,@DatabaseName VarChar(100)
       ,@SQL nvarchar(max)

Set @Counter = 1

If Exists(Select [name] from tempdb.sys.tables where [name] like '#tmpDatabaseList%')
BEGIN
    Drop table #tmpDatabaseList
END;

Select ROW_NUMBER() OVER(Order by [name] ASC) as ID
      ,[Name]
Into #tmpDatabaseList
from sys.databases as a
Where ([Name] NOT IN (<Database Names to Ignore>))

/*******************************************************
Loop through database to get tables with those datatypes
********************************************************/
Set @Counter = (SELECT Max(ID) from #tmpDatabaseList)

If Exists(Select [name] from tempdb.sys.tables where [name] like '#tmpDataTypeTable%')
BEGIN
    Drop table #tmpDataTypeTable
END;

Create Table #tmpDataTypeTable
    ([DatabaseName] varchar(300)
    ,SchemaName varchar(300)
    ,[TableName] varchar(300)
    ,[ColumnName] varchar(500)
    ,[ColumnSystemTypeID] int
    ,[ColumnUserTypeID] int
    ,[ColumnMaxLength] int
    ,[Is_User_Defined] bit
    ,[DataType] varchar(200)
    )

WHILE @counter is not null
BEGIN
    SET @DatabaseName = (Select [Name] from #tmpDatabaseList where ID = @Counter)

    Set @SQL = 'Insert INTO #tmpDataTypeTable
        SELECT DatabaseName = ''' + @DatabaseName + '''
              ,s.[name]
              ,t.[name]
              ,c.[name]
              ,c.[system_type_id]
              ,c.[user_type_id]
              ,c.max_length
              ,y.[is_user_defined]
              ,Case When c.max_length = -1 then y.[name] + ''(max)'' Else y.[name] END
        FROM [' + @DatabaseName + '].sys.tables t
        JOIN [' + @DatabaseName + '].sys.Schemas s on t.[Schema_ID] = s.[Schema_ID]
        JOIN [' + @DatabaseName + '].sys.columns c on t.[object_id] = c.[object_id]
        JOIN [' + @DatabaseName + '].sys.types y on c.[user_type_id] = y.[user_type_id]
        WHERE (y.[name] IN (''geography'',''geometry'',''hierarchyid'',''image'',''text'',''ntext'',''sql_variant'',''timestamp'',''xml''))
            OR (y.[is_user_defined] = 1)
            OR (y.name in (''varchar'',''nvarchar'') AND c.max_length = -1);'

    exec sp_executesql @SQL

    SET @counter = (Select Max(ID) from #tmpDatabaseList where ID < @counter)
END

Select * from #tmpDataTypeTable
```
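And for completeness, here is a rough sketch of the field-splitting workaround mentioned above, assuming a hypothetical dbo.Document table with an nvarchar(max) Body column and a corresponding external table with three nvarchar(4000) part columns. The chunks would be stitched back together once the data lands in the lake.

```sql
-- Hypothetical workaround: split an nvarchar(max) column into nvarchar(4000)
-- chunks that PolyBase will accept, to be reassembled downstream.
INSERT INTO dbo.Document_Ext (DocumentID, BodyPart1, BodyPart2, BodyPart3)
SELECT DocumentID,
       CAST(SUBSTRING(Body,    1, 4000) AS nvarchar(4000)),
       CAST(SUBSTRING(Body, 4001, 4000) AS nvarchar(4000)),
       CAST(SUBSTRING(Body, 8001, 4000) AS nvarchar(4000))
FROM dbo.Document;
```

Keep in mind this sketch only covers values up to 12,000 characters; a real implementation would need as many part columns as your longest value requires.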
In Closing
As you can see, PolyBase does have its advantages and at the outset looked like the solution for us. In the end, however, because of the data type limitations, we had to go a different route.
If you find yourself looking for a way to effectively load on-premises data to the cloud, we here at Key2 have professionals in-house who can guide you in making the best decisions to achieve the results you are looking for.
Thanks for Reading! Questions?
Thanks for reading! We hope you found this blog post useful. Feel free to let us know if you have any questions about this article by simply leaving a comment below. We will reply as quickly as we can.