SQL Server 2019: Big Data Early Adoption Program Review

SQL Server 2019 – Key Features

At the 2018 Pass Summit conference in Seattle, I had the opportunity to attend many sessions about the new Big Data features being introduced in SQL Server 2019. One of Microsoft’s biggest announcements was the merged functionality of relational databases and the big data platform within a single release of SQL Server. SQL Server 2019 will be a one-stop shop for all of your analytical needs. The product will have the following key features:

Integrated with Apache HDFS / Spark and SQL
Extended Data Virtualization
Enhanced Polybase
Flexible scaling and deployment using the Kubernetes Platform

In addition to learning about SQL Server 2019, I was able to speak to many of the product owners / developers and get provisioned to the Early Adoption Program release. After checking out the tool, I thought I would share what I have learned about the new product.

Setup

Microsoft defines Data Virtualization as the process of viewing external data on a foreign platform within the domain of SQL Server. This process is realized with a SQL Server External Table that is pointed to an outside source.

Advantages:

Quick to create – In many cases, ETL can take days (if not weeks) to setup properly while external tables can be setup in a matter of minutes.
Easy to maintain – Oftentimes when a change is needed in an ETL pipeline, cascading effects occur and subsequent changes are required as well. Due to the abstract nature of an external table, these types of cascading changes are generally not required.
Less security and storage – Due to the fact that only a single copy of the data exists, additional security measures and storage considerations are not needed.

Disadvantages:

Limited ability to handle data – The ordering, aggregations, and predicate statements are still passed onto the source server. This can be problematic in the event that the host server handles many transactions with a high frequency. This event could cause the server to be inundated with too many requests and potentially cause disruption to business activities.

Polybase Enhancements

Polybase is the method in which external data is sourced and ingested into a SQL Server instance. SQL Server 2019 has added new Polybase connectors to outside sources which are used to ingest data into the Big Data Cluster. In addition to these new connectors, SQL Server has added upgraded Access control via Active Directory, as well as the ability to read HDFS files natively. This native read ability should allow for faster imports of data into the cluster.

Kubernetes Overview

Kubernetes is a platform used to orchestrate the creation and deployment of application containers. These containers are located on a Kubernetes pod and consist of applications such as Spark or SQL Server. In simple terms, these pods can be thought of as lightweight virtual machines.

Advantages:

Easy to scale – Due to the fact that Kubernetes is using container / VM technology, the ability to create new pods across a cluster is easy and fast. This is the primary advantage of the Kubernetes Platform.
Cross platform independent – Today, all Big Data technologies run on Linux, while many servers are running on Windows. With the use of Kubernetes, SQL Server can utilize Linux containers (running on a Windows machine) and bring these technologies together on a Windows Operating system.

SQL Server Big Data Cluster Overview

The SQL Server Big Data Cluster consists of a variety of components. As mentioned earlier in this article, data is ingested into the cluster via Polybase and external connectors. When brought into the cluster, the data can be persisted in either the Storage Pool or the Data Pool. Each Pool has its own purpose and will be discussed later on in this article. Access to data in the cluster can be acquired through queries on the SQL Server Master Instance. The SQL Server Master Instance has internal metadata on objects in the cluster and can locate data on the Storage Pool, Data Pool, and from External data sources (Data Virtualization).

Storage Pool

The storage pool largely consists of unstructured data persisted in HDFS Data Nodes. These nodes reside on one or many Kubernetes pods within the cluster. The data that is ingested into the Storage Pool is spread across all Kubernetes pods created in the SQL Big Data Cluster. Data ingestion can be achieved via Spark or TSQL.

Data Pool

The Data Pool is an alternative to persisting data, however the data is persisted in relational SQL Server instances called Shards. Each Shard is a collection of distributed data that is persisted in a set of partitioned SQL Server mdf data files. For maximum performance, these data files are cached and accessed via column store indexes for faster query processing. In addition, queries run against these data files are run in parallel for maximum throughput. Like the Storage Pool, data can be ingested into the Data Pool via Spark or TSQL.

Commute Pool

The Compute Pool is utilized for large queries and result sets sourced from data that resides in the Storage Pool, Data Pool, or an External data source. This pool performs the shuffling, aggregations, and filtering of the data before it is ultimately returned to the SQL Server Master Instance (where the query was originally initiated).

My SQL Server 2019 Big Data Cluster

In order to fully understand the capabilities of the SQL Server 2019 Big Data platform, I stood up a cluster of my own. Although I did not have access to a multi-node cluster on-Premises or in the Microsoft Azure Cloud Platform, I was able to successfully create a one node cluster on my work laptop. This can be achieved via the Minikube Kubernetes configuration.

The Minikube configuration is intended to be used for development, sandbox, and testing purposes. In order to activate the Minikube cluster with SQL Server, a password is needed to the Microsoft Container Registry. This password is provided via email when signing up for the Microsoft Early Adoption Program.

In addition to the password, strong hardware is needed to support Minikube (Microsoft recommends at least 32 GB of RAM). The specs on my laptop were actually quite good (32 GB of RAM, 8 logical processers, 1 TB Disk Space, Windows 10), however, when running the cluster my resources were approximately 98 -100% utilized. I would not recommend standing up a cluster with any lower resources.

After passing the initial requirements on the checklist, I went ahead and stood up my cluster via the configuration instructions on the Microsoft Website. The process was pretty straightforward and took as little as an hour to stand up a fully operational cluster.

When logging into the cluster, I implemented the Data Virtualization and Polybase features with relative ease. Additionally, I was able to successfully ingest data into the cluster via TSQL, but I had some difficulty ingesting data via Spark. With Spark, I followed the instructions on the Microsoft Website, but found that the documentation on the website did not match the functionality and parameters of the ingestion JAR file of the Spark Job. I plan to reach out to the Microsoft team and share my experiences to resolve this issue.

In addition to the difficulty encountered with Spark, I also found that the cluster had issues restarting when I would restart my laptop. In a few cases, I had to physically delete my cluster and rebuild a new one in order to get it back up and running. As a result, I lost all data in the cluster and had to re-ingest everything via TSQL. I am hoping that this happened as a result of my inexperience with the Minikube cluster rather than an inherent defect with the product.

Other difficulties I encountered involved navigating my Kubernetes cluster via the portal. The information on the portal was not well documented and was difficult to decipher. Shell Scripting skills (which I do not yet possess) are required to check container setups and configurations.

Conclusion

SQL Server 2019 provides users with a central data repository for Big Data and Relational Data needs. The product introduces increased speed for data consumption while reducing time spent on ETL and infrastructure management activities. This functionality is realized via the Kubernetes platform for easy scaling and cross operating system deployment. With this product, organizations can maximize operational efficiencies by focusing costs onto one data platform and perform data science activities in a faster manner.

References

Configure Minikube for SQL Server 2019 big data cluster deployments
https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-on-minikube?view=sql-server-ver15
The future of querying big data with Polybase and SQL Server
https://myignite.techcommunity.microsoft.com/sessions/66201
Deep dive on SQL Server and big data
https://myignite.techcommunity.microsoft.com/sessions/65967
Microsoft SQL Server 2019 Technical White Paper
https://info.microsoft.com/ww-landing-SQLDB-Microsoft-SQL-Server-WhitePaper.html
Introducing Microsoft SQL Server 2019 Big Data Clusters
https://cloudblogs.microsoft.com/sqlserver/2018/09/25/introducing-microsoft-sql-server-2019-big-data-clusters/
What are SQL Server big data clusters?
https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sqlallproducts-allversions

Questions?

Do you have any questions or topic ideas related to BI, analytics, the cloud, machine learning, SQL Server, (Star Wars), or anything else of the like that you’d like us to write about? Leave us a comment below, and we’ll see what we can do.

Keep your data analytics sharp by subscribing to our mailing list!

Thanks for reading! Keep your knowledge sharp by subscribing to our mailing list to receive fresh Key2 content around data analytics, business intelligence, data warehousing, and more!

Key2 Consulting is a data warehousing and business intelligence company located in Atlanta, Georgia. We create and deliver custom data warehouse solutions, business intelligence solutions, and custom applications.

SQL Server 2019: Big Data Early Adoption Program Review

SQL Server 2019 – Key Features

Setup

Polybase Enhancements

Kubernetes Overview

SQL Server Big Data Cluster Overview

Storage Pool

Data Pool

Commute Pool

My SQL Server 2019 Big Data Cluster

Conclusion

References

Questions?

Keep your data analytics sharp by subscribing to our mailing list!

Key2 Consulting is a data warehousing and business intelligence company located in Atlanta, Georgia. We create and deliver custom data warehouse solutions, business intelligence solutions, and custom applications.

Recent Posts

Categories