By: Brad Lathrop
Azure Databricks is a Microsoft data analytics platform that’s “optimized for the Microsoft Azure cloud services platform” (1).
It has three environments developers can choose from to create applications:
- Databricks Machine Learning
- Databricks SQL
- Databricks Data Science & Engineering
For this blog post, we’re going to focus on the Databricks Machine Learning environment and its key features.
What is Databricks Machine Learning?
Databricks Machine Learning (DML) is a powerful, cloud-based machine learning platform that’s part of the Azure ecosystem and is designed primarily for data scientists. The tool helps users predict changes and improve efficiencies by harnessing the potential of machine learning.
According to Microsoft’s official documentation, DML is “an integrated end-to-end machine learning platform incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.” (2)
Following that definition, the four key features of Databricks Machine Learning are:
- Experiment tracking
- Model training
- Feature development and management
- Model serving
Let’s explore these features in some detail.
1. Experiment Tracking
Running “experiments” is a core part of machine learning. In an experiment, someone (often a data scientist) adjusts variables to see how the changes affect a machine learning algorithm and its outcomes.
Experiments go hand in hand with a task called “experiment tracking.” According to machine learning software company Neptune, experiment tracking “is the process of saving all experiment related information that you care about for every experiment you run.”
DML has experiment tracking capabilities built right into it, provided by a component called “MLflow Tracking.” MLflow Tracking gives users the ability to “log source properties, parameters, metrics, tags, and artifacts related to training a machine learning model.” (3)
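To make the idea of a tracked run concrete, here is a minimal plain-Python sketch of what gets recorded. The `Run` class and `best_run` helper are hypothetical stand-ins written for this post, not part of MLflow; the method names simply echo MLflow's `log_param`, `log_metric`, and `set_tag` functions, and the loss values are made up.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Run:
    """One tracked experiment run (hypothetical stand-in, not an MLflow class)."""
    name: str
    params: dict = field(default_factory=dict)   # inputs you chose for this run
    metrics: dict = field(default_factory=dict)  # metric name -> list of logged values
    tags: dict = field(default_factory=dict)     # free-form labels

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        # Keep every value so the metric's history across steps is preserved.
        self.metrics.setdefault(key, []).append(value)

    def set_tag(self, key, value):
        self.tags[key] = value


def best_run(runs, metric):
    """Return the run whose most recent value of `metric` is lowest."""
    return min(runs, key=lambda r: r.metrics[metric][-1])


# Two runs that differ only in learning rate; the loss values are invented.
runs = []
for lr, final_loss in [(0.1, 0.42), (0.01, 0.31)]:
    run = Run(name=f"lr={lr}")
    run.log_param("learning_rate", lr)
    run.log_metric("loss", 0.9)
    run.log_metric("loss", final_loss)
    run.set_tag("author", "example")
    runs.append(run)

winner = best_run(runs, "loss")
print(json.dumps({"best": winner.name, "params": winner.params}))
```

Because every run keeps its parameters next to its metrics, comparing runs later is a simple lookup rather than a memory exercise, which is the point of experiment tracking.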
2. Model Training
DML lets users train machine learning models. Models can be trained manually or automatically using the “Databricks AutoML” tool.
According to Microsoft’s official documentation, the AutoML tool “helps you automatically apply machine learning to a dataset. It prepares the dataset for model training and then performs and records a set of trials, creating, tuning, and evaluating multiple models.”
A Python notebook is provided for each trial that’s run. Each notebook contains the source code behind its trial, so users can review and modify the code.
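The "set of trials" idea can be illustrated with a small plain-Python sketch. This is not how Databricks AutoML is implemented; it only shows the general pattern of training several candidate models, recording each result, and keeping the best. The dataset, the two candidate models, and all names below are assumptions made up for illustration.

```python
from statistics import mean

# Toy dataset where y is roughly 2*x + 1. Values are invented for illustration.
train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0)]
valid = [(5, 11.1), (6, 12.8)]


def fit_mean(data):
    """Baseline candidate: always predict the average training target."""
    avg = mean(y for _, y in data)
    return lambda x: avg


def fit_line(data):
    """Ordinary least squares for a single input variable."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mse(model, data):
    """Mean squared error of `model` on (x, y) pairs."""
    return mean((model(x) - y) ** 2 for x, y in data)


# Each trial trains one candidate, evaluates it, and records the result.
candidates = {"mean_baseline": fit_mean, "linear_regression": fit_line}
trials = []
for name, fit in candidates.items():
    model = fit(train)
    trials.append({"model": name, "validation_mse": mse(model, valid)})

best = min(trials, key=lambda t: t["validation_mse"])
print(best["model"])
```

The `trials` list plays the role of the recorded trial history: every candidate stays visible for review, not just the winner.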
Microsoft has a free online module for training a machine learning model with Azure Databricks, which you can access here.
3. Feature Development and Management
The third Databricks Machine Learning feature we’re going to cover in this blog post is feature development. To best understand what this feature does, we need to understand what “feature” means in the context of machine learning.
According to Microsoft’s official documentation:
“Machine learning uses existing data to build a model to predict future outcomes. In almost all cases, the raw data requires preprocessing and transformation before it can be used to build a model. This process is called featurization or feature engineering, and the outputs of this process are called features – the building blocks of the model.”
So features are created when raw data is transformed, and that transformation is vital to successfully building a machine learning model. Features are the “building blocks” of the model. With that in mind, DML’s “Feature Store” is much simpler to understand.
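As a small illustration of featurization, the sketch below turns raw order records into numeric features a model could consume. The records, field names, and the `featurize` function are hypothetical and chosen only for this example.

```python
import math
from datetime import datetime

# Raw data as it might arrive from a source system (invented records).
raw_orders = [
    {"customer": "a", "placed_at": "2023-03-04T09:30:00", "amount": 120.0},
    {"customer": "a", "placed_at": "2023-03-06T18:05:00", "amount": 35.5},
]


def featurize(order):
    """Transform one raw order into model-ready features."""
    ts = datetime.fromisoformat(order["placed_at"])
    return {
        "hour_of_day": ts.hour,                     # 0-23
        "is_weekend": int(ts.weekday() >= 5),       # Saturday=5, Sunday=6
        "log_amount": math.log1p(order["amount"]),  # compress large amounts
    }


features = [featurize(order) for order in raw_orders]
for row in features:
    print(row)
```

The timestamp string and raw amount are not directly useful to most models, while the hour, weekend flag, and log-scaled amount are.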
DML has a component called “Feature Store” that allows users to “catalog ML features and make them available for training and serving, increasing reuse.” (4)
Microsoft’s documentation goes on to state that a feature store serves as a “centralized repository that enables data scientists to find and share features and also ensures that the same code used to compute the feature values is used for model training and inference.”
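The benefit of a centralized repository is easier to see in a sketch. The `TinyFeatureStore` class below is a hypothetical in-memory stand-in, not the Databricks Feature Store API; the table name, keys, and values are invented. It shows feature values being written once and then read through the same lookup for both training and inference.

```python
class TinyFeatureStore:
    """Hypothetical in-memory stand-in for a feature store (not the Databricks API)."""

    def __init__(self):
        self._tables = {}

    def write_table(self, name, rows, key):
        # Index rows by primary key so lookups by key are direct.
        self._tables[name] = {row[key]: row for row in rows}

    def lookup(self, name, key_value, feature_names):
        row = self._tables[name][key_value]
        return [row[f] for f in feature_names]


store = TinyFeatureStore()
store.write_table(
    "customer_features",
    rows=[
        {"customer_id": 1, "orders_30d": 4, "avg_basket": 52.0},
        {"customer_id": 2, "orders_30d": 0, "avg_basket": 0.0},
    ],
    key="customer_id",
)

FEATURES = ["orders_30d", "avg_basket"]

# Training: join labels to stored features by customer key.
labels = {1: 1, 2: 0}
training_set = [
    (store.lookup("customer_features", cid, FEATURES), label)
    for cid, label in labels.items()
]

# Inference: the same lookup, so the model sees identically computed values.
inference_row = store.lookup("customer_features", 1, FEATURES)
print(training_set, inference_row)
```

Since both paths read the same stored values, there is no second copy of the feature logic that could drift out of sync between training and serving.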
In short, DML has excellent feature development and management capabilities right out of the box.
4. Model Serving
The fourth and final Databricks Machine Learning feature we’re going to highlight in this article is “model serving.” DML has a capability called “MLflow Model Registry” that “is a centralized model repository and a UI and set of APIs that enable you to manage the full lifecycle of MLflow Models.” (5)
The MLflow Model Registry provides a handful of useful functionalities, like chronological model lineage, model serving, model versioning, webhooks to trigger actions, and more. It also supports various languages, including Java, Python, and R.
Here’s an example from Microsoft on how the MLflow Model Registry works. And for those interested, Microsoft has example MLflow Model Registry notebooks here.
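To make model versioning and lifecycle management more tangible, here is a minimal in-memory sketch. `TinyModelRegistry`, its methods, the model name, and the artifact file names are all hypothetical and are not MLflow's API; the stage names and the rule that only one version is in Production at a time are assumptions made for this illustration.

```python
from collections import defaultdict


class TinyModelRegistry:
    """Hypothetical in-memory registry; names are illustrative, not MLflow's API."""

    def __init__(self):
        self._versions = defaultdict(list)  # model name -> list of version records

    def register(self, name, artifact):
        # Versions are numbered 1, 2, 3, ... in registration order.
        version = len(self._versions[name]) + 1
        self._versions[name].append(
            {"version": version, "artifact": artifact, "stage": "None"}
        )
        return version

    def transition(self, name, version, stage):
        # Assumption for this sketch: one Production version per model,
        # so the previous Production version is archived first.
        if stage == "Production":
            for record in self._versions[name]:
                if record["stage"] == "Production":
                    record["stage"] = "Archived"
        self._versions[name][version - 1]["stage"] = stage

    def latest(self, name, stage):
        matches = [r for r in self._versions[name] if r["stage"] == stage]
        return matches[-1] if matches else None


registry = TinyModelRegistry()
v1 = registry.register("churn_model", artifact="model-v1.pkl")
v2 = registry.register("churn_model", artifact="model-v2.pkl")
registry.transition("churn_model", v1, "Production")
registry.transition("churn_model", v2, "Production")  # v1 is archived here

serving = registry.latest("churn_model", "Production")
print(serving["version"], serving["artifact"])
```

A serving layer that always asks the registry for the current Production version picks up a new model as soon as it is promoted, without any change to the serving code.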
Thanks for reading! Have you worked with DML?
Thanks for reading this article – we hope you found it valuable! If you have any questions or thoughts, please feel free to leave a comment.
And now, a question for you – do you have experience with Databricks Machine Learning? If so, what have your experiences been?
1. Microsoft Docs – What is Azure Databricks?
2. Microsoft Docs – What is Databricks Machine Learning?
3. Microsoft Docs – Track Machine Learning Training Runs
4. Microsoft Docs – Databricks Feature Store
5. Microsoft Docs – MLflow Model Registry on Azure Databricks
Key2 Consulting is a boutique data analytics consultancy that helps business leaders make better business decisions. We are a Microsoft Gold-Certified Partner and are located in Atlanta, Georgia. Learn more here.