By: Kash Sabba
Databricks provides a unified analytics platform in the cloud. Here at Key2 Consulting we have written several articles on the platform to date, including a quick overview of Databricks, a detailed explanation of how to boost query performance using Databricks and Spark, and a look at using Azure Databricks Secret Scopes.
For this blog post, I’m going to discuss Databricks Utilities (dbutils), a utility tool for basic data file handling and data manipulation within Databricks Notebooks. Let’s look at four useful functionalities “dbutils” provides.
1. List Mount Points and Files
Databricks Utilities can show all the mount points within a Databricks Workspace. In a Python Notebook, “dbutils.fs.mounts()” returns every mount point within the Workspace, and wrapping the call in the “display” function renders the result as a table of rows and columns.
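As a minimal sketch of the call described above: inside a Databricks Notebook, “dbutils” and “display” are already defined. The helper name “mount_table” is hypothetical, and it takes “dbutils” as a parameter only so that it can also be exercised outside a Notebook.

```python
def mount_table(dbutils):
    """Return sorted (mount point, source) pairs for every mount in the workspace."""
    # Each entry returned by dbutils.fs.mounts() exposes mountPoint and source.
    return sorted((m.mountPoint, m.source) for m in dbutils.fs.mounts())


# In a notebook, where dbutils and display are already defined:
#   display(dbutils.fs.mounts())        # render the mounts as a table
#   for mount_point, source in mount_table(dbutils):
#       print(mount_point, "->", source)
```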
The utility can also list all the folders and files within a specific mount point. For instance, “dbutils.fs.ls("/mnt/location")” returns every directory and file directly under that mount point location.
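A short sketch of listing a mount point, assuming the “/mnt/location” path from the text exists in your Workspace. The helper “list_entries” is a hypothetical name. It relies only on the “name” and “size” fields that “dbutils.fs.ls()” returns for each entry.

```python
def list_entries(dbutils, path):
    """Return (name, size in bytes) for each entry directly under path."""
    # dbutils.fs.ls returns one FileInfo per entry, with path, name and size.
    return [(entry.name, entry.size) for entry in dbutils.fs.ls(path)]


# In a notebook:
#   display(dbutils.fs.ls("/mnt/location"))
#   list_entries(dbutils, "/mnt/location")
```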
Databricks Utilities can also list the files inside a nested directory or sub-directory. Passing the full path of that directory to “dbutils.fs.ls()” returns only the files it contains.
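One possible extension of this idea is to walk every nested sub-directory in a single call. The generator below is a sketch, not a built-in “dbutils” function. The name “list_files_recursive” is hypothetical, and it assumes each entry returned by “dbutils.fs.ls()” offers an “isDir()” method and a “path” field.

```python
def list_files_recursive(dbutils, path):
    """Yield the full path of every file under path, descending into sub-directories."""
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            # Recurse into the sub-directory instead of yielding it.
            yield from list_files_recursive(dbutils, entry.path)
        else:
            yield entry.path


# In a notebook:
#   for file_path in list_files_recursive(dbutils, "/mnt/location"):
#       print(file_path)
```

Each level of the walk issues its own listing call, so very deep or very wide directory trees can take a while to enumerate.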
2. Read Files
The utility can pull the first few records of a file using the “head” function. “dbutils.fs.head()” accepts an optional number-of-bytes parameter that limits how much data is printed out. Passing 1000, for example, prints out the first 1000 bytes of a csv file.
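A minimal sketch of the “head” call follows. The wrapper name “preview” and the file path in the usage comment are placeholders. The second positional argument is the byte limit.

```python
def preview(dbutils, path, max_bytes=1000):
    """Return at most max_bytes from the start of the file as a string."""
    # The second positional argument caps how many bytes are read.
    return dbutils.fs.head(path, max_bytes)


# In a notebook (the path is a placeholder):
#   print(preview(dbutils, "/mnt/location/data.csv", 1000))
```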
In addition to the utility, Python file APIs can also be used to read file contents.
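The sketch below uses ordinary Python file I/O. It assumes the cluster exposes DBFS through the local “/dbfs” path, so a file at “dbfs:/mnt/location/data.csv” would be read as “/dbfs/mnt/location/data.csv”. Both the path and the helper name “read_first_lines” are placeholders.

```python
from itertools import islice


def read_first_lines(local_path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(local_path, "r", encoding="utf-8") as handle:
        # islice stops after n lines, so large files are not read in full.
        return [line.rstrip("\n") for line in islice(handle, n)]


# In a notebook (assumes the /dbfs local mount is available):
#   read_first_lines("/dbfs/mnt/location/data.csv", 5)
```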
3. Create Directories and Files
The utility can be used to create new directories and add new files/scripts within the newly created directories. For example, “dbutils.fs.mkdirs()” can create a new directory called “scripts” within the “dbfs” file system.
A bash script that installs a few libraries can then be added to the newly created directory using the “dbutils.fs.put()” command.
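The two steps above can be sketched together. Everything specific here is an assumption made for illustration: the “dbfs:/scripts” location, the file name “install_libs.sh”, the packages named in the script body, and the helper name “write_script”.

```python
# Placeholder script body: the packages named here are illustrative only.
INSTALL_SCRIPT = """#!/bin/bash
pip install requests
pip install pyyaml
"""


def write_script(dbutils, directory="dbfs:/scripts", name="install_libs.sh",
                 contents=INSTALL_SCRIPT):
    """Create the directory if needed, write the script into it, return its path."""
    dbutils.fs.mkdirs(directory)  # also creates any missing parent directories
    target = f"{directory.rstrip('/')}/{name}"
    # The third argument (True) overwrites an existing file with the same name.
    dbutils.fs.put(target, contents, True)
    return target


# In a notebook:
#   script_path = write_script(dbutils)
#   print(dbutils.fs.head(script_path))  # confirm what was written
```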
4. Create Widgets
The utility can also be used to create Widgets in Notebooks. The first step is to create a Spark dataframe from the csv file used in one of the examples above.
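A sketch of loading the csv file into a Spark dataframe follows. It assumes the file has a header row, and the path in the usage comment is a placeholder. In a Notebook, “spark” is already defined. The helper name “load_csv” is hypothetical.

```python
def load_csv(spark, path):
    """Read a CSV file with a header row into a Spark DataFrame."""
    # header=True uses the first row as column names;
    # inferSchema=True asks Spark to detect column types from the data.
    return spark.read.csv(path, header=True, inferSchema=True)


# In a notebook, where spark and display are already defined:
#   df = load_csv(spark, "/mnt/location/data.csv")
#   display(df)
```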
Using the distinct values of one of the dataframe’s columns, a Widget can be created as a dropdown item at the top of a Notebook. “dbutils.widgets.dropdown()” creates the widget from a list of choices, which here is built from a distinct select query.
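One way to build such a dropdown is sketched below. The helper name “create_dropdown” is hypothetical. The widget is given the same name and label as the column, and the first sorted value is used as the default, both of which are choices made only for this example.

```python
def create_dropdown(dbutils, df, column, widget_name):
    """Create a dropdown widget whose choices are the distinct values of column."""
    rows = df.select(column).distinct().collect()
    # Widget choices must be strings; skip nulls and sort for a stable order.
    choices = sorted(str(row[0]) for row in rows if row[0] is not None)
    if not choices:
        raise ValueError(f"column {column!r} has no non-null values")
    # Arguments: widget name, default value, list of choices, label.
    dbutils.widgets.dropdown(widget_name, choices[0], choices, widget_name)
    return choices


# In a notebook, with df loaded from the csv file:
#   create_dropdown(dbutils, df, "weekday", "weekday")
```

Because the choices are sorted as strings, a value such as “10” would sort before “2”. That does not matter for single-digit weekday values.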
A filter query can then be written using the value selected in the dropdown. “dbutils.widgets.get()” returns the current widget value, which can be used in a filter query. With the dropdown set to 5, the query and the resulting table show only the rows where weekday is 5.
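The filter step might look like the sketch below. It assumes the widget and the column are both named “weekday” and that the column holds integers. The helper name “filter_by_widget” is hypothetical.

```python
def filter_by_widget(dbutils, df, column="weekday", widget_name="weekday"):
    """Keep only the rows whose column equals the current widget value."""
    value = dbutils.widgets.get(widget_name)  # widget values are always strings
    # int() converts the string and rejects anything that is not a whole number.
    return df.filter(f"{column} = {int(value)}")


# In a notebook:
#   display(filter_by_widget(dbutils, df))
```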
If the widget value is changed from the dropdown, the table updates to reflect the new value. For example, if the widget value is changed to 2, the table shows only the rows where weekday is 2.
Databricks Utilities (“dbutils”) provides a convenient command-line-style tool for easy data and file manipulation. It can provide great value when used in Databricks Notebooks for different applications, such as data engineering and machine learning.
Key2 Consulting is a boutique data analytics consultancy that helps business leaders make better business decisions. We are a Microsoft Gold-Certified Partner and are located in Atlanta, Georgia.