By: Brad Harris, Key2 Consulting
 
 

What is Apache Airflow?

Apache Airflow is an open-source platform for authoring, scheduling, and monitoring workflows. In layman's terms, I like to think of the platform as a Microsoft SQL Server Agent job on steroids.

While SQL Server Agent jobs can be used to schedule and monitor complex workflows, Apache Airflow is open source and cuts ties with Microsoft products, opening up your world to scheduling and maintaining workflows initiated from any platform of your choosing.
 

How Does Apache Airflow Work?

Apache Airflow is an orchestrator for a multitude of different workflows. It is written in Python and was originally developed at Airbnb before entering the Apache Software Foundation Incubator program in March 2016. It was announced as a Top-Level Project in January 2019.

The platform uses Directed Acyclic Graphs (DAGs) to author workflows. A DAG is the foundation of your workflow and captures the dependencies that require one job step to complete before another job step starts.

Using an acyclic graph rather than a cyclic graph guarantees that your workflow has a definite start and finish, whereas a cyclic graph has no definite end.

When using Airflow, users compose their DAGs in Python by setting specific properties on the DAG object. Users also define their tasks within the DAG and set dependencies between those tasks. The DAG is then imported into Airflow simply by adding the file to the DAGs folder.
 

Example DAG in Python Code

Here’s a DAG I recently wrote.
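A simplified sketch along those lines, with a placeholder DAG id, schedule, and task callables (using Airflow 2.x-style imports):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in a real DAG these would extract and load data.
def extract():
    print("extracting data")

def load():
    print("loading data")

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_etl_dag",        # placeholder name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",      # run once per day
    default_args=default_args,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The load step only starts after extract completes successfully.
    extract_task >> load_task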

 

DAG View in Apache Airflow GUI

 

 

Apache Airflow Implementation

One of the advantages of using DAGs to represent a workflow is that they are written in Python code. This allows them to be source-controlled and integrated into a development/production deployment model.

In a typical Apache Airflow environment, you would develop your DAGs within a test environment. Once developed, you can save those DAGs to your development environment's DAGs folder and verify from the command line that they parse. If everything parses properly, you at least know that your DAG and your Airflow environment are sound.
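One way to run that check (a minimal sketch, assuming your DAG files live in a dags/ folder) is to load the folder with Airflow's DagBag and confirm that nothing failed to import:

from airflow.models import DagBag

# Load every DAG file in the folder; skip Airflow's bundled example DAGs.
dag_bag = DagBag(dag_folder="dags/", include_examples=False)

# import_errors maps each failing file to its Python traceback.
assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"

print(f"Loaded {len(dag_bag.dags)} DAG(s) successfully.")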

The DAGs can then be committed back to your master branch for deployment to production. Deployment itself is as easy as setting up a workflow that pulls the latest DAGs from your source control repository and replaces the previous versions in the production DAGs folder.

Once you have created your DAGs and deployed them to production, Apache Airflow takes care of the rest. DAGs can be viewed through the Airflow GUI, which is served by the Apache Airflow web server component.

The web server offers much more than DAG visualization. Within the GUI you can manage custom connections, variables, and plugins that can be used to pass information to your DAGs when they run.
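For example (a brief sketch using Airflow 2.x-style imports; the variable name and connection id below are hypothetical), a task can read those values at run time:

from airflow.hooks.base import BaseHook
from airflow.models import Variable

def load_to_warehouse():
    # Pull a value stored under Admin -> Variables in the GUI.
    target_schema = Variable.get("target_schema", default_var="staging")

    # Pull connection details stored under Admin -> Connections.
    conn = BaseHook.get_connection("my_warehouse")  # hypothetical connection id
    print(f"Loading into {target_schema} on host {conn.host}")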

One of the other essential components of Airflow is the Airflow Scheduler. The scheduler is responsible for knowing which jobs need to run and when those tasks need to be kicked off. The scheduler can run tasks sequentially or, ideally in a production setup, distribute them to worker nodes for some degree of parallelism.
 

Apache Airflow Installation

Apache Airflow is a lightweight platform that can be installed through various methods. For the purposes of setting up a sandbox environment, it can be easily installed on a Mac, on Windows using WSL, or by setting up a Linux virtual environment.

However, there are far better ways to install Apache Airflow when considering the platform for a production environment.

One of the most popular methods for installing Airflow is to use an Airflow Docker container. Docker has become popular for running Airflow because a container is comparable to a virtual machine that doesn't require a hypervisor. And since Apache Airflow is such a popular Apache project and is constantly updated from development commits, Docker lets you stay on top of current Airflow releases by updating your container image instead of updating the individual Airflow components.

Another way to implement Airflow in production is to use a managed provider like Astronomer.io. If your company is heavily invested in Amazon Web Services or Microsoft Azure, Astronomer.io can deliver managed Airflow instances to your cloud environment. This eliminates the task of managing the Airflow installation altogether.

To install, you tell Astronomer.io what kind of Airflow environment you are looking for and it takes care of deploying nodes to your cloud environment. If you are just looking to take Airflow for a test drive, I would recommend going with a simple Linux installation or the Docker route.
 

Why Use Apache Airflow Over Other Tools?

There are of course other tools available that can manage data pipelines and workflows, but adoption rates make it apparent that Airflow has become a preferred method for managing data workflows. With its vast community of contributors, chances are that if you are tasked with a particular data engineering problem, someone in the community has already solved it.

Other tools commonly used to manage workflows include Luigi, Jenkins, and Apache NiFi. These tools do provide a way to manage and develop workflows, but in the end they don't pass muster when compared to Apache Airflow. Airflow is superior to these tools in the areas of monitoring, customization, data lineage, and extensibility.

Monitoring – The Airflow GUI provides a robust capability to monitor your workflows. Using the web GUI, you can visualize workflows as they move from step to step, dig into charts showing stats from previous job runs, and drill down into the underlying logs. The ability to send notifications for jobs is also available right out of the box.

Customization – Airflow comes with components right out of the box that let you not only get started quickly but also create custom operators and plugins.

Data Lineage – It can’t be said enough that the Airflow web server GUI is the core component behind Airflow. The Airflow GUI provides the user with the ability to visualize the origins of their data and what happens to it over time.

Extensibility – Apache Airflow is written in Python, so enough said. Python has been touted as the language for data engineers, so it goes without saying that Apache Airflow can be easily extended by writing custom operators and plugins that add GUI menus.
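As a quick illustration of that extensibility, here is a minimal sketch of a custom operator (the class name and parameter are made up for the example, using the Airflow 2.x import path):

from airflow.models.baseoperator import BaseOperator

class GreetOperator(BaseOperator):
    # A toy operator that logs a greeting; a real operator would call an API, run SQL, etc.

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # execute() is what Airflow calls when the task instance runs.
        self.log.info("Hello, %s!", self.name)
        return self.name

# Used inside a DAG definition like any built-in operator:
# greet = GreetOperator(task_id="greet", name="Key2")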
 

My Use Case

I started playing around with Apache Airflow six months ago and, as a new user, I was unclear about its use. Once I started experimenting with it, I began to realize its massive potential!

Since the technology’s name is Apache Airflow, I automatically started to think that it was some new ETL tool that managed data as it passed from a source to a destination. I soon realized, however, that it was more of a job scheduler and a way to monitor workflows that you have created elsewhere.

On a recent client project, I was looking for a tool that could help me manage some Talend workflows. We were working with the open-source version of Talend data integration at the time due to budget constraints.

The open-source version of Talend did not come with a way to manage the workflows that we created, and we were forced to use Windows Scheduler to manage the scheduling and monitoring of when the jobs kicked off and finished.

As we all know, though, Windows Scheduler is very limited in its ability to handle complex workflows, so we used it simply to schedule the jobs and then pushed much of the logging and monitoring of our workflows into the jobs themselves.

This meant adding a lot of additional code to our Talend workflows to capture when jobs started and stopped, when they were taking too long, or when they failed altogether.

At this point I started looking at Apache Airflow as an alternative. Unfortunately, my team never got to a point where we could fully implement the solution, but I hope I can take you on my journey of exploration to see what Apache Airflow has to offer, especially for those looking for a better way to author, schedule, and monitor their complex workflows.
 

 

 


Key2 Consulting is a data warehousing and business intelligence company located in Atlanta, Georgia. We create and deliver custom data warehouse solutions, business intelligence solutions, and custom applications.