How to use Apache Airflow CLI with Amazon MWAA

Photo by John Schnoon on Unsplash

Amazon MWAA (Managed Workflow for Apache Airflow) was released by AWS at the end of 2020. This brand new service provides a managed solution to deploy Apache Airflow in the cloud, making it easy to build and manage data processing workflows in AWS.

MWAA enables automatic deployment of all the infrastructure and configuration for Airflow Web Server, Scheduler, Workers, Metadata Database and also the Celery executor combined with SQS, to manage jobs dispatching. Users just need to setup an S3 bucket for DAGs, plugins and Python dependencies (via requirements.txt) and associate its content with the MWAA environment.

Authentication is also managed by AWS — native integration with IAM and resources can be deployed inside a private VPC for additional security.

Airflow process

Example of an Amazon MWAA architecture deployed inside a VPC

Since all the resources are deployed by AWS, developers don’t have access to the underlying infrastructure. Consequently, the main interface used by Data Engineers is the Airflow UI, which is available via public URL or VPC endpoint, depending on the deployment type selected (public or private network).

Airflow example complex

Example of DAG managed via Airflow UI

However, Airflow UI is not the only option for interacting with your environment; MWAA also provides support to the Airflow CLI. This is a useful option if you want to automate operations to monitor or trigger your DAGs, and in this post I explain how you can best make use of Airflow CLI from an MWAA environment.

The following content is suitable for those already familiar with the benefits and functionality of Apache Airflow. If you are new to Airflow, or you’re searching for more insights about the advantages of using Amazon MWAA compared to hosting your own environment, I recommend you explore earlier posts on this subject before reading on.

The Apache Airflow CLI and its use with Amazon MWAA

 

Airflow CLI is an interesting maintenance alternative within MWAA, since it allows Data Engineers to create scripts to automate otherwise manual/ repetitive tasks.

In MWAA, not all commands are supported because as developers we cannot perform operations that might impact server resources or user management (e.g. webserver, scheduler, worker, etc), but all commands related to monitoring, processing and testing DAGs are supported in the current version.

To check the full list of supported and unsupported commands, refer to the official User Guide. At the time of writing, this is the status of different commands:

List of supported commands

 

  • backfill
  • clear
  • dag_state
  • delete_dag
  • list_dag_runs
  • list_dags
  • list_tasks
  • next_execution
  • pause
  • pool
  • render
  • run
  • show_dag
  • task_failed_deps
  • task_state
  • test
  • trigger_dag
  • unpause
  • variables
  • version

List of unsupported commands

 

  • connections
  • create_user
  • delete_user
  • flower
  • initdb
  • kerberos
  • list_users
  • resetdb
  • rotate_fernet_key
  • scheduler
  • serve_logs
  • shell
  • sync_perm
  • upgradedb
  • webserver
  • worker

How to trigger a CLI command from Amazon MWAA

 

  1. Authenticate your AWS account via AWS CLI;
  2. Get a CLI token and the MWAA web server hostname via AWS CLI;
  3. Send a post request to your MWAA web server forwarding the CLI token and Airflow CLI command;
  4. Check the response, parse the results and decode the output.

This sounds complicated but is actually a fairly straightforward process. Let’s deep dive and investigate each step in detail.

1. Authenticate to your AWS account

 

If you are not used to this process, read the AWS CLI User Guide, which explains how you can configure a profile in your AWS CLI and grant access to your accounts.

2. Get CLI token and MWAA web server hostname via AWS CLI

 

aws mwaa create-cli-token --name $MWAA_ENVIRONMENT

Remember to assign the name of your MWAA environment by exporting the environment variable $MWAA_ENVIRONMENT.

export MWAA_ENVIRONMENT=my_environment_name

If the command is successfully executed, you should receive a JSON response with two attributes:

{
  "CliToken" : "",
  "WebServerHostname" : ""
}

Parse the results and store them in other environment variables for later use. I suggest the following names:

  • $CLI_TOKEN
  • $WEB_SERVER_HOSTNAME

3. Send a post request to your MWAA web server forwarding the CLI token and Airflow CLI command

 

curl \
  --request POST "https://$WEB_SERVER_HOSTNAME/aws_mwaa/cli" \
  --header "Authorization: Bearer $CLI_TOKEN" \
  --header "Content-Type: text/plain" \
  --data-raw "$AIRFLOW_CLI_COMMAND"

Notice we assigned the environment variables acquired from the previous step ($CLI_TOKEN and $WEB_SERVER_HOSTNAME) and also published a third variable with the name $AIRFLOW_CLI_COMMAND. In this variable we send the Airflow command to be performed by the CLI, for example, if you want to execute the following command:

airflow list_dags

The variable $AIRFLOW_CLI_COMMAND should be filled with:

list_dags

Important note: if your MWAA environment is published in a private network you can’t perform the curl request via public internet; a VPN must be used to establish the connection between your local machine and the VPC endpoint, or you may need to execute this command from another computing resource placed inside of the same VPC.

4. Check the response, parse the results and decode the output

 

{
  "stderr" : "",
  "stdout" : ""
}

Notice both attribute values are encoded in Base64. Remember to decode the results to collect the final output from Airflow CLI. A simple way to achieve that is by using the command:

base64 -d

Automating all the steps with a single script

 

This is a shell script created for Unix based operational systems (e.g. Linux, MacOS); if you are running Windows you may need to adapt this content or run the script through Windows WSL (Windows Subsystem for Linux).

Notice I am using jq to parse the JSON responses from AWS CLI and the curl request to MWAA, but feel free to adapt the code if you prefer another approach.

The script above collects all the arguments and send it to the curl request by using the variable $*. So, if we name this script as airflow-cli.sh and you type the following command in your terminal:

airflow-cli.sh list_dag_runs my_dag_name

The MWAA environment will perform the following CLI command:

airflow list_dag_runs my_dag_name

An interesting trick to improve the user experience is to rename this script as airflow and copy it to one of the folders mapped in the local $PATH (e.g. /usr/local/bin/airflow). In this way you can call the commands in the Airflow CLI by typing:

airflow <arguments>

Just ensure you don’t have the real Airflow CLI installed, to avoid conflicts.

If you already have Airflow CLI installed, another option is to run this script from a Docker image and map it to the container local path.

Conclusion

 

The Airflow UI continues to be the primary means of interaction with MWAA, but the additional option of using Airflow CLI allows advanced users to maximise the benefits of MWAA, and take advantage of all the features of a hosted solution.

I hope you enjoyed this content and make good use of this script in your Amazon MWAA environment!

References:

At DNX Solutions, we work to bring a better cloud and application experience for digital-native companies in Australia.

Our current focus areas are AWS, Well-Architected Solutions, Containers, ECS, Kubernetes, Continuous Integration/Continuous Delivery and Service Mesh.

We are always hiring cloud engineers for our Sydney office, focusing on cloud-native concepts.

Check our open-source projects at https://github.com/DNXLabs and follow us on Twitter, Linkedin or Facebook.

Stay informed on the latest
insights and tech-updates

No spam - just releases, updates, and tech information.