Tuesday, October 16, 2018

Apache Airflow - Runbook

As an alternative scheduler, we tried Apache Airflow for scheduling Spark jobs.
Because of a known issue with Kerberos and Python 3 (see below), we had to install Python 2.
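
One common way to hand a Spark job to a scheduler is as a spark-submit shell command. The sketch below only builds such a command string. It is not taken from our setup: the helper name spark_submit_command, the job path, and the default master and deploy mode are assumptions for illustration. The import fallback is there because this install had to run on Python 2.

```python
try:
    from shlex import quote  # Python 3
except ImportError:
    from pipes import quote  # Python 2, which this install had to use


def spark_submit_command(app_path, master="yarn", deploy_mode="cluster", app_args=()):
    """Build a shell-safe spark-submit command string.

    Hypothetical helper: the defaults are assumptions, adjust them to your cluster.
    """
    parts = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app_path]
    parts.extend(app_args)
    # Quote every part so paths or arguments with spaces survive the shell.
    return " ".join(quote(str(part)) for part in parts)


if __name__ == "__main__":
    # Hypothetical job path and arguments.
    print(spark_submit_command("/opt/jobs/example_job.py", app_args=["--date", "2018-10-16"]))
```

In Airflow 1.9 a string like this could be passed as the bash_command of a BashOperator (airflow.operators.bash_operator) inside a DAG.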

I really like Airflow, but as of 1.9 it doesn't handle user propagation very well. Multi-tenancy isn't supported in a fully secure way: in some cases, users can execute jobs as other users.

The runbook for installing Airflow follows.
The Ansible playbook we used is here:
https://github.com/infOpen/ansible-role-airflow


Apache Airflow Runbook

