Copy/paste the following commands to install and configure Apache Airflow.
Install required system packages
extra yum packages required for airflow-kerberos / sasl:
https://stackoverflow.com/a/45054297
yum install -y gcc-c++ python-devel cyrus-sasl-devel wget bzip2 krb5-devel
|
Setup airflow user
System variables for airflow
AIRFLOW_BASE=/opt/airflow
AIRFLOW_USER=airflow
|
Create the airflow user account
adduser $AIRFLOW_USER
|
Specify a directory to be used by airflow
mkdir $AIRFLOW_BASE
chown -R airflow:airflow $AIRFLOW_BASE
|
root will need to create the /var/log/airflow directory
mkdir /var/log/airflow
chown -R airflow:airflow /var/log/airflow
|
Download and install anaconda, create a conda environment
Anaconda install as airflow user
su - airflow
cd /opt/airflow
|
Download Anaconda
wget https://repo.continuum.io/archive/Anaconda2-5.1.0-Linux-x86_64.sh
|
Install Anaconda
bash Anaconda2-5.1.0-Linux-x86_64.sh -u -b -p /opt/airflow/anaconda2
|
Prepend the Anaconda2 install location to PATH in the airflow user's ~/.bashrc
echo 'export PATH=/opt/airflow/anaconda2/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
|
Create a conda/anaconda virtual environment
https://conda.io/docs/user-guide/tasks/manage-environments.html
conda create --name airflow --offline
|
To activate this environment, use:
source activate airflow
|
To deactivate an active environment, use:
source deactivate
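Optionally, confirm the new environment exists:
conda env list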
|
Install Airflow
These should be copied & pasted from the vars in the root section
If these don't match, there will be problems.
AIRFLOW_BASE=/opt/airflow
AIRFLOW_USER=airflow
|
We append an app/ subdirectory to AIRFLOW_BASE to form AIRFLOW_HOME
AIRFLOW_HOME=$AIRFLOW_BASE/app
|
Set airflow home path
export AIRFLOW_HOME=$AIRFLOW_HOME
echo export AIRFLOW_HOME=$AIRFLOW_HOME >> ~/.bashrc
cd $AIRFLOW_BASE
|
SQLAlchemy 1.2 introduced a backwards-incompatible change that breaks the user.password setter in Airflow's password auth backend, so we pin to < 1.2 to work around it.
source: https://stackoverflow.com/questions/48075826
pip install --upgrade pip
pip install 'sqlalchemy<1.2' cryptography
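Optionally, confirm the pinned version was picked up (run inside the activated airflow environment):
python -c 'import sqlalchemy; print(sqlalchemy.__version__)'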
|
For more installation options, see documentation at:
https://airflow.apache.org/installation.html
pip install 'apache-airflow[password,ldap,kerberos,hive,hdfs,jdbc]'
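A quick sanity check before moving on (note this will also generate a default airflow.cfg under $AIRFLOW_HOME if one does not exist yet):
airflow version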
|
Initialize the airflow database & config files
airflow initdb
|
Configure Airflow
Setup some variables:
export AIRFLOW_HOSTNAME=host.domain.com
export AIRFLOW_SSL_PREFIX=${AIRFLOW_HOSTNAME}
|
Change the base log folder to /var/log/airflow
sed -i "/^base_log_folder/ s|.*|base_log_folder = /var/log/airflow|" $AIRFLOW_HOME/airflow.cfg
sed -i "/^child_process_log_directory/ s|.*|child_process_log_directory = /var/log/airflow/scheduler|" $AIRFLOW_HOME/airflow.cfg
|
Enable authentication in airflow.cfg
sed -i -e 's/authenticate = False/authenticate = True\nauth_backend = airflow.contrib.auth.backends.ldap_auth/g' $AIRFLOW_HOME/airflow.cfg
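Optionally, confirm the change took effect:
grep -E '^(authenticate|auth_backend)' $AIRFLOW_HOME/airflow.cfg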
|
Enable multi-tenancy in airflow.cfg
sed -i -e 's/filter_by_owner = False/filter_by_owner = True/g' $AIRFLOW_HOME/airflow.cfg
|
Configure separate directory for Airflow DAGs, into which per-user sub-directories will be created:
mkdir ${AIRFLOW_BASE}/airflow-dags
# mkdir ${AIRFLOW_BASE}/airflow-dags/USER1
# root user will need to `chown USER1 ${AIRFLOW_BASE}/airflow-dags/USER1`
sed -i "/^dags_folder/ s|.*|dags_folder = ${AIRFLOW_BASE}/airflow-dags|" $AIRFLOW_HOME/airflow.cfg
|
Enable Kerberos support and configure keytab for airflow user in airflow.cfg
sed -i '/^security/ s/.*/security = kerberos/' $AIRFLOW_HOME/airflow.cfg
sed -i '/^keytab/ s|.*|keytab = /etc/security/keytabs/airflow.keytab|' $AIRFLOW_HOME/airflow.cfg
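The keytab itself must be created and placed there separately (e.g. by your Kerberos admin); assuming a principal such as airflow@EXAMPLE.COM (replace with your realm), you can verify it with:
kinit -kt /etc/security/keytabs/airflow.keytab airflow@EXAMPLE.COM
klist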
|
Configure SSL certs:
sed -i -e "s|web_server_ssl_cert =|web_server_ssl_cert = /etc/pki/tls/certs/${AIRFLOW_SSL_PREFIX}.crt|" $AIRFLOW_HOME/airflow.cfg
sed -i -e "s|web_server_ssl_key =|web_server_ssl_key = /etc/pki/tls/private/${AIRFLOW_SSL_PREFIX}.key|" $AIRFLOW_HOME/airflow.cfg
|
Configure URLs used by Airflow, to reflect proper host name, port, and protocol:
sed -i "/^endpoint_url/ s|.*|endpoint_url = https://${AIRFLOW_HOSTNAME}:8443|" $AIRFLOW_HOME/airflow.cfg
sed -i "/^base_url/ s|.*|base_url = https://${AIRFLOW_HOSTNAME}:8443|" $AIRFLOW_HOME/airflow.cfg
sed -i 's/8080$/8443/g' $AIRFLOW_HOME/airflow.cfg
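Optionally, verify the URL and port settings now read as expected:
grep -E '^(endpoint_url|base_url|web_server_port)' $AIRFLOW_HOME/airflow.cfg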
|
LDAPS config for AD domain auth in $AIRFLOW_HOME/airflow.cfg:
[ldap]
uri = ldaps://company.com:636
user_filter = objectClass=*
user_name_attr = sAMAccountName
group_member_attr = memberOf
superuser_filter = memberOf=CN=grouphere,OU=Groups,OU=Boston,DC=company,DC=com
data_profiler_filter = memberOf=CN=grouphere,OU=Groups,DC=company,DC=com
bind_user = CN=user,OU=BindUsers,OU=,DC=company,DC=com
bind_password = Password1
basedn = OU=Users,DC=COMPANY,DC=COM
cacert = /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt
search_scope = SUBTREE
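Before restarting anything, it can help to test the bind user, password, and basedn outside of Airflow with ldapsearch (from the openldap-clients package); the sAMAccountName below is a placeholder:
ldapsearch -H ldaps://company.com:636 -D 'CN=user,OU=BindUsers,OU=,DC=company,DC=com' -w 'Password1' -b 'OU=Users,DC=COMPANY,DC=COM' '(sAMAccountName=someuser)' dn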
|
Running Airflow with systemd
For a production system, all system services should be run with systemd.
https://github.com/apache/incubator-airflow/tree/master/scripts/systemd
As root: setup variables to match the above settings
AIRFLOW_BASE=/opt/airflow
AIRFLOW_USER=airflow
AIRFLOW_HOME=${AIRFLOW_BASE}/app
AIRFLOW_SCHEDULER_SCRIPT=${AIRFLOW_BASE}/anaconda2/bin/anaconda_airflow_scheduler.sh
AIRFLOW_WEBSERVER_SCRIPT=${AIRFLOW_BASE}/anaconda2/bin/anaconda_airflow_webserver.sh
AIRFLOW_KERBEROS_SCRIPT=${AIRFLOW_BASE}/anaconda2/bin/anaconda_airflow_kerberos.sh
|
As root: Create some helper scripts to run Airflow inside the Anaconda environment, for systemd to run:
cat > ${AIRFLOW_SCHEDULER_SCRIPT} << EOF
#!/bin/bash
export PATH=${AIRFLOW_BASE}/anaconda2/bin:\$PATH
export AIRFLOW_HOME=${AIRFLOW_BASE}/app
export SPARK_MAJOR_VERSION=2
source ${AIRFLOW_BASE}/anaconda2/bin/activate airflow
${AIRFLOW_BASE}/anaconda2/bin/airflow scheduler
EOF
chmod a+x ${AIRFLOW_SCHEDULER_SCRIPT}
chown ${AIRFLOW_USER}:${AIRFLOW_USER} ${AIRFLOW_SCHEDULER_SCRIPT}
|
cat > ${AIRFLOW_WEBSERVER_SCRIPT} << EOF
#!/bin/bash
export PATH=${AIRFLOW_BASE}/anaconda2/bin:\$PATH
export AIRFLOW_HOME=${AIRFLOW_BASE}/app
export SPARK_MAJOR_VERSION=2
source ${AIRFLOW_BASE}/anaconda2/bin/activate airflow
${AIRFLOW_BASE}/anaconda2/bin/airflow webserver --pid /run/airflow/webserver.pid
EOF
chmod a+x ${AIRFLOW_WEBSERVER_SCRIPT}
chown ${AIRFLOW_USER}:${AIRFLOW_USER} ${AIRFLOW_WEBSERVER_SCRIPT}
|
cat > ${AIRFLOW_KERBEROS_SCRIPT} << EOF
#!/bin/bash
export PATH=${AIRFLOW_BASE}/anaconda2/bin:\$PATH
export AIRFLOW_HOME=${AIRFLOW_BASE}/app
export SPARK_MAJOR_VERSION=2
source ${AIRFLOW_BASE}/anaconda2/bin/activate airflow
${AIRFLOW_BASE}/anaconda2/bin/airflow kerberos
EOF
chmod a+x ${AIRFLOW_KERBEROS_SCRIPT}
chown ${AIRFLOW_USER}:${AIRFLOW_USER} ${AIRFLOW_KERBEROS_SCRIPT}
|
As root: download and modify the systemd unit files.
cd /etc/systemd/system
wget https://raw.githubusercontent.com/apache/incubator-airflow/master/scripts/systemd/airflow-scheduler.service
wget https://raw.githubusercontent.com/apache/incubator-airflow/master/scripts/systemd/airflow-webserver.service
wget https://raw.githubusercontent.com/apache/incubator-airflow/master/scripts/systemd/airflow-kerberos.service
sed -i "/EnvironmentFile/ s|.*|EnvironmentFile=${AIRFLOW_HOME}/airflow.cfg|" airflow-scheduler.service
sed -i "/User/ s|airflow|${AIRFLOW_USER}|" airflow-scheduler.service
sed -i "/Group/ s|airflow|${AIRFLOW_USER}|" airflow-scheduler.service
sed -i "/ExecStart/ s|.*|ExecStart=${AIRFLOW_SCHEDULER_SCRIPT}|" airflow-scheduler.service
sed -i "/EnvironmentFile/ s|.*|EnvironmentFile=${AIRFLOW_HOME}/airflow.cfg|" airflow-webserver.service
sed -i "/User/ s|airflow|${AIRFLOW_USER}|" airflow-webserver.service
sed -i "/Group/ s|airflow|${AIRFLOW_USER}|" airflow-webserver.service
sed -i "/ExecStart/ s|.*|ExecStart=${AIRFLOW_WEBSERVER_SCRIPT}|" airflow-webserver.service
sed -i "/EnvironmentFile/ s|.*|EnvironmentFile=${AIRFLOW_HOME}/airflow.cfg|" airflow-kerberos.service
sed -i "/User/ s|airflow|${AIRFLOW_USER}|" airflow-kerberos.service
sed -i "/Group/ s|airflow|${AIRFLOW_USER}|" airflow-kerberos.service
sed -i "/ExecStart/ s|.*|ExecStart=${AIRFLOW_KERBEROS_SCRIPT}|" airflow-kerberos.service
|
As root: download the systemd tmpfiles config
cd /etc/tmpfiles.d/
wget https://raw.githubusercontent.com/apache/incubator-airflow/master/scripts/systemd/airflow.conf
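This tmpfiles config is normally applied at boot; to apply it immediately (an alternative to the manual mkdir/chown below), you can run:
systemd-tmpfiles --create /etc/tmpfiles.d/airflow.conf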
|
Create a run directory:
mkdir /run/airflow
chown ${AIRFLOW_USER}:${AIRFLOW_USER} /run/airflow
|
Start the services
systemctl daemon-reload
# services should start automatically on reboot
systemctl enable airflow-scheduler
systemctl enable airflow-webserver
systemctl enable airflow-kerberos
# start the services now
systemctl start airflow-scheduler
systemctl start airflow-webserver
systemctl start airflow-kerberos
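Check that the services came up cleanly:
systemctl status airflow-scheduler airflow-webserver airflow-kerberos
journalctl -u airflow-scheduler -n 50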
|
Impersonation
Edit HDFS "Custom core-site" settings in Ambari, to add the hadoop.proxyuser properties as described in the Airflow documentation: https://airflow.apache.org/security.html#hadoop
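Assuming the Airflow service account is the airflow user created above (adjust the user name, and tighten the host/group values for your environment rather than using wildcards in production), the Custom core-site entries would look roughly like:
hadoop.proxyuser.airflow.hosts = *
hadoop.proxyuser.airflow.groups = *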
YARN Queue to Connection mapping
Spark jobs in Airflow interact with the YARN queues via Airflow's Connections settings. These Connections need to be configured by an Airflow admin, with one Connection created for each YARN queue. Airflow users can then direct jobs to the appropriate YARN queue by selecting the corresponding Connection when configuring their jobs.
Navigate to Admin → Connections to modify these settings.
One Connection is provided by default in Airflow, and this Connection will be used if the user does not specify a Connection. This Connection, named spark_default, should be edited to point to the default YARN queue for users.
· Conn Id: spark_default
· Host: yarn
· Extra: {"queue" : "default"}
Additional Connections should be added for each queue, as in the example below.
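For example, a Connection for a hypothetical "analytics" queue might look like:
· Conn Id: spark_analytics
· Host: yarn
· Extra: {"queue" : "analytics"}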
Optional - Altering System Config
# Change to the airflow user
sudo su - airflow
# Enter the airflow environment.
source activate airflow
# Alter the configuration file.
vim $AIRFLOW_HOME/airflow.cfg
# Alter the config file as needed.
# Start the system services
systemctl restart airflow-scheduler
systemctl restart airflow-webserver
systemctl restart airflow-kerberos
|
Optional - Using local user and group for SSL/TLS certs
As the root user:
groupadd -r ssl-cert
usermod -aG ssl-cert airflow
chgrp ssl-cert /etc/pki/tls/private/company.key
chmod g+r /etc/pki/tls/private/company.key
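To confirm the group membership and permissions (the airflow user may need to log out and back in to pick up the new group):
id airflow
ls -l /etc/pki/tls/private/company.key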
|
As the airflow user:
sed -i -e 's|web_server_ssl_cert =|web_server_ssl_cert = /etc/pki/tls/certs/company.crt|' $AIRFLOW_HOME/airflow.cfg
sed -i -e 's|web_server_ssl_key =|web_server_ssl_key = /etc/pki/tls/private/company.key|' $AIRFLOW_HOME/airflow.cfg
|
Optional - Set a password instead of configuring LDAP:
vim /opt/airflow/app/airflow.cfg
# Add the following lines
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
Run the following script to add a user and configure the username/password.
Make sure the database is set up first.
(airflow) [airflow@anaconda-01 ~]$ python
Python 2.7.14 |Anaconda, Inc.| (default, Dec 7 2017, 17:05:42)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
# Create a user object backed by the password auth backend
user = PasswordUser(models.User())
user.username = 'DL2'
user.email = 'user@company.com'
user.password = 'passwordhere'
# Persist the new user to the Airflow metadata database
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
Optional - Testing that Airflow Runs
Start the scheduler
airflow scheduler
^C
If there are no errors, then start as a daemon
airflow scheduler --daemon
Start the airflow web server on an unused port
airflow webserver -p 8086
^C
If there are no errors, then start as a daemon
airflow webserver -p 8086 --daemon
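A rough check that the test webserver is responding (assumes it is still running on port 8086 on this host):
curl -sI http://localhost:8086/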