Tuesday, October 16, 2018

Apache Airflow - Runbook

To evaluate an alternative scheduler, we tried Apache Airflow for scheduling Spark jobs.
Due to a known issue with Kerberos and Python 3 (see below), Python 2 had to be installed.

I really like Airflow, but as of 1.9 it doesn't handle user propagation very well. Multi-tenancy isn't supported in a fully secure way: in some cases, users can execute jobs as other users.

Following is the runbook to install Airflow.
An Ansible playbook that we used is here: 
https://github.com/infOpen/ansible-role-airflow
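Once Airflow is installed, scheduling a Spark job is just a small Python DAG. Below is a minimal sketch for the 1.9-era API using a BashOperator to call spark-submit; the DAG id, schedule, and paths are made up for illustration.

# Minimal Airflow 1.9-style DAG sketch: run spark-submit once a day.
# The dag_id, schedule, and job path are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    dag_id='spark_daily_example',
    default_args=default_args,
    start_date=datetime(2018, 10, 1),
    schedule_interval='@daily',
)

run_spark_job = BashOperator(
    task_id='spark_submit_example',
    bash_command='spark-submit --master yarn --deploy-mode cluster /jobs/example_job.py',
    dag=dag,
)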



Sunday, October 7, 2018

Managing Multi-Tenant Environments

Managing multi-tenancy on systems is a balancing act: administrators must prevent one tenant's actions from adversely affecting others, while still providing users the resources to do their jobs. Done correctly, cluster management becomes seamless, which greatly reduces fire-fighting and frees up time for improving the architecture.

I've had to tackle all of the following issues in production on various clusters with hundreds of users. Automation using Python scripting, scheduling, and orchestration is key to making your life easy.

  1. User Space Management
    This is critical in large environments, which may have hundreds of users.
    1. Onboarding/Offboarding
      Create home directories automatically when new users appear in AD groups (see the first sketch after this list).
    2. Directory Quotas
      Explicitly set a maximum quota on all file systems to prevent users from DDoSing the system.
    3. Directory Rights
      Prevent users from writing to directories they shouldn't, and allow access to universal resources.
    4. Symlinks and System Defaults
      Alter /etc/profile.d to re-alias system commands to the preferred ones.
  2. Resource Self-Service
    This is key: all accounts and resources should be provisioned automatically, so you don't have to hand them out to every user.
    1. ETL Mechanism
      Developers will need some way to populate the system (NiFi/StreamSets/Logstash).
    2. Scalable Buffer
      Kafka is your friend! Set this up with Kerberos and auto-topic creation in a dev environment.
    3. Processing Resources
      YARN queues for Hadoop, or cluster limits for Databricks.
      https://docs.databricks.com/administration-guide/cloud-configurations/aws/cmbp.html
    4. Scheduling Mechanism
      Set up multi-tenancy for users (Oozie/Airflow/Azkaban).
  3. User Examples
    Write some examples so users can become comfortable running their first job (a minimal Spark example follows this list).
    1. Job - A sample Spark job, an Elasticsearch watch, etc.
    2. Scheduling - An example of how to run the job periodically.
    3. Advanced Tasks - Examples showing how to perform geolocation, write a custom record reader, read from and write to databases, etc.
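As a concrete example of item 1.1, here is a minimal onboarding sketch in Python. The AD group name, quota size, and HDFS layout are assumptions for illustration; it shells out to the standard getent and hdfs commands.

# Hypothetical onboarding sketch: create an HDFS home directory, owner-only
# rights, and a space quota for every member of an AD group who lacks one.
# The group name, quota, and paths below are illustrative assumptions.
import subprocess

AD_GROUP = 'cluster-users'   # assumption: access is granted via this group
QUOTA = '500g'               # assumption: per-user space quota

def group_members(group):
    """Return the member list of a group via getent (resolves AD via SSSD)."""
    out = subprocess.check_output(['getent', 'group', group])
    return [m for m in out.decode().strip().split(':')[-1].split(',') if m]

def ensure_home(user):
    """Create /user/<name> with a quota if it does not already exist."""
    path = '/user/%s' % user
    if subprocess.call(['hdfs', 'dfs', '-test', '-d', path]) != 0:
        subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', path])
        subprocess.check_call(['hdfs', 'dfs', '-chown', '%s:%s' % (user, user), path])
        subprocess.check_call(['hdfs', 'dfs', '-chmod', '700', path])
        subprocess.check_call(['hdfs', 'dfsadmin', '-setSpaceQuota', QUOTA, path])

if __name__ == '__main__':
    for member in group_members(AD_GROUP):
        ensure_home(member)

Run something like this on a schedule so new users get their space without filing a ticket.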
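And for item 3.1, the first sample job can be tiny. Here is an illustrative PySpark job (the input path and app name are made up) that users can run with spark-submit:

# Minimal PySpark example job: count ERROR lines in a log file.
# The input path is illustrative; point it at real data on your cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('first-job-example').getOrCreate()

logs = spark.read.text('/data/examples/app.log')
error_count = logs.filter(logs.value.contains('ERROR')).count()
print('ERROR lines: %d' % error_count)

spark.stop()

Users submit it with something like: spark-submit --master yarn first_job.py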

Thursday, October 4, 2018

Decrypting NiFi Passwords

One of the great dangers of a multi-tenant environment is ensuring that all of the necessary data is properly archived. While migrating from one NiFi version to another, we realized that not all of the passwords had been stored in a password safe. This would have slowed down our migration significantly.

NiFi holds its passwords in the flow.xml.gz file, but they are encrypted. To recover the passwords, we had to find a way to decrypt all of them easily. Luckily, NiFi has a toolkit for just this!

Big props to my coworker, who did the following.

The encrypt-config.sh shell script calls the following class:
org.apache.nifi.toolkit.encryptconfig.EncryptConfigMain

On line 814, you can see the following code, where the password is decrypted.
String plaintext = decryptFlowElement(wrappedCipherText, existingFlowPassword, existingAlgorithm, existingProvider)

With one line of code, and a quick recompile, the script now outputs all of the formerly encrypted passwords.

Code to add after line 814:
// Log each wrapped cipher-text element alongside its decrypted value.
flowXmlContent.findAll(WRAPPED_FLOW_XML_CIPHER_TEXT_REGEX) { String wrappedCipherText ->
    logger.warn("Original: " + wrappedCipherText + "\t Decrypted: " +
            decryptFlowElement(wrappedCipherText, existingFlowPassword, existingAlgorithm, existingProvider))
}


This worked out really well: we can now archive all of the passwords in a password safe and continue with the migration.

Monday, September 24, 2018

Production Kafka Settings

The Confluent Developer Training has a lot of great information and examples.
The instructor kept saying, "And in production, you'll want to do this...". 
This is like the professor saying "This will be on the exam."

Following are some production configurations mentioned in the course.
Confluent also has an in-depth guide on production deployment settings (linked at the bottom of this post).
Cloudera has also written a very in-depth guide on Kafka setup (also linked below).

Kafka Brokers:

- Run at least 3 brokers
- 8 GB of RAM to start
- 32 GB on the host (less is counterproductive)
- 6 GB for the JVM heap (matching the flags below)
- Run the JVM with G1GC:
  -Xms6g -Xmx6g -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
  -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
  -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80

ZooKeeper

- Minimum 3 nodes, sometimes 5
- 16 GB RAM on the host
- 1 GB JVM heap
- One 64 GB SSD per node

Topic Settings

- Topic Replication: replication factor 3 (3 copies in total, including the leader)
- Set `min.insync.replicas` (typically 2 with replication factor 3) and have producers use `acks=all` (equivalent to `-1`)
- Turn off auto-creation in production: `auto.create.topics.enable=false`
- Turn off topic deletion: `delete.topic.enable=false`
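To make these settings concrete, here is a sketch that creates a production-style topic with the confluent-kafka Python client's AdminClient; the broker addresses, topic name, and partition count are illustrative.

# Create a topic with production-style settings (illustrative names/sizes).
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'broker1:9092,broker2:9092,broker3:9092'})

topic = NewTopic(
    'events',                      # illustrative topic name
    num_partitions=12,
    replication_factor=3,          # 3 copies, including the leader
    config={'min.insync.replicas': '2'},
)

# create_topics() returns a dict of topic name -> future
for name, future in admin.create_topics([topic]).items():
    future.result()  # raises on failure
    print('Created topic %s' % name)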

Altering Topics: increasing partitions on an existing topic

Adding partitions changes how keys map to partitions, which breaks ordering for keyed data, so there are two safe options:

Option 1: Only create new topics
Option 2: Shut down producers, increase partitions, then restart producers
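For option 2, partitions can be increased with the kafka-topics tool or programmatically; here is a sketch using the same confluent-kafka Python client (topic name and counts are illustrative).

# Raise the partition count of an existing topic (illustrative name/count).
# Per option 2 above, do this only after stopping the producers.
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({'bootstrap.servers': 'broker1:9092'})

# NewPartitions takes the topic and the NEW TOTAL partition count.
futures = admin.create_partitions([NewPartitions('events', 24)])
futures['events'].result()  # raises if the operation failed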

Kafka Connect:

- Use distributed mode for fault tolerance and availability
- Set the replication factor to 3 for Connect's internal topics
- Set cleanup.policy=compact on those topics
- Create the offsets topic (offset.storage.topic) with 50 partitions
- Create the status topic (status.storage.topic) with 10 partitions

Kafka Streams:

- List at least 2 brokers in bootstrap.servers in code, preferably 3
- Use a shutdown hook in Streams code to close the streams instance cleanly
- Job scaling is limited by the number of partitions of the input topic

Links:

- https://docs.confluent.io/current/kafka/deployment.html
- https://www.cloudera.com/documentation/enterprise/6/6.0/topics/kafka.html
