Tuesday, October 16, 2018

Apache Airflow - Runbook

To try out a different scheduler, we evaluated Apache Airflow for scheduling Spark jobs.
Due to a known issue with Kerberos and Python 3 (see below), Python 2 had to be installed.

I really like Airflow, but as of 1.9 it doesn't handle user propagation very well. Multi-tenancy isn't supported in a fully secure way: in some cases, users can execute jobs as other users.

Following is the runbook to install Airflow.
The Ansible role we used is here:
https://github.com/infOpen/ansible-role-airflow
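
As a minimal illustration of the kind of job we scheduled, here is a sketch of an Airflow 1.9-style DAG that wraps spark-submit in a BashOperator. The DAG id, owner, schedule, and application path are all hypothetical; adjust them for your environment.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Ownership and retry settings; the values here are placeholders.
default_args = {
    'owner': 'etl',
    'start_date': datetime(2018, 10, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('spark_daily_job', default_args=default_args,
          schedule_interval='@daily')

# spark-submit runs as the worker's OS user, which is exactly where
# the user-propagation limitation mentioned above shows up.
run_spark = BashOperator(
    task_id='spark_submit',
    bash_command='spark-submit --master yarn --deploy-mode cluster '
                 '/opt/jobs/daily_job.py',  # hypothetical path
    dag=dag)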


Sunday, October 7, 2018

Managing Multi-Tenant Environments

Managing multi-tenancy on shared systems is a balancing act. Administrators must prevent one tenant's actions from adversely affecting others, while still providing users the resources to do their jobs. Done correctly, cluster management is seamless, which greatly reduces fire-fighting and frees up time to spend improving the architecture.

I've had to tackle all of the following issues in production on various clusters with hundreds of users. Automation using Python scripting, scheduling, and orchestration is key to making your life easy.

  1. User Space Management
    This is critical in large environments, which may have hundreds of users. 
    1. Onboarding/Offboarding
      Create home directories automatically when new users appear in AD groups (see the first sketch after this list).
    2. Directory Quotas
      Explicitly set quotas to a maximum on all file systems, so a single user can't DDoS the system by filling the disks (see the first sketch after this list).
    3. Directory Rights
      Prevent users from writing to directories they shouldn't, and allow access to universal resources.
    4. Symlinks and System Defaults
      Alter /etc/profile.d to re-alias system commands to the preferred ones.
  2. Resource Self-Service
    This is key: all accounts and resources should be provisioned automatically, so you don't have to hand them out to every user.
    1. ETL Mechanism
      Developers will need some way to populate the system (NiFi/StreamSets/Logstash).
    2. Scalable Buffer
      Kafka is your friend! Set this up with Kerberos and auto-topic creation in a dev environment (see the producer sketch after this list).
    3. Processing Resources
      YARN queues for Hadoop or cluster limits for Databricks.
      https://docs.databricks.com/administration-guide/cloud-configurations/aws/cmbp.html
    4. Scheduling Mechanism
      Set up multi-tenancy for users (Oozie/Airflow/Azkaban).
  3. User Examples
    Write some examples, so users can become comfortable running their first job.
    1. Job - Sample Spark job/Elasticsearch Watch/etc.
    2. Scheduling - An example of how to run the job periodically (the DAG sketch in the Airflow post above is one such example).
    3. Advanced Tasks - Examples showing how to perform geolocation, write a custom record reader, handle reading from and writing to databases, etc.
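
To make the user-space items (1.1 and 1.2) concrete, here is a minimal sketch of the kind of Python automation I mean. It resolves AD group membership through NSS with getent (this assumes the hosts are joined to the domain via SSSD or Winbind), creates any missing HDFS home directories, and caps each one with a space quota. The group name, home root, and quota value are all hypothetical; substitute your own.

import subprocess

AD_GROUP = 'hadoop-users'    # hypothetical AD group
HDFS_HOME_ROOT = '/user'     # HDFS home directory root
SPACE_QUOTA = '100g'         # hypothetical per-user quota

def group_members(group):
    # getent output looks like: name:passwd:gid:member1,member2,...
    out = subprocess.check_output(['getent', 'group', group])
    members = out.decode().strip().split(':')[3]
    return [m for m in members.split(',') if m]

def onboard(user):
    home = '%s/%s' % (HDFS_HOME_ROOT, user)
    # Create the home directory if missing, hand it over to the user,
    # and cap its size so one user can't fill the cluster.
    subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', home])
    subprocess.check_call(['hdfs', 'dfs', '-chown', user, home])
    subprocess.check_call(['hdfs', 'dfsadmin', '-setSpaceQuota',
                           SPACE_QUOTA, home])

if __name__ == '__main__':
    for user in group_members(AD_GROUP):
        onboard(user)

Run on a schedule, this turns onboarding into a non-event: a user shows up in the AD group, and their home directory and quota appear on the next pass.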
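
For the Kafka buffer (2.2), here is a sketch of what a tenant's first Kerberized producer might look like using the kafka-python client. It assumes a valid ticket from kinit and SASL-enabled brokers; the broker address and topic name are hypothetical.

from kafka import KafkaProducer

# Assumes a valid Kerberos ticket (kinit) and SASL-enabled brokers.
producer = KafkaProducer(
    bootstrap_servers=['broker1.example.com:9092'],  # hypothetical broker
    security_protocol='SASL_PLAINTEXT',
    sasl_mechanism='GSSAPI',
    sasl_kerberos_service_name='kafka')

# With auto.create.topics.enable=true on the brokers, this first send
# creates the topic, so a new tenant needs no admin intervention.
producer.send('dev-events', b'hello from a new tenant')
producer.flush()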

Thursday, October 4, 2018

Decrypting NiFi Passwords

One of the great dangers of a multi-tenant environment is failing to ensure that all of the necessary data is properly archived. While migrating from one NiFi version to another, we realized that not all of the passwords had been stored in a password safe. This would have slowed down our migration significantly.

NiFi holds its passwords in the flow.xml.gz file, but they are encrypted. To recover the passwords, we had to find a way to decrypt all of them easily. Luckily, NiFi has a toolkit for just this!

Big props to my coworker, who did the following.

The encrypt-config.sh shell script calls the following class:
org.apache.nifi.toolkit.encryptconfig.EncryptConfigMain

On line 814, you can see the following code, where the password is decrypted:
String plaintext = decryptFlowElement(wrappedCipherText, existingFlowPassword, existingAlgorithm, existingProvider)

With one line of code, and a quick recompile, the script now outputs all of the formerly encrypted passwords.

Code to add after line 814:
flowXmlContent.findAll(WRAPPED_FLOW_XML_CIPHER_TEXT_REGEX) { String wrappedCipherText ->
    // Log each wrapped cipher text alongside its decrypted plaintext value
    logger.warn("Original: " + wrappedCipherText + "\t Decrypted: " +
            decryptFlowElement(wrappedCipherText, existingFlowPassword, existingAlgorithm, existingProvider))
}


This worked out really well; we can now archive all of the passwords in a password safe and continue with the migration.
