Tuesday, October 16, 2018

Apache Airflow - Runbook

To evaluate an alternative scheduler, we tried Apache Airflow for scheduling Spark jobs.
Due to a known issue with Kerberos and Python 3 (see below), Python 2 had to be installed.

I really like Airflow, but as of 1.9 it doesn't handle user propagation very well. Multi-tenancy isn't supported in a fully secure way: in some cases, users can execute jobs as other users.

Following is the runbook to install Airflow.
An Ansible playbook that we used is here: 
https://github.com/infOpen/ansible-role-airflow
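Once Airflow is installed, scheduling a Spark job is just a small Python DAG. Below is a minimal sketch for the 1.9-era API using a BashOperator to call spark-submit; the DAG id, schedule, and paths are made up for illustration.

# Minimal Airflow 1.9-style DAG sketch: run spark-submit once a day.
# The dag_id, schedule, and job path are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    dag_id='spark_daily_example',
    default_args=default_args,
    start_date=datetime(2018, 10, 1),
    schedule_interval='@daily',
)

run_spark_job = BashOperator(
    task_id='spark_submit_example',
    bash_command='spark-submit --master yarn --deploy-mode cluster /jobs/example_job.py',
    dag=dag,
)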



Sunday, October 7, 2018

Managing Multi-Tenant Environments

Managing multi-tenancy on systems is a balancing act: administrators must prevent one tenant's actions from adversely affecting others, while still providing users the resources to do their jobs. Done correctly, cluster management becomes seamless, which greatly reduces fire-fighting and frees up time for improving the architecture.

I've had to tackle all of the following issues in production on various clusters with hundreds of users. Automation using Python scripting, scheduling, and orchestration is key to making your life easy.

  1. User Space Management
    This is critical in large environments, which may have hundreds of users.
    1. Onboarding/Offboarding
      Create home directories automatically when new users appear in AD groups (see the first sketch after this list).
    2. Directory Quotas
      Explicitly set a maximum quota on all file systems to prevent users from DDoSing the system.
    3. Directory Rights
      Prevent users from writing to directories they shouldn't, and allow access to universal resources.
    4. Symlinks and System Defaults
      Alter /etc/profile.d to re-alias system commands to the preferred ones.
  2. Resource Self-Service
    This is key: all accounts and resources should be provisioned automatically, so you don't have to hand them out to every user.
    1. ETL Mechanism
      Developers will need some way to populate the system (NiFi/StreamSets/Logstash).
    2. Scalable Buffer
      Kafka is your friend! Set this up with Kerberos and auto-topic creation in a dev environment.
    3. Processing Resources
      YARN queues for Hadoop, or cluster limits for Databricks.
      https://docs.databricks.com/administration-guide/cloud-configurations/aws/cmbp.html
    4. Scheduling Mechanism
      Set up multi-tenancy for users (Oozie/Airflow/Azkaban).
  3. User Examples
    Write some examples so users can become comfortable running their first job (a minimal Spark example follows this list).
    1. Job - A sample Spark job, an Elasticsearch watch, etc.
    2. Scheduling - An example of how to run the job periodically.
    3. Advanced Tasks - Examples showing how to perform geolocation, write a custom record reader, read from and write to databases, etc.
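As a concrete example of item 1.1, here is a minimal onboarding sketch in Python. The AD group name, quota size, and HDFS layout are assumptions for illustration; it shells out to the standard getent and hdfs commands.

# Hypothetical onboarding sketch: create an HDFS home directory, owner-only
# rights, and a space quota for every member of an AD group who lacks one.
# The group name, quota, and paths below are illustrative assumptions.
import subprocess

AD_GROUP = 'cluster-users'   # assumption: access is granted via this group
QUOTA = '500g'               # assumption: per-user space quota

def group_members(group):
    """Return the member list of a group via getent (resolves AD via SSSD)."""
    out = subprocess.check_output(['getent', 'group', group])
    return [m for m in out.decode().strip().split(':')[-1].split(',') if m]

def ensure_home(user):
    """Create /user/<name> with a quota if it does not already exist."""
    path = '/user/%s' % user
    if subprocess.call(['hdfs', 'dfs', '-test', '-d', path]) != 0:
        subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', path])
        subprocess.check_call(['hdfs', 'dfs', '-chown', '%s:%s' % (user, user), path])
        subprocess.check_call(['hdfs', 'dfs', '-chmod', '700', path])
        subprocess.check_call(['hdfs', 'dfsadmin', '-setSpaceQuota', QUOTA, path])

if __name__ == '__main__':
    for member in group_members(AD_GROUP):
        ensure_home(member)

Run something like this on a schedule so new users get their space without filing a ticket.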
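And for item 3.1, the first sample job can be tiny. Here is an illustrative PySpark job (the input path and app name are made up) that users can run with spark-submit:

# Minimal PySpark example job: count ERROR lines in a log file.
# The input path is illustrative; point it at real data on your cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('first-job-example').getOrCreate()

logs = spark.read.text('/data/examples/app.log')
error_count = logs.filter(logs.value.contains('ERROR')).count()
print('ERROR lines: %d' % error_count)

spark.stop()

Users submit it with something like: spark-submit --master yarn first_job.py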

Thursday, October 4, 2018

Decrypting NiFi Passwords

One of the great dangers of a multi-tenant environment is ensuring that all of the necessary data is properly archived. While migrating from one NiFi version to another, we realized that not all of the passwords had been stored in a password safe. This would have slowed down our migration significantly.

NiFi holds its passwords in the flow.xml.gz file, but they are encrypted. To recover the passwords, we had to find a way to decrypt all of them easily. Luckily, NiFi has a toolkit for just this!

Big props to my coworker, who did the following.

The encrypt-config.sh shell script calls the following class:
org.apache.nifi.toolkit.encryptconfig.EncryptConfigMain

On line 814, you can see the following code, where the password is decrypted.
String plaintext = decryptFlowElement(wrappedCipherText, existingFlowPassword, existingAlgorithm, existingProvider)

With one line of code, and a quick recompile, the script now outputs all of the formerly encrypted passwords.

Code to add after line 814:
// Log each wrapped cipher-text element alongside its decrypted value.
flowXmlContent.findAll(WRAPPED_FLOW_XML_CIPHER_TEXT_REGEX) { String wrappedCipherText ->
    logger.warn("Original: " + wrappedCipherText + "\t Decrypted: " +
            decryptFlowElement(wrappedCipherText, existingFlowPassword, existingAlgorithm, existingProvider))
}


This worked out really well: we can now archive all of the passwords in a password safe and continue with the migration.

Monday, September 24, 2018

Production Kafka Settings

The Confluent Developer Training has a lot of great information and examples.
The instructor kept saying, "And in production, you'll want to do this...". 
This is like the professor saying "This will be on the exam."

Following are some production configurations mentioned in the course.
Confluent also has an in-depth guide on production deployment settings (linked at the bottom of this post).
Cloudera has also written a very in-depth guide on Kafka setup (also linked below).

Kafka Brokers:

- Run at least 3 brokers
- 8 GB of RAM to start
- 32 GB on the host (less is counterproductive)
- 6 GB for the JVM heap (matching the flags below)
- Run the JVM with G1GC:
  -Xms6g -Xmx6g -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
  -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
  -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80

ZooKeeper

- Minimum 3 nodes, sometimes 5
- 16 GB RAM on the host
- 1 GB JVM heap
- One 64 GB SSD per node

Topic Settings

- Topic Replication: replication factor 3 (3 copies in total, including the leader)
- Set `min.insync.replicas` (typically 2 with replication factor 3) and have producers use `acks=all` (equivalent to `-1`)
- Turn off auto-creation in production: `auto.create.topics.enable=false`
- Turn off topic deletion: `delete.topic.enable=false`
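To make these settings concrete, here is a sketch that creates a production-style topic with the confluent-kafka Python client's AdminClient; the broker addresses, topic name, and partition count are illustrative.

# Create a topic with production-style settings (illustrative names/sizes).
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'broker1:9092,broker2:9092,broker3:9092'})

topic = NewTopic(
    'events',                      # illustrative topic name
    num_partitions=12,
    replication_factor=3,          # 3 copies, including the leader
    config={'min.insync.replicas': '2'},
)

# create_topics() returns a dict of topic name -> future
for name, future in admin.create_topics([topic]).items():
    future.result()  # raises on failure
    print('Created topic %s' % name)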

Altering Topics: increasing partitions on an existing topic

Adding partitions changes how keys map to partitions, which breaks ordering for keyed data, so there are two safe options:

Option 1: Only create new topics
Option 2: Shut down producers, increase partitions, then restart producers
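For option 2, partitions can be increased with the kafka-topics tool or programmatically; here is a sketch using the same confluent-kafka Python client (topic name and counts are illustrative).

# Raise the partition count of an existing topic (illustrative name/count).
# Per option 2 above, do this only after stopping the producers.
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({'bootstrap.servers': 'broker1:9092'})

# NewPartitions takes the topic and the NEW TOTAL partition count.
futures = admin.create_partitions([NewPartitions('events', 24)])
futures['events'].result()  # raises if the operation failed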

Kafka Connect:

- Use distributed mode for fault tolerance and availability
- Set the replication factor to 3 for Connect's internal topics
- Set cleanup.policy=compact on those topics
- Create the offsets topic (offset.storage.topic) with 50 partitions
- Create the status topic (status.storage.topic) with 10 partitions

Kafka Streams:

- List at least 2 brokers in bootstrap.servers in code, preferably 3
- Use a shutdown hook in Streams code to close the streams instance cleanly
- Job scaling is limited by the number of partitions of the input topic

Links:

- https://docs.confluent.io/current/kafka/deployment.html
- https://www.cloudera.com/documentation/enterprise/6/6.0/topics/kafka.html
