Understand Airflow

Key concepts

For context around the terms used in this blog post, here are a few key concepts for Airflow:

  • DAG (Directed Acyclic Graph): a workflow that ties tasks together along with their inter-dependencies.
  • Operator: a template for a specific type of work to be executed. For example, BashOperator represents how to execute a Bash script, while PythonOperator represents how to execute a Python function.
  • Sensor: a special type of operator that waits until a certain condition is met, so that downstream tasks run only once the condition holds.
  • Task: a parameterized instance of an operator/sensor which represents a unit of actual work to be executed.
  • Plugin: an extension to allow users to easily extend Airflow with various custom hooks, operators, sensors, macros, and web views.
  • Pool: a concurrency limit for an arbitrary set of Airflow tasks, used to cap execution parallelism.
  • Connection: stored authentication and connection details for an external system such as a database or FTP server.
  • Variable: arbitrary content or settings stored and retrieved as a simple key-value pair.
  • XCom: a mechanism for sharing key-value pairs between otherwise independent tasks.
  • Hook: an interface for reaching external platforms and databases.
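
To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use Airflow's API: the run_workflow helper, the task names, and the results dictionary (which only loosely plays the role XCom plays in Airflow) are all assumptions made for illustration. It only shows how dependencies between tasks determine a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, upstream):
    """Run each callable in `tasks` once all of its upstream tasks have run.

    tasks:    {task_id: callable that receives a dict of earlier results}
    upstream: {task_id: set of task_ids that must finish first}
    """
    # Give every task an entry so tasks without dependencies are included.
    graph = {task_id: upstream.get(task_id, set()) for task_id in tasks}
    results = {}
    order = []
    # static_order() raises graphlib.CycleError when the graph has a cycle,
    # which is why a workflow has to be *acyclic*.
    for task_id in TopologicalSorter(graph).static_order():
        results[task_id] = tasks[task_id](results)
        order.append(task_id)
    return order, results


# Hypothetical three-step workflow: extract -> transform -> load.
tasks = {
    "extract": lambda results: [1, 2, 3],
    "transform": lambda results: [x * 10 for x in results["extract"]],
    "load": lambda results: sum(results["transform"]),
}
upstream = {"transform": {"extract"}, "load": {"transform"}}

order, results = run_workflow(tasks, upstream)
print(order)            # ['extract', 'transform', 'load']
print(results["load"])  # 60
```

In Airflow itself the scheduler and executor do this work, and each task is an instance of an operator or sensor rather than a bare function.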

Spark Run Mode and Application Deployment Mode

Spark running mode is often confused with application deploy mode.

Spark Running Mode

Spark can run on a single local machine or on a cluster manager like Mesos or YARN to leverage the resources (memory, CPU, and so on) across the cluster.

Run Locally

In local mode, Spark jobs run on a single machine and are executed in parallel using multi-threading: this restricts parallelism to (at most) the number of cores in your machine.
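
Local mode is selected through the master URL, which Spark accepts as local (one worker thread), local[K] (K worker threads), or local[*] (one thread per logical core). The sketch below is a rough illustration of how those forms map to a thread count. The local_worker_threads helper is hypothetical and not part of Spark, and it ignores other master URL variants.

```python
import os
import re


def local_worker_threads(master):
    """Return how many worker threads a local-mode master URL asks for."""
    if master == "local":
        return 1  # plain "local" runs with a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local-mode master URL: {master!r}")
    if match.group(1) == "*":
        # "local[*]" means one thread per logical core on this machine.
        return os.cpu_count() or 1
    return int(match.group(1))


print(local_worker_threads("local"))     # 1
print(local_worker_threads("local[4]"))  # 4
print(local_worker_threads("local[*]"))  # depends on the machine
```

With PySpark installed, the same string would typically be passed as SparkSession.builder.master("local[*]") or as the --master option of spark-submit.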

YARN Basics

Apache Hadoop YARN is the cluster manager for Hadoop MapReduce, but it can also be used for other compute frameworks such as Spark. YARN (Yet Another Resource Negotiator) was introduced in Hadoop 2.0 to split the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.