Diving Into Airflow Scheduler

In the last article we covered the basic concepts and architecture of Airflow, and saw that Airflow has three major components: the webserver, the scheduler, and the executor. This article looks at the scheduler in more detail by diving into some of the source code (version: 1.10.1).

Understanding Airflow

Key concepts

For context around the terms used in this blog post, here are a few key concepts for Airflow:

  • DAG (Directed Acyclic Graph): a workflow that glues tasks together with their inter-dependencies (see the sketch after this list).
  • Operator: a template for a specific type of work to be executed. For example, BashOperator describes how to execute a bash script, while PythonOperator describes how to execute a Python function.
  • Sensor: a special type of operator that keeps checking until a certain condition is met before downstream work can proceed.
  • Task: a parameterized instance of an operator/sensor which represents a unit of actual work to be executed.
  • Plugin: an extension that lets users easily extend Airflow with custom hooks, operators, sensors, macros, and web views.
  • Pool: a concurrency limit that caps how many tasks from an arbitrary set can run in parallel.
  • Connection: stores the authentication details for external systems such as databases, FTP servers, etc.
  • Variable: a simple key-value store for arbitrary content or settings.
  • XCom: a mechanism for sharing small key/value messages between tasks.
  • Hook: an interface for reaching external platforms and databases.
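To make the first few concepts concrete, here is a minimal sketch that wires two tasks into a DAG using the Airflow 1.10.x API. The DAG id, schedule, and the print_date callable are illustrative choices, not code from the article.

```python
# A minimal sketch of the concepts above, based on the Airflow 1.10.x API.
# The DAG id, schedule, and the print_date() callable are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def print_date(**context):
    # A trivial Python callable wrapped by PythonOperator.
    print("execution date:", context["ds"])


# The DAG glues the tasks together with their dependencies.
dag = DAG(
    dag_id="example_key_concepts",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Each operator instance below is a task (a parameterized unit of work).
t1 = BashOperator(task_id="print_hello", bash_command="echo hello", dag=dag)
t2 = PythonOperator(
    task_id="print_date",
    python_callable=print_date,
    provide_context=True,  # required in 1.10.x to pass the context kwargs
    dag=dag,
)

# t2 runs only after t1 succeeds.
t1 >> t2
```

Dropping a file like this into the DAGs folder is all it takes for the scheduler to pick it up and start creating task instances for each run, which is the behaviour the rest of this article digs into.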

Spark Run Mode and Application Deployment Mode

Spark running mode is often confused with application deploy mode.

Spark Running Mode

Spark can run on a single local machine or on a cluster manager like Mesos or YARN to leverage the resources (memory, CPU, and so on) across the cluster.

Run Locally

In local mode, Spark jobs run on a single machine and are executed in parallel using multi-threading: this restricts parallelism to (at most) the number of cores in your machine.
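As a quick, hypothetical illustration of local mode: a PySpark session can be created with the master URL local[*], which starts one worker thread per available core, while pointing the master at YARN or Mesos instead hands the same job to a cluster manager. The snippet below is a minimal sketch under those assumptions; the app name and the toy job are made up.

```python
# A minimal sketch (not from the article): running Spark in local mode.
# "local[*]" starts one worker thread per CPU core on this machine,
# which is exactly the parallelism limit described above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # local mode; use "yarn" or a mesos:// URL for a cluster manager
    .appName("local-mode-demo")  # illustrative app name
    .getOrCreate()
)

# A tiny job to show the session works; parallelism is capped by the core count.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()
```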