In the last article, we covered the basic concepts and architecture of Airflow, and we saw that Airflow has three major components: the webserver, the scheduler, and the executor. This article will look at the scheduler in detail by diving into some of the source code (version: 1.10.1).
For context around the terms used in this blog post, here are a few key concepts for Airflow (a minimal example DAG tying several of them together follows the list):
- DAG (Directed Acyclic Graph): a workflow that glues tasks together with their inter-dependencies.
- Operator: a template for a specific type of work to be executed. For example, BashOperator represents how to execute a bash script, while PythonOperator represents how to execute a Python function, etc.
- Sensor: a special type of operator that waits (polls) until a certain condition is met before it succeeds.
- Task: a parameterized instance of an operator/sensor which represents a unit of actual work to be executed.
- Plugin: an extension that allows users to easily extend Airflow with custom hooks, operators, sensors, macros, and web views.
- Pools: a concurrency limit configuration for an arbitrary set of Airflow tasks.
- Connections: authentication details for external systems such as databases or FTP servers.
- Variables: a simple key-value store for arbitrary content or settings.
- XCom: a mechanism for sharing small key/value messages between tasks.
- Hooks: interfaces for reaching external platforms and databases.
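To make these terms concrete, here is a minimal DAG file sketch. It assumes Airflow 1.10.x import paths; the DAG id, task ids, and callable names are made up for illustration. It shows a DAG wiring a BashOperator and a PythonOperator together as two dependent tasks, with an XCom passed between them.

```python
# Minimal example DAG (sketch, assuming Airflow 1.10.x import paths).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# The DAG object glues tasks together and defines the schedule.
dag = DAG(
    dag_id="example_concepts",          # hypothetical DAG id, for illustration only
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Each operator instantiation below becomes a task (a node in the DAG).
t1 = BashOperator(
    task_id="print_date",
    bash_command="date",
    xcom_push=True,                     # push the last line of stdout as an XCom
    dag=dag,
)

def show_date(**context):
    # Pull the value pushed by the upstream task via XCom.
    pushed = context["ti"].xcom_pull(task_ids="print_date")
    print("upstream pushed:", pushed)

t2 = PythonOperator(
    task_id="show_date",
    python_callable=show_date,
    provide_context=True,               # pass the task instance context into the callable
    dag=dag,
)

# Dependencies: t1 runs before t2.
t1 >> t2
```

Once a file like this is placed in the DAGs folder, the scheduler parses it, creates task instances for each schedule interval, and hands runnable tasks to the executor, which is exactly the machinery we examine in the rest of this article.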