Spark Run Mode and Application Deployment Mode

Spark's running mode is often confused with the application deploy mode.

Spark Running Mode

Spark can run on a single local machine, or on a cluster manager such as Mesos or YARN to use the resources (memory, CPU, and so on) of the whole cluster.

Run Locally

In local mode, Spark jobs run on a single machine and are executed in parallel using multiple threads. This restricts parallelism to, at most, the number of cores in your machine.
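To make the distinction concrete, here is a minimal sketch of how the two settings appear on a spark-submit command line: the running mode is chosen with --master, and the deploy mode with --deploy-mode. The helper names spark_submit_args and local_master, and the file name app.py, are hypothetical and used only for illustration. The flags and the master values local[N], local[*], and yarn are standard spark-submit options.

```python
def local_master(threads=None):
    """Return a local master URL.

    local[*] asks for one worker thread per available core;
    local[N] asks for exactly N worker threads.
    """
    return "local[*]" if threads is None else f"local[{threads}]"


def spark_submit_args(master, deploy_mode, app):
    """Build a spark-submit argument list (hypothetical helper).

    master      -> running mode: where the work runs (local, yarn, ...)
    deploy_mode -> where the driver runs: "client" or "cluster"
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    # spark-submit rejects cluster deploy mode when the master is local.
    if master.startswith("local") and deploy_mode == "cluster":
        raise ValueError("cluster deploy mode is not compatible with a local master")
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app]


if __name__ == "__main__":
    # Local running mode, using every core on this machine.
    print(" ".join(spark_submit_args(local_master(), "client", "app.py")))
    # YARN running mode, with the driver launched inside the cluster.
    print(" ".join(spark_submit_args("yarn", "cluster", "app.py")))
```

The two arguments vary independently except for the local case, which is one reason the terms are easy to mix up: the same yarn master can be paired with either client or cluster deploy mode.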

YARN Basics

Apache Hadoop YARN is the cluster manager for Hadoop MapReduce, but it can also be used by other compute frameworks such as Spark. YARN (Yet Another Resource Negotiator) was introduced in Hadoop 2.0 to split resource management and job scheduling/monitoring into separate daemons. The idea is to have one global ResourceManager (RM) and one ApplicationMaster (AM) per application. An application is either a single job or a DAG of jobs.

How to Extend Spark

In this post, we walk through examples of extending a Spark application and of extending the Spark APIs. The two kinds of extension are sometimes related; we start with extending a Spark application.