Apache Hadoop YARN is the cluster manager for Hadoop MapReduce, but it can also be used by other compute frameworks such as Spark. YARN (Yet Another Resource Negotiator) was introduced in Hadoop 2.0 to split resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

  • ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. There is only one global ResourceManager in a YARN cluster and it has two main components:

    • Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc.
    • ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
  • NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.

  • ApplicationMaster is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress. The ApplicationMaster is run per-application.
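
To make the division of responsibilities concrete, here is a toy Python model of the components described above. It is only a sketch and not YARN's actual API: the class and method names (allocate, submit, request) are hypothetical, memory is the only resource tracked, and a simple first-fit policy stands in for the real Scheduler's capacity and queue rules.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: hosts containers and tracks usage (memory only here)."""
    name: str
    total_mb: int
    used_mb: int = 0

    def free_mb(self) -> int:
        return self.total_mb - self.used_mb


@dataclass
class Scheduler:
    """Allocates containers. First-fit is an assumption made for illustration."""
    nodes: list

    def allocate(self, app_id: str, mb: int):
        for node in self.nodes:
            if node.free_mb() >= mb:
                node.used_mb += mb
                return (app_id, node.name, mb)
        return None  # no node has enough free memory


@dataclass
class ApplicationMaster:
    """Per-application: negotiates containers from the Scheduler and tracks them."""
    app_id: str
    scheduler: Scheduler
    containers: list = field(default_factory=list)

    def request(self, count: int, mb: int) -> int:
        for _ in range(count):
            container = self.scheduler.allocate(self.app_id, mb)
            if container is None:
                break  # cluster is full; keep what was granted so far
            self.containers.append(container)
        return len(self.containers)


@dataclass
class ApplicationsManager:
    """Accepts submissions and negotiates the first container, which runs the AM."""
    scheduler: Scheduler

    def submit(self, app_id: str, am_mb: int) -> ApplicationMaster:
        first = self.scheduler.allocate(app_id, am_mb)
        if first is None:
            raise RuntimeError("no capacity for the ApplicationMaster container")
        return ApplicationMaster(app_id, self.scheduler)


# Two 4 GB nodes; the AM takes 1 GB, then asks for four 2 GB containers.
nodes = [NodeManager("node1", 4096), NodeManager("node2", 4096)]
scheduler = Scheduler(nodes)
apps_manager = ApplicationsManager(scheduler)
am = apps_manager.submit("app_1", am_mb=1024)
granted = am.request(count=4, mb=2048)
```

In this sketch the ApplicationsManager only launches the ApplicationMaster; every later container request goes from the ApplicationMaster to the Scheduler, which mirrors the split described in the list above. With the numbers used here, the cluster runs out of memory before all four requested containers can be granted.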

Spark On YARN

How Spark executors are started in YARN cluster mode: