Spark Job Scheduling

Overview

Spark has several facilities for scheduling resources between computations. First, recall that, as described in the cluster mode overview, each Spark application (instance of SparkContext) runs an independent set of executor processes. The cluster managers that Spark runs on provide facilities for scheduling across applications. Second, within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads. This is common if your application is serving requests over the network; for example, the Shark server works this way. Spark includes a fair scheduler to schedule resources within each SparkContext.

Scheduling Across Applications

When running on a cluster, each Spark application gets an independent set of executor JVMs that only run tasks and store data for that application. If multiple users need to share your cluster, there are different options to manage allocation, depending on the cluster manager.

The simplest option, available on all cluster managers, is static partitioning of resources. With this approach, each application is given a maximum amount of resources it can use, and holds onto them for its whole duration. This is the approach used in Spark’s standalone and YARN modes, as well as the coarse-grained Mesos mode. Resource allocation can be configured as follows, based on the cluster type:

  • Standalone mode: By default, applications submitted to the standalone mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes. You can limit the number of nodes an application uses by setting the spark.cores.max configuration property in it, or change the default for applications that don’t set this setting through spark.deploy.defaultCores. Finally, in addition to controlling cores, each application’s spark.executor.memory setting controls its memory use (see the configuration sketch after this list).
  • Mesos: To use static partitioning on Mesos, set the spark.mesos.coarse configuration property to true, and optionally set spark.cores.max to limit each application’s resource share as in the standalone mode. You should also set spark.executor.memory to control the executor memory.
  • YARN: The --num-executors option to the Spark YARN client controls how many executors it will allocate on the cluster, while --executor-memory and --executor-cores control the resources per executor.
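
As a rough sketch of the standalone-mode settings above (the master URL, application name, and sizes below are placeholder values, not taken from the original text), an application can cap its static resource share when building its SparkContext:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone-mode configuration: cap the application's total
// cores and fix the per-executor memory for its whole lifetime.
val conf = new SparkConf()
  .setMaster("spark://master:7077")      // placeholder standalone master URL
  .setAppName("StaticAllocationExample") // placeholder application name
  .set("spark.cores.max", "8")           // maximum cores this application may hold
  .set("spark.executor.memory", "2g")    // memory allocated to each executor
val sc = new SparkContext(conf)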

A second option available on Mesos is dynamic sharing of CPU cores. In this mode, each Spark application still has a fixed and independent memory allocation (set by spark.executor.memory), but when the application is not running tasks on a machine, other applications may run tasks on those cores. This mode is useful when you expect large numbers of not overly active applications, such as shell sessions from separate users. However, it comes with a risk of less predictable latency, because it may take a while for an application to gain back cores on one node when it has work to do. To use this mode, simply use a mesos:// URL without setting spark.mesos.coarse to true.
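
As a minimal sketch of fine-grained mode (the Mesos master address and the values here are placeholders), the application simply uses a mesos:// URL and does not set spark.mesos.coarse, while its memory share stays fixed:

import org.apache.spark.{SparkConf, SparkContext}

// Fine-grained Mesos mode: cores are shared dynamically with other applications,
// but each executor's memory allocation stays fixed.
val mesosConf = new SparkConf()
  .setMaster("mesos://host:5050")       // placeholder Mesos master URL
  .setAppName("DynamicSharingExample")  // placeholder application name
  .set("spark.executor.memory", "2g")   // memory share is still static
  // spark.mesos.coarse is deliberately left unset, so fine-grained mode is used
val sc = new SparkContext(mesosConf)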

Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs. For example, the Shark JDBC server works this way for SQL queries. In future releases, in-memory storage systems such as Tachyon will provide another approach to share RDDs.

Scheduling Within an Application

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
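
As a hedged illustration of this (the thread setup and job bodies below are invented for the example and assume sc is an existing SparkContext), two threads can call actions on the same context and their jobs will be scheduled concurrently:

// Two jobs submitted from separate threads against one SparkContext.
val jobA = new Thread(new Runnable {
  def run(): Unit = {
    // first job: a map followed by the count action
    val n = sc.parallelize(1 to 1000000).map(_ * 2).count()
    println("jobA counted " + n + " elements")
  }
})
val jobB = new Thread(new Runnable {
  def run(): Unit = {
    // second, shorter job: sum over a small dataset
    println("jobB sum = " + sc.parallelize(1 to 1000).sum())
  }
})
jobA.start(); jobB.start()
jobA.join(); jobB.join()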

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

To enable the fair scheduler, simply set the spark.scheduler.mode property to FAIR when configuring a SparkContext:

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

Fair Scheduler Pools

The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares. This approach is modeled after the Hadoop Fair Scheduler.

Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them. This is done as follows:

// Assuming sc is your SparkContext variable
sc.setLocalProperty("spark.scheduler.pool", "pool1")

After setting this local property, all jobs submitted within this thread (by calls in this thread to RDD.save, count, collect, etc) will use this pool name. The setting is per-thread to make it easy to have a thread run multiple jobs on behalf of the same user. If you’d like to clear the pool that a thread is associated with, simply call:

sc.setLocalProperty("spark.scheduler.pool", null)
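
Putting the per-thread property together, here is a hedged sketch (the pool names, datasets, and the helper itself are illustrative and assume sc is your SparkContext) of one thread per pool submitting its own jobs:

// Each thread tags its own jobs with a scheduler pool before running an action.
def runInPool(pool: String, data: Seq[Int]): Thread = {
  val t = new Thread(new Runnable {
    def run(): Unit = {
      sc.setLocalProperty("spark.scheduler.pool", pool) // affects this thread only
      println(pool + " count = " + sc.parallelize(data).count())
      sc.setLocalProperty("spark.scheduler.pool", null) // detach the thread from the pool
    }
  })
  t.start()
  t
}
val threads = Seq(runInPool("production", 1 to 100000), runInPool("test", 1 to 1000))
threads.foreach(_.join())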

Default Behavior of Pools

By default, each pool gets an equal share of the cluster (also equal in share to each job in the default pool), but inside each pool, jobs run in FIFO order. For example, if you create one pool per user, this means that each user will get an equal share of the cluster, and that each user’s queries will run in order instead of later queries taking resources from that user’s earlier ones.

Configuring Pool Properties

Specific pools’ properties can also be modified through a configuration file. Each pool supports three properties:

  • schedulingMode: This can be FIFO or FAIR, to control whether jobs within the pool queue up behind each other (the default) or share the pool’s resources fairly.
  • weight: This controls the pool’s share of the cluster relative to other pools. By default, all pools have a weight of 1. If you give a specific pool a weight of 2, for example, it will get 2x the resources of other active pools. Setting a high weight such as 1000 also makes it possible to implement priority between pools: in essence, the weight-1000 pool will always get to launch tasks first whenever it has jobs active.
  • minShare: Apart from an overall weight, each pool can be given a minimum share (as a number of CPU cores) that the administrator would like it to have. The fair scheduler always attempts to meet all active pools’ minimum shares before redistributing extra resources according to the weights. The minShare property can therefore be another way to ensure that a pool can always get up to a certain number of resources (e.g. 10 cores) quickly without giving it a high priority for the rest of the cluster. By default, each pool’s minShare is 0.

The pool properties can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and setting a spark.scheduler.allocation.file property in your SparkConf:

conf.set("spark.scheduler.allocation.file", "/path/to/file")

The format of the XML file is simply a <pool> element for each pool, with different elements within it for the various settings. For example:

<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>

A full example is also available in conf/fairscheduler.xml.template. Note that any pools not configured in the XML file will simply get default values for all settings (scheduling mode FIFO, weight 1, and minShare 0).



### This post is based on the Spark documentation on job scheduling.


