The Source-Level Implementation of Master High Availability Based on ZooKeeper

If Spark is deployed in Standalone mode, a typical Master/Slaves architecture, then the Master is a SPOF (Single Point of Failure). Spark can use ZooKeeper to implement HA.

ZooKeeper provides a Leader Election mechanism. With it we can guarantee that, although multiple Masters exist in the cluster, only one of them is Active and the rest are Standby; when the Active Master fails, one of the Standby Masters is elected to take over. Because the cluster's state, including the information about Workers, Drivers and Applications, has already been persisted to the file system, the failover only affects the submission of new Jobs; Jobs that are already running are not affected at all. The overall architecture of a cluster with ZooKeeper added is shown in the figure below.

1. Master Restart Strategies

When the Master starts, it chooses among several failure-recovery strategies based on its startup parameters:

1. ZOOKEEPER: implements HA via ZooKeeper;

2. FILESYSTEM: the Master restarts without losing data; the cluster's runtime state is saved to a local or network file system;

3. discard all previous data and restart from scratch.

The implementation of these three strategies can be seen in Master::preStart():

override def preStart() {
    logInfo("Starting Spark master at " + masterUrl)
    ...
    // persistenceEngine persists the Worker, Driver and Application information,
    // so that restarting the Master does not affect Jobs that have already been submitted
    persistenceEngine = RECOVERY_MODE match {
      case "ZOOKEEPER" =>
        logInfo("Persisting recovery state to ZooKeeper")
        new ZooKeeperPersistenceEngine(SerializationExtension(context.system), conf)
      case "FILESYSTEM" =>
        logInfo("Persisting recovery state to directory: " + RECOVERY_DIR)
        new FileSystemPersistenceEngine(RECOVERY_DIR, SerializationExtension(context.system))
      case _ =>
        new BlackHolePersistenceEngine()
    }
    // leaderElectionAgent is responsible for electing the Leader.
    leaderElectionAgent = RECOVERY_MODE match {
        case "ZOOKEEPER" =>
          context.actorOf(Props(classOf[ZooKeeperLeaderElectionAgent], self, masterUrl, conf))
        case _ => // in a cluster with only one Master, the current Master is the Active one
          context.actorOf(Props(classOf[MonarchyLeaderAgent], self))
      }
  }

RECOVERY_MODE is a string, and it can be set in spark-env.sh:

val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")  

If spark.deploy.recoveryMode is not set, all of the cluster's runtime data is lost when the Master restarts. This conclusion follows from the implementation of BlackHolePersistenceEngine:

private[spark] class BlackHolePersistenceEngine extends PersistenceEngine {  
  override def addApplication(app: ApplicationInfo) {}  
  override def removeApplication(app: ApplicationInfo) {}  
  override def addWorker(worker: WorkerInfo) {}  
  override def removeWorker(worker: WorkerInfo) {}  
  override def addDriver(driver: DriverInfo) {}  
  override def removeDriver(driver: DriverInfo) {}  
  
  override def readPersistedData() = (Nil, Nil, Nil)  
}  

It implements every method of the interface as a no-op, so nothing is ever persisted. PersistenceEngine itself is a trait.
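For reference, the trait looks roughly like the sketch below, reconstructed from the methods overridden above (treat the exact signatures as approximate; the Spark source is authoritative):

private[spark] trait PersistenceEngine {
  def addApplication(app: ApplicationInfo)
  def removeApplication(app: ApplicationInfo)
  def addWorker(worker: WorkerInfo)
  def removeWorker(worker: WorkerInfo)
  def addDriver(driver: DriverInfo)
  def removeDriver(driver: DriverInfo)
  // Returns the persisted (applications, drivers, workers) on recovery.
  def readPersistedData(): (Seq[ApplicationInfo], Seq[DriverInfo], Seq[WorkerInfo])
}

For comparison, here is the ZooKeeper-backed implementation: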

class ZooKeeperPersistenceEngine(serialization: Serialization, conf: SparkConf)  
  extends PersistenceEngine  
  with Logging  
{  
  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/master_status"  
  val zk: CuratorFramework = SparkCuratorUtil.newClient(conf)  
  
  SparkCuratorUtil.mkdir(zk, WORKING_DIR)  
  // serialize the app information into the znode WORKING_DIR/app_{app.id}
  override def addApplication(app: ApplicationInfo) {  
    serializeIntoFile(WORKING_DIR + "/app_" + app.id, app)  
  }  
  
  override def removeApplication(app: ApplicationInfo) {  
    zk.delete().forPath(WORKING_DIR + "/app_" + app.id)  
  }  
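The listing above is truncated; the private serializeIntoFile helper it calls is sketched below, paraphrased from the Spark 1.x source (CreateMode is org.apache.zookeeper.CreateMode), so treat the details as approximate:

  private def serializeIntoFile(path: String, value: AnyRef) {
    // Pick an Akka serializer for the object and turn it into bytes.
    val serializer = serialization.findSerializerFor(value)
    val serialized = serializer.toBinary(value)
    // Write the bytes into a persistent znode at the given path.
    zk.create().withMode(CreateMode.PERSISTENT).forPath(path, serialized)
  }
  ...
}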

Spark does not use the ZooKeeper API directly; instead it uses org.apache.curator.framework.CuratorFramework and org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}. Curator provides a very friendly wrapper layer on top of ZooKeeper.

2. Configuring the Cluster Startup Parameters

To summarize the configuration: from the code analysis above we know that, in order to use ZooKeeper, at least the following parameters should be set (in fact, these are the only ones that need to be set), via spark-env.sh:

spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181
spark.deploy.zookeeper.dir=/dir
// OR set them as follows
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER"
export SPARK_DAEMON_JAVA_OPTS="${SPARK_DAEMON_JAVA_OPTS} -Dspark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181"

The meaning of each parameter:

1. spark.deploy.recoveryMode: the recovery mode; set it to ZOOKEEPER to enable HA (as the code above shows, the default is NONE);

2. spark.deploy.zookeeper.url: the address of the ZooKeeper cluster, e.g. zk_server_1:2181,zk_server_2:2181;

3. spark.deploy.zookeeper.dir: the directory in ZooKeeper under which recovery state is stored (defaults to /spark).

3. A Brief Introduction to CuratorFramework

CuratorFramework greatly simplifies the use of ZooKeeper. It provides high-level APIs and adds many features on top of ZooKeeper, including:

1. Automatic connection management: a Client connected to ZooKeeper may get disconnected; Curator handles this situation, and reconnection is transparent to the Client.

2. A cleaner API: it simplifies the raw ZooKeeper methods and events, and provides a simple, easy-to-use interface.

3. Implementations of common recipes (see the Curator Recipes documentation for more):

1) Leader election

2) Shared locks

3) Caches and watches

4) Distributed queues

5) Distributed priority queues

CuratorFramework creates thread-safe ZooKeeper client instances through CuratorFrameworkFactory.

CuratorFrameworkFactory.newClient() offers a simple way to create a ZooKeeper client instance; different parameters can be passed in for full control over the instance. After obtaining an instance you must call start() to start it, and call close() when you are done.

/**
 * Create a new client
 *
 * @param connectString list of servers to connect to
 * @param sessionTimeoutMs session timeout
 * @param connectionTimeoutMs connection timeout
 * @param retryPolicy retry policy to use
 * @return client
 */
public static CuratorFramework newClient(String connectString, int sessionTimeoutMs, int connectionTimeoutMs, RetryPolicy retryPolicy)
{
    return builder().
        connectString(connectString).
        sessionTimeoutMs(sessionTimeoutMs).
        connectionTimeoutMs(connectionTimeoutMs).
        retryPolicy(retryPolicy).
        build();
}
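As a quick illustration, here is a minimal Scala sketch of the create/start/use/close lifecycle (the connect string and the /spark znode are placeholder values, not taken from the Spark source):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Retry policy: start with a 1s sleep, retry at most 3 times.
val retryPolicy = new ExponentialBackoffRetry(1000, 3)
val client = CuratorFrameworkFactory.newClient(
  "zk_server_1:2181,zk_server_2:2181", // connectString
  60000,                               // sessionTimeoutMs
  15000,                               // connectionTimeoutMs
  retryPolicy)
client.start()                         // must be called before any operation
try {
  // Create a znode if it does not exist yet, similar to SparkCuratorUtil.mkdir.
  if (client.checkExists().forPath("/spark") == null) {
    client.create().creatingParentsIfNeeded().forPath("/spark")
  }
} finally {
  client.close()                       // release the connection when done
}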

Two recipes deserve special attention: org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}.
First, look at LeaderLatchListener. It is notified whenever the state of the LeaderLatch changes:

1. when the node is elected Leader, the callback isLeader() is invoked;

2. when the node's leadership is revoked, the callback notLeader() is invoked.

Since the notifications are asynchronous, the state may no longer be accurate by the time a callback is invoked; you need to double-check whether LeaderLatch's hasLeadership() really is true/false. We will see this reflected in Spark's implementation below.

/**
 * LeaderLatchListener can be used to be notified asynchronously about when the state of the LeaderLatch has changed.
 *
 * Note that just because you are in the middle of one of these method calls, it does not necessarily mean that
 * hasLeadership() is the corresponding true/false value. It is possible for the state to change behind the scenes
 * before these methods get called. The contract is that if that happens, you should see another call to the other
 * method pretty quickly.
 */
public interface LeaderLatchListener
{
  /**
   * This is called when the LeaderLatch's state goes from hasLeadership = false to hasLeadership = true.
   *
   * Note that it is possible that by the time this method call happens, hasLeadership has fallen back to false. If
   * this occurs, you can expect {@link #notLeader()} to also be called.
   */
  public void isLeader();

  /**
   * This is called when the LeaderLatch's state goes from hasLeadership = true to hasLeadership = false.
   *
   * Note that it is possible that by the time this method call happens, hasLeadership has become true. If
   * this occurs, you can expect {@link #isLeader()} to also be called.
   */
  public void notLeader();
}

LeaderLatch is responsible for electing a single Leader among the many contenders connected to the ZooKeeper cluster. The election mechanism itself lives in ZooKeeper; LeaderLatch simply provides a clean wrapper around it. We only need to know that, after creating an instance, listeners are registered through addListener:

public class LeaderLatch implements Closeable  
{  
    private final Logger log = LoggerFactory.getLogger(getClass());  
    private final CuratorFramework client;  
    private final String latchPath;  
    private final String id;  
    private final AtomicReference<State> state = new AtomicReference<State>(State.LATENT);  
    private final AtomicBoolean hasLeadership = new AtomicBoolean(false);  
    private final AtomicReference<String> ourPath = new AtomicReference<String>();  
    private final ListenerContainer<LeaderLatchListener> listeners = new ListenerContainer<LeaderLatchListener>();  
    private final CloseMode closeMode;  
    private final AtomicReference<Future<?>> startTask = new AtomicReference<Future<?>>();  
    ...
    /** 
     * Attaches a listener to this LeaderLatch 
     * <p/> 
     * Attaching the same listener multiple times is a noop from the second time on. 
     * <p/> 
     * All methods for the listener are run using the provided Executor.  It is common to pass in a single-threaded 
     * executor so that you can be certain that listener methods are called in sequence, but if you are fine with 
     * them being called out of order you are welcome to use multiple threads. 
     * 
     * @param listener the listener to attach 
     */  
    public void addListener(LeaderLatchListener listener)  
    {  
        listeners.addListener(listener);  
    }  

Through addListener we can attach our own Listener to the LeaderLatch; in the Listener we simply implement, in those two callbacks, the logic for being elected Leader and for having the Leader role revoked.
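Putting the pieces together, a minimal self-contained sketch of the LeaderLatch lifecycle might look like this (the class name, latch path, and println bodies are illustrative only, not from the Spark source):

import java.io.Closeable
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}

class SimpleLeaderWatcher(client: CuratorFramework) extends LeaderLatchListener with Closeable {
  private val latch = new LeaderLatch(client, "/example/leader_election")

  def start(): Unit = {
    latch.addListener(this) // register before start() so no notification is missed
    latch.start()           // join the election
  }

  // Called when hasLeadership goes false -> true; re-check, since it is asynchronous.
  override def isLeader(): Unit = {
    if (latch.hasLeadership) println("elected leader")
  }

  // Called when hasLeadership goes true -> false; re-check here as well.
  override def notLeader(): Unit = {
    if (!latch.hasLeadership) println("leadership revoked")
  }

  override def close(): Unit = latch.close() // leave the election
}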

4. The Implementation of ZooKeeperLeaderElectionAgent

In fact, thanks to Curator, implementing Master HA becomes very simple for Spark. ZooKeeperLeaderElectionAgent implements the LeaderLatchListener interface. Once isLeader() confirms that its Master has been elected Leader, it sends the Master an ElectedLeader message, and the Master changes its state to ALIVE. When notLeader() is called, it sends the Master a RevokedLeadership message, and the Master shuts itself down.

private[spark] class ZooKeeperLeaderElectionAgent(val masterActor: ActorRef,
    masterUrl: String, conf: SparkConf)
  extends LeaderElectionAgent with LeaderLatchListener with Logging  {
  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/leader_election"
  // zk is the ZooKeeper client instance created via CuratorFrameworkFactory
  private var zk: CuratorFramework = _
  // leaderLatch: the Curator recipe responsible for electing the Leader
  private var leaderLatch: LeaderLatch = _
  private var status = LeadershipStatus.NOT_LEADER

  override def preStart() {
    logInfo("Starting ZooKeeper LeaderElection agent")
    zk = SparkCuratorUtil.newClient(conf)
    leaderLatch = new LeaderLatch(zk, WORKING_DIR)
    leaderLatch.addListener(this)

    leaderLatch.start()
  }

In preStart, the leaderLatch is started to take part in the Leader election in ZooKeeper. As analyzed in the previous section, the main logic lives in isLeader and notLeader:

override def isLeader() {  
  synchronized {  
    // Could have lost leadership by now; see Curator's implementation for details.
    if (!leaderLatch.hasLeadership) {  
      return  
    }  
  
    logInfo("We have gained leadership")  
    updateLeadershipStatus(true)  
  }  
}  
  
override def notLeader() {  
  synchronized {  
    // Could have gained leadership back by now; see Curator's implementation for details.
    if (leaderLatch.hasLeadership) {  
      return  
    }  
  
    logInfo("We have lost leadership")  
    updateLeadershipStatus(false)  
  }  
}  

The logic of updateLeadershipStatus is simple: it just sends the corresponding message to the Master.

def updateLeadershipStatus(isLeader: Boolean) {  
    if (isLeader && status == LeadershipStatus.NOT_LEADER) {  
      status = LeadershipStatus.LEADER  
      masterActor ! ElectedLeader  
    } else if (!isLeader && status == LeadershipStatus.LEADER) {  
      status = LeadershipStatus.NOT_LEADER  
      masterActor ! RevokedLeadership  
    }  
  }  
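On the Master side, the handling of these two messages is roughly as follows (an abridged paraphrase of Master.receive in Spark 1.x; treat it as a sketch rather than the exact source):

case ElectedLeader => {
  // Read back the persisted Applications, Drivers and Workers to decide the state.
  val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData()
  state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
    RecoveryState.ALIVE       // nothing to recover, become Active immediately
  } else {
    RecoveryState.RECOVERING  // re-register the persisted entities first
  }
  ...
}
case RevokedLeadership => {
  logError("Leadership has been revoked -- master shutting down.")
  System.exit(0)
}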

5. Design Philosophy

To eliminate the Master SPOF in Standalone mode, Spark relies on the election facilities provided by ZooKeeper. Spark does not use ZooKeeper's native Java API, but Curator, a framework that wraps ZooKeeper. With Curator, Spark does not have to manage the connection to ZooKeeper; all of that is transparent to Spark. With barely a hundred lines of code, Spark implements Master HA. Of course, Spark is standing on the shoulders of giants here, and who would want to reinvent the wheel?
