The Source-Level Implementation of Master High Availability Based on ZooKeeper

If Spark is deployed in Standalone mode, a typical Master/Slaves architecture, then the Master is a SPOF (Single Point of Failure). Spark can use ZooKeeper to implement HA.

ZooKeeper provides a Leader Election mechanism. With it we can guarantee that, although multiple Masters exist in the cluster, only one of them is Active and the rest are Standby; when the Active Master fails, one of the Standby Masters is elected to take over. Because the cluster's state, including the information about Workers, Drivers and Applications, has already been persisted to the file system, the failover only affects the submission of new Jobs; Jobs that are already running are not affected at all. The overall architecture of a cluster with ZooKeeper added is shown in the figure below.

1. Master Restart Strategies

When the Master starts, it chooses among several failure-recovery strategies based on its startup parameters:

1. ZOOKEEPER: implements HA via ZooKeeper;

2. FILESYSTEM: the Master restarts without losing data; the cluster's runtime state is saved to a local or network file system;

3. discard all previous data and restart from scratch.

The implementation of these three strategies can be seen in Master::preStart():

override def preStart() {
    logInfo("Starting Spark master at " + masterUrl)
    ...
    // persistenceEngine persists the Worker, Driver and Application information,
    // so that restarting the Master does not affect Jobs that have already been submitted
    persistenceEngine = RECOVERY_MODE match {
      case "ZOOKEEPER" =>
        logInfo("Persisting recovery state to ZooKeeper")
        new ZooKeeperPersistenceEngine(SerializationExtension(context.system), conf)
      case "FILESYSTEM" =>
        logInfo("Persisting recovery state to directory: " + RECOVERY_DIR)
        new FileSystemPersistenceEngine(RECOVERY_DIR, SerializationExtension(context.system))
      case _ =>
        new BlackHolePersistenceEngine()
    }
    // leaderElectionAgent is responsible for electing the Leader.
    leaderElectionAgent = RECOVERY_MODE match {
        case "ZOOKEEPER" =>
          context.actorOf(Props(classOf[ZooKeeperLeaderElectionAgent], self, masterUrl, conf))
        case _ => // in a cluster with only one Master, the current Master is the Active one
          context.actorOf(Props(classOf[MonarchyLeaderAgent], self))
      }
  }

RECOVERY_MODE is a string, and it can be set in spark-env.sh:

val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")  

If spark.deploy.recoveryMode is not set, all of the cluster's runtime data is lost when the Master restarts. This conclusion follows from the implementation of BlackHolePersistenceEngine:

private[spark] class BlackHolePersistenceEngine extends PersistenceEngine {  
  override def addApplication(app: ApplicationInfo) {}  
  override def removeApplication(app: ApplicationInfo) {}  
  override def addWorker(worker: WorkerInfo) {}  
  override def removeWorker(worker: WorkerInfo) {}  
  override def addDriver(driver: DriverInfo) {}  
  override def removeDriver(driver: DriverInfo) {}  
  
  override def readPersistedData() = (Nil, Nil, Nil)  
}  

It implements every method of the interface as a no-op, so nothing is ever persisted. PersistenceEngine itself is a trait.
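For reference, the trait looks roughly like the sketch below, reconstructed from the methods overridden above (treat the exact signatures as approximate; the Spark source is authoritative):

private[spark] trait PersistenceEngine {
  def addApplication(app: ApplicationInfo)
  def removeApplication(app: ApplicationInfo)
  def addWorker(worker: WorkerInfo)
  def removeWorker(worker: WorkerInfo)
  def addDriver(driver: DriverInfo)
  def removeDriver(driver: DriverInfo)
  // Returns the persisted (applications, drivers, workers) on recovery.
  def readPersistedData(): (Seq[ApplicationInfo], Seq[DriverInfo], Seq[WorkerInfo])
}

For comparison, here is the ZooKeeper-backed implementation: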

class ZooKeeperPersistenceEngine(serialization: Serialization, conf: SparkConf)  
  extends PersistenceEngine  
  with Logging  
{  
  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/master_status"  
  val zk: CuratorFramework = SparkCuratorUtil.newClient(conf)  
  
  SparkCuratorUtil.mkdir(zk, WORKING_DIR)  
  // serialize the app information into the znode WORKING_DIR/app_{app.id}
  override def addApplication(app: ApplicationInfo) {  
    serializeIntoFile(WORKING_DIR + "/app_" + app.id, app)  
  }  
  
  override def removeApplication(app: ApplicationInfo) {  
    zk.delete().forPath(WORKING_DIR + "/app_" + app.id)  
  }  
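The listing above is truncated; the private serializeIntoFile helper it calls is sketched below, paraphrased from the Spark 1.x source (CreateMode is org.apache.zookeeper.CreateMode), so treat the details as approximate:

  private def serializeIntoFile(path: String, value: AnyRef) {
    // Pick an Akka serializer for the object and turn it into bytes.
    val serializer = serialization.findSerializerFor(value)
    val serialized = serializer.toBinary(value)
    // Write the bytes into a persistent znode at the given path.
    zk.create().withMode(CreateMode.PERSISTENT).forPath(path, serialized)
  }
  ...
}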

Spark does not use the ZooKeeper API directly; instead it uses org.apache.curator.framework.CuratorFramework and org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}. Curator provides a very friendly wrapper layer on top of ZooKeeper.

2. Configuring the Cluster Startup Parameters

To summarize the configuration: from the code analysis above we know that, in order to use ZooKeeper, at least the following parameters should be set (in fact, these are the only ones that need to be set), via spark-env.sh:

spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181
spark.deploy.zookeeper.dir=/dir
// OR set them as follows
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER"
export SPARK_DAEMON_JAVA_OPTS="${SPARK_DAEMON_JAVA_OPTS} -Dspark.deploy.zookeeper.url=zk_server_1:2181,zk_server_2:2181"

The meaning of each parameter:

1. spark.deploy.recoveryMode: the recovery mode; set it to ZOOKEEPER to enable HA (as the code above shows, the default is NONE);

2. spark.deploy.zookeeper.url: the address of the ZooKeeper cluster, e.g. zk_server_1:2181,zk_server_2:2181;

3. spark.deploy.zookeeper.dir: the directory in ZooKeeper under which recovery state is stored (defaults to /spark).

3. A Brief Introduction to CuratorFramework

CuratorFramework greatly simplifies the use of ZooKeeper. It provides high-level APIs and adds many features on top of ZooKeeper, including:

1. Automatic connection management: a Client connected to ZooKeeper may get disconnected; Curator handles this situation, and reconnection is transparent to the Client.

2. A cleaner API: it simplifies the raw ZooKeeper methods and events, and provides a simple, easy-to-use interface.

3. Implementations of common recipes (see the Curator Recipes documentation for more):

1) Leader election

2) Shared locks

3) Caches and watches

4) Distributed queues

5) Distributed priority queues

CuratorFramework creates thread-safe ZooKeeper client instances through CuratorFrameworkFactory.

CuratorFrameworkFactory.newClient() offers a simple way to create a ZooKeeper client instance; different parameters can be passed in for full control over the instance. After obtaining an instance you must call start() to start it, and call close() when you are done.

/**
 * Create a new client
 *
 * @param connectString list of servers to connect to
 * @param sessionTimeoutMs session timeout
 * @param connectionTimeoutMs connection timeout
 * @param retryPolicy retry policy to use
 * @return client
 */
public static CuratorFramework newClient(String connectString, int sessionTimeoutMs, int connectionTimeoutMs, RetryPolicy retryPolicy)
{
    return builder().
        connectString(connectString).
        sessionTimeoutMs(sessionTimeoutMs).
        connectionTimeoutMs(connectionTimeoutMs).
        retryPolicy(retryPolicy).
        build();
}
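As a quick illustration, here is a minimal Scala sketch of the create/start/use/close lifecycle (the connect string and the /spark znode are placeholder values, not taken from the Spark source):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Retry policy: start with a 1s sleep, retry at most 3 times.
val retryPolicy = new ExponentialBackoffRetry(1000, 3)
val client = CuratorFrameworkFactory.newClient(
  "zk_server_1:2181,zk_server_2:2181", // connectString
  60000,                               // sessionTimeoutMs
  15000,                               // connectionTimeoutMs
  retryPolicy)
client.start()                         // must be called before any operation
try {
  // Create a znode if it does not exist yet, similar to SparkCuratorUtil.mkdir.
  if (client.checkExists().forPath("/spark") == null) {
    client.create().creatingParentsIfNeeded().forPath("/spark")
  }
} finally {
  client.close()                       // release the connection when done
}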

Two recipes deserve special attention: org.apache.curator.framework.recipes.leader.{LeaderLatchListener, LeaderLatch}.
First, look at LeaderLatchListener. It is notified whenever the state of the LeaderLatch changes:

1. when the node is elected Leader, the callback isLeader() is invoked;

2. when the node's leadership is revoked, the callback notLeader() is invoked.

Since the notifications are asynchronous, the state may no longer be accurate by the time a callback is invoked; you need to double-check whether LeaderLatch's hasLeadership() really is true/false. We will see this reflected in Spark's implementation below.

/**
 * LeaderLatchListener can be used to be notified asynchronously about when the state of the LeaderLatch has changed.
 *
 * Note that just because you are in the middle of one of these method calls, it does not necessarily mean that
 * hasLeadership() is the corresponding true/false value. It is possible for the state to change behind the scenes
 * before these methods get called. The contract is that if that happens, you should see another call to the other
 * method pretty quickly.
 */
public interface LeaderLatchListener
{
  /**
   * This is called when the LeaderLatch's state goes from hasLeadership = false to hasLeadership = true.
   *
   * Note that it is possible that by the time this method call happens, hasLeadership has fallen back to false. If
   * this occurs, you can expect {@link #notLeader()} to also be called.
   */
  public void isLeader();

  /**
   * This is called when the LeaderLatch's state goes from hasLeadership = true to hasLeadership = false.
   *
   * Note that it is possible that by the time this method call happens, hasLeadership has become true. If
   * this occurs, you can expect {@link #isLeader()} to also be called.
   */
  public void notLeader();
}

LeaderLatch is responsible for electing a single Leader among the many contenders connected to the ZooKeeper cluster. The election mechanism itself lives in ZooKeeper; LeaderLatch simply provides a clean wrapper around it. We only need to know that, after creating an instance, listeners are registered through addListener:

public class LeaderLatch implements Closeable  
{  
    private final Logger log = LoggerFactory.getLogger(getClass());  
    private final CuratorFramework client;  
    private final String latchPath;  
    private final String id;  
    private final AtomicReference<State> state = new AtomicReference<State>(State.LATENT);  
    private final AtomicBoolean hasLeadership = new AtomicBoolean(false);  
    private final AtomicReference<String> ourPath = new AtomicReference<String>();  
    private final ListenerContainer<LeaderLatchListener> listeners = new ListenerContainer<LeaderLatchListener>();  
    private final CloseMode closeMode;  
    private final AtomicReference<Future<?>> startTask = new AtomicReference<Future<?>>();  
    ...
    /** 
     * Attaches a listener to this LeaderLatch 
     * <p/> 
     * Attaching the same listener multiple times is a noop from the second time on. 
     * <p/> 
     * All methods for the listener are run using the provided Executor.  It is common to pass in a single-threaded 
     * executor so that you can be certain that listener methods are called in sequence, but if you are fine with 
     * them being called out of order you are welcome to use multiple threads. 
     * 
     * @param listener the listener to attach 
     */  
    public void addListener(LeaderLatchListener listener)  
    {  
        listeners.addListener(listener);  
    }  

Through addListener we can attach our own Listener to the LeaderLatch; in the Listener we simply implement, in those two callbacks, the logic for being elected Leader and for having the Leader role revoked.
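Putting the pieces together, a minimal self-contained sketch of the LeaderLatch lifecycle might look like this (the class name, latch path, and println bodies are illustrative only, not from the Spark source):

import java.io.Closeable
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}

class SimpleLeaderWatcher(client: CuratorFramework) extends LeaderLatchListener with Closeable {
  private val latch = new LeaderLatch(client, "/example/leader_election")

  def start(): Unit = {
    latch.addListener(this) // register before start() so no notification is missed
    latch.start()           // join the election
  }

  // Called when hasLeadership goes false -> true; re-check, since it is asynchronous.
  override def isLeader(): Unit = {
    if (latch.hasLeadership) println("elected leader")
  }

  // Called when hasLeadership goes true -> false; re-check here as well.
  override def notLeader(): Unit = {
    if (!latch.hasLeadership) println("leadership revoked")
  }

  override def close(): Unit = latch.close() // leave the election
}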

4. The Implementation of ZooKeeperLeaderElectionAgent

In fact, thanks to Curator, implementing Master HA becomes very simple for Spark. ZooKeeperLeaderElectionAgent implements the LeaderLatchListener interface. Once isLeader() confirms that its Master has been elected Leader, it sends the Master an ElectedLeader message, and the Master changes its state to ALIVE. When notLeader() is called, it sends the Master a RevokedLeadership message, and the Master shuts itself down.

private[spark] class ZooKeeperLeaderElectionAgent(val masterActor: ActorRef,
    masterUrl: String, conf: SparkConf)
  extends LeaderElectionAgent with LeaderLatchListener with Logging  {
  val WORKING_DIR = conf.get("spark.deploy.zookeeper.dir", "/spark") + "/leader_election"
  // zk is the ZooKeeper client instance created via CuratorFrameworkFactory
  private var zk: CuratorFramework = _
  // leaderLatch: the Curator recipe responsible for electing the Leader
  private var leaderLatch: LeaderLatch = _
  private var status = LeadershipStatus.NOT_LEADER

  override def preStart() {
    logInfo("Starting ZooKeeper LeaderElection agent")
    zk = SparkCuratorUtil.newClient(conf)
    leaderLatch = new LeaderLatch(zk, WORKING_DIR)
    leaderLatch.addListener(this)

    leaderLatch.start()
  }

In preStart, the leaderLatch is started to take part in the Leader election in ZooKeeper. As analyzed in the previous section, the main logic lives in isLeader and notLeader:

override def isLeader() {  
  synchronized {  
    // Could have lost leadership by now; see Curator's implementation for details.
    if (!leaderLatch.hasLeadership) {  
      return  
    }  
  
    logInfo("We have gained leadership")  
    updateLeadershipStatus(true)  
  }  
}  
  
override def notLeader() {  
  synchronized {  
    // Could have gained leadership back by now; see Curator's implementation for details.
    if (leaderLatch.hasLeadership) {  
      return  
    }  
  
    logInfo("We have lost leadership")  
    updateLeadershipStatus(false)  
  }  
}  

The logic of updateLeadershipStatus is simple: it just sends the corresponding message to the Master.

def updateLeadershipStatus(isLeader: Boolean) {  
    if (isLeader && status == LeadershipStatus.NOT_LEADER) {  
      status = LeadershipStatus.LEADER  
      masterActor ! ElectedLeader  
    } else if (!isLeader && status == LeadershipStatus.LEADER) {  
      status = LeadershipStatus.NOT_LEADER  
      masterActor ! RevokedLeadership  
    }  
  }  
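On the Master side, the handling of these two messages is roughly as follows (an abridged paraphrase of Master.receive in Spark 1.x; treat it as a sketch rather than the exact source):

case ElectedLeader => {
  // Read back the persisted Applications, Drivers and Workers to decide the state.
  val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData()
  state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
    RecoveryState.ALIVE       // nothing to recover, become Active immediately
  } else {
    RecoveryState.RECOVERING  // re-register the persisted entities first
  }
  ...
}
case RevokedLeadership => {
  logError("Leadership has been revoked -- master shutting down.")
  System.exit(0)
}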

5. Design Philosophy

To eliminate the Master SPOF in Standalone mode, Spark relies on the election facilities provided by ZooKeeper. Spark does not use ZooKeeper's native Java API, but Curator, a framework that wraps ZooKeeper. With Curator, Spark does not have to manage the connection to ZooKeeper; all of that is transparent to Spark. With barely a hundred lines of code, Spark implements Master HA. Of course, Spark is standing on the shoulders of giants here, and who would want to reinvent the wheel?
