ZooKeeper Source Code Analysis: Server Startup and Leader Election v1.0

Section 1: References

1. References
https://blog.csdn.net/zhangyuan19880606/article/details/51508294
https://blog.csdn.net/chinaCsdnV2/article/details/81049686
https://www.cnblogs.com/jxwch/p/6433310.html
http://godmoon.wicp.net/blog/index.php/post_421.html
https://github.com/apache/zookeeper
"From Paxos to ZooKeeper: Principles and Practice of Distributed Consistency" (從PAXOS到ZOOKEEPER分佈式一致性原理與實踐)
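2. Build notes:
The code debugged here was built from the 3.6.0 tag.
Use https for the ufpr.dl.sourceforge.net address.
Only the build.xml file needs to be changed.
Older versions build with Ant; newer versions all use Maven, which needs the following dependencies added: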
<dependency>
  <groupId>com.codahale.metrics</groupId>
  <artifactId>metrics-core</artifactId>
  <version>3.0.0</version>
</dependency>
<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.1.7</version>
</dependency>
3. Startup arguments
/Users/feivirus/Documents/project/eclipse/zookeeper/conf/zoo.cfg
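If the AdminServer's port 8080 is already in use, set admin.serverPort=8081.
You can copy the arguments from the command line of another running ZooKeeper server process into the IDEA debug run configuration.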
4. Client connection command: ./zkCli.sh -server 127.0.0.1:2181
Command to check whether a node is the leader or a follower: ./zkServer.sh status /xxx/zoo.cfg
5. Questions:
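(1). How does the ZooKeeper voting process work? Which fields does a vote contain, and how are votes compared?
(2). How are ZooKeeper transactions implemented?
(3). How, and when, are the leader's transactional requests (creating a node, deleting a node, modifying data) synchronized to the followers?
(4). What is ZooKeeper's processing flow for client requests? Which request processors does it include?
(5). How is a ZooKeeper proposal (Proposal) related to a transaction?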


Section 2: Architecture

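I. Use cases
1. Data publish/subscribe
2. Load balancing
3. Naming service
4. Distributed coordination/notification
5. Cluster management
6. Master election
7. Distributed locks
8. Distributed queues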

II. CAP theorem
III. Paxos algorithm
IV. ZAB protocol

Section 4: Source code details

The entry point for standalone mode is org.apache.zookeeper.server.ZooKeeperServerMain.main(). The entry point for cluster mode is org.apache.zookeeper.server.quorum.QuorumPeerMain.main().

I. Server startup through the main method in cluster mode

The startup process is shown in the figure below:


(1). Enter QuorumPeerMain#main(), which calls QuorumPeerMain#initializeAndRun(). The code is as follows:

protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
    // Create the ZooKeeper config object; its fields hold the values of the settings in zoo.cfg
    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
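        /* Parse the process arguments: take the path of zoo.cfg and call QuorumPeerConfig#parseProperties()
           to load it into the QuorumPeerConfig fields, e.g. the client port, the data directory, the dataLog
           directory, the election algorithm (default 3), and the number of servers (3 in my debug setup).
           Analyzed in item 1 below. */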
        config.parse(args[0]);
    }

    // Start and schedule the purge task,
    // which periodically cleans up old transaction logs and data snapshot files
    DatadirCleanupManager purgeMgr = new DatadirCleanupManager(
        config.getDataDir(),
        config.getDataLogDir(),
        config.getSnapRetainCount(),
        config.getPurgeInterval());
    purgeMgr.start();

    if (args.length == 1 && config.isDistributed()) {
        // Start ZooKeeper; analyzed in item 2 below
        runFromConfig(config);
    } else {
        LOG.warn("Either no config or no quorum defined in config, running in standalone mode");
        // there is only server in the quorum -- run as standalone
        ZooKeeperServerMain.main(args);
    }
}

1. Parse zoo.cfg and read the configuration

public void parseProperties(Properties zkProp) throws IOException, ConfigException {
   	...
    VerifyingFileFactory vff = new VerifyingFileFactory.Builder(LOG).warnForRelativePath().build();
    // Iterate over the entries in zoo.cfg
    for (Entry<Object, Object> entry : zkProp.entrySet()) {
        String key = entry.getKey().toString().trim();
        String value = entry.getValue().toString().trim();

        if (key.equals("dataDir")) {
            // Location of the data directory
            dataDir = vff.create(value);
        } else if (key.equals("dataLogDir")) {
            // Location of the dataLog directory
            dataLogDir = vff.create(value);
        } else if (key.equals("clientPort")) {
            // Client connection port
            clientPort = Integer.parseInt(value);
        // ...
        } else if (key.equals("clientPortAddress")) {
            clientPortAddress = value.trim();
        // ...
        } else if (key.equals("tickTime")) {
            // Length of one tick (the heartbeat interval) in milliseconds
            tickTime = Integer.parseInt(value);
        } else if (key.equals("maxClientCnxns")) {
            // Maximum number of concurrent connections from a single client IP
            maxClientCnxns = Integer.parseInt(value);
        } else if (key.equals("minSessionTimeout")) {
            minSessionTimeout = Integer.parseInt(value);
        } else if (key.equals("maxSessionTimeout")) {
            // Maximum session timeout
            maxSessionTimeout = Integer.parseInt(value);
        } else if (key.equals("initLimit")) {
            initLimit = Integer.parseInt(value);
        } else if (key.equals("syncLimit")) {
            // Number of ticks a follower is allowed to take to sync with the leader
            syncLimit = Integer.parseInt(value);
        // ...
        } else if (key.equals("electionAlg")) {
            // Election algorithm; the field defaults to 3, and 3 is the only accepted value
            electionAlg = Integer.parseInt(value);
            if (electionAlg != 3) {
                throw new ConfigException("Invalid electionAlg value. Only 3 is supported.");
            }
        // ...
        } else if (key.equals("peerType")) {
            if (value.toLowerCase().equals("observer")) {
                peerType = LearnerType.OBSERVER;
            } else if (value.toLowerCase().equals("participant")) {
                peerType = LearnerType.PARTICIPANT;
            } else {
                throw new ConfigException("Unrecognised peertype: " + value);
            }
        // ...
        } else if (key.equals("standaloneEnabled")) {
            if (value.toLowerCase().equals("true")) {
                setStandaloneEnabled(true);
            } else if (value.toLowerCase().equals("false")) {
                setStandaloneEnabled(false);
            } else {
                throw new ConfigException("Invalid option "
                                          + value
                                          + " for standalone mode. Choose 'true' or 'false.'");
            }
        // ...
        } else if (key.equals("quorum.cnxn.threads.size")) {
            quorumCnxnThreadsSize = Integer.parseInt(value);
        // ...
        } else {
            System.setProperty("zookeeper." + key, value);
        }
    }
   	...
    ...
    try {
        // Check that the metrics provider class can be loaded (it is not initialized here)
        Class.forName(metricsProviderClassName, false, Thread.currentThread().getContextClassLoader());
    } catch (ClassNotFoundException error) {
        throw new IllegalArgumentException("metrics provider class was not found", error);
    }

    // backward compatibility - dynamic configuration in the same file as
    // static configuration params see writeDynamicConfig()
    if (dynamicConfigFileStr == null) {
        /* This reads the myid file and sets serverId. */
        setupQuorumPeerConfig(zkProp, true);
        if (isDistributed() && isReconfigEnabled()) {
            // we don't backup static config for standalone mode.
            // we also don't backup if reconfig feature is disabled.
            backupOldConfig();
        }
    }
}
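
The key-by-key dispatch above can be tried in isolation. The following is a minimal sketch, not the real QuorumPeerConfig: the class MiniConfigSketch, its fields, and the fromString helper are hypothetical names introduced only for illustration, and only four keys are handled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;

/** Hypothetical, simplified stand-in for QuorumPeerConfig (illustration only). */
final class MiniConfigSketch {
    String dataDir;
    int clientPort;
    int tickTime;
    int electionAlg = 3;

    /** Same shape as parseProperties(): one if/else chain keyed on the property name. */
    void parseProperties(Properties props) {
        for (Map.Entry<Object, Object> entry : props.entrySet()) {
            String key = entry.getKey().toString().trim();
            String value = entry.getValue().toString().trim();
            if (key.equals("dataDir")) {
                dataDir = value;
            } else if (key.equals("clientPort")) {
                clientPort = Integer.parseInt(value);
            } else if (key.equals("tickTime")) {
                tickTime = Integer.parseInt(value);
            } else if (key.equals("electionAlg")) {
                electionAlg = Integer.parseInt(value);
                if (electionAlg != 3) {
                    throw new IllegalArgumentException("Only electionAlg=3 is supported");
                }
            } else {
                // Unknown keys are exported as system properties, as in the excerpt above.
                System.setProperty("zookeeper." + key, value);
            }
        }
    }

    /** Loads zoo.cfg-style text with java.util.Properties and parses it. */
    static MiniConfigSketch fromString(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        MiniConfigSketch cfg = new MiniConfigSketch();
        cfg.parseProperties(props);
        return cfg;
    }

    public static void main(String[] args) {
        MiniConfigSketch cfg = fromString("tickTime=2000\ndataDir=/tmp/zk\nclientPort=2181\n");
        System.out.println(cfg.clientPort + " " + cfg.tickTime + " " + cfg.dataDir);
    }
}
```

Because zoo.cfg is an ordinary Java properties file, java.util.Properties does the tokenizing and the config class only has to convert and validate values.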

2. Start ZooKeeper.
Enter QuorumPeerMain#runFromConfig().

public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
    try {
        // Register the log4j MBeans for JMX monitoring
        ManagedUtil.registerLog4jMBeans();
    } catch (JMException e) {
        LOG.warn("Unable to register log4j JMX control", e);
    }

    MetricsProvider metricsProvider;
    try {
        // Start the configured metrics provider (DefaultMetricsProvider here) for performance monitoring
        metricsProvider = MetricsProviderBootstrap.startMetricsProvider(
            config.getMetricsProviderClassName(),
            config.getMetricsProviderConfiguration());
    } catch (MetricsProviderLifeCycleException error) {
        throw new IOException("Cannot boot MetricsProvider " + config.getMetricsProviderClassName(), error);
    }
    try {
        ServerMetrics.metricsProviderInitialized(metricsProvider);
        ServerCnxnFactory cnxnFactory = null;
        ServerCnxnFactory secureCnxnFactory = null;

        if (config.getClientPortAddress() != null) {
            // Using NIO: create the NIOServerCnxnFactory connection factory, which handles incoming connections on the client port
            cnxnFactory = ServerCnxnFactory.createFactory();
            // Configure the connection factory with the client address
            cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), config.getClientPortListenBacklog(), false);
        }
      	...
        // Create the QuorumPeer object, which drives the election logic
        quorumPeer = getQuorumPeer();
        // Attach the transaction log and snapshot manager, which handles saving and restoring the files under the data directories
        quorumPeer.setTxnFactory(new FileTxnSnapLog(config.getDataLogDir(), config.getDataDir()));
		...
        // Set the election algorithm, myid, timeouts, and other values read from zoo.cfg
        quorumPeer.setElectionType(config.getElectionAlg());
        quorumPeer.setMyid(config.getServerId());
       	...
        // Create the in-memory database that stores ZooKeeper's tree-structured namespace; analyzed in item 3 below
        quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
        ...
        /* Initialize the configuration in the in-memory database: mainly add the /zookeeper/config node and write each server.x entry from zoo.cfg into it, like this:
			get  /zookeeper/config
			server.1=127.0.0.1:2888:3888:participant
			server.2=127.0.0.1:2889:3889:participant
			server.3=127.0.0.1:2890:3890:participant
			version=0
		*/
        quorumPeer.initConfigInZKDatabase();
        // Set the connection factory
        quorumPeer.setCnxnFactory(cnxnFactory);
        // The default server role is the enum value LearnerType#PARTICIPANT
        quorumPeer.setLearnerType(config.getPeerType());
       ...
        quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
        // Initialize the authentication server (username/password); in this setup the implementation is empty
        quorumPeer.initialize();

        if (config.jvmPauseMonitorToRun) {
            quorumPeer.setJvmPauseMonitor(new JvmPauseMonitor(config));
        }
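        // Enter QuorumPeer#start(): load the database and start the connection factory. This is the overridden start(), not the thread body; analyzed in item 4 below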
        quorumPeer.start();
        ZKAuditProvider.addZKStartStopAuditLog();
        quorumPeer.join();
    } catch (InterruptedException e) {
    ...
    }
}

3. ZooKeeper's in-memory tree-structured namespace
It is defined in the ZKDatabase class. The code is as follows:

public class ZKDatabase {
    // The data tree, rooted at the "/" path
    protected DataTree dataTree;
    protected ConcurrentHashMap<Long, Integer> sessionsWithTimeouts;
    // Transaction log and snapshot files
    protected FileTxnSnapLog snapLog;
    protected long minCommittedLog, maxCommittedLog;

    /**
     * Default value is to use snapshot if txnlog size exceeds 1/3 the size of snapshot
     */
    public static final String SNAPSHOT_SIZE_FACTOR = "zookeeper.snapshotSizeFactor";
    public static final double DEFAULT_SNAPSHOT_SIZE_FACTOR = 0.33;
    private double snapshotSizeFactor;

    public static final String COMMIT_LOG_COUNT = "zookeeper.commitLogCount";
    public static final int DEFAULT_COMMIT_LOG_COUNT = 500;
    public int commitLogCount;
    protected static int commitLogBuffer = 700;
    protected Queue<Proposal> committedLog = new ArrayDeque<>();
    protected ReentrantReadWriteLock logLock = new ReentrantReadWriteLock();

    /**
     * Number of txn since last snapshot;
     */
    private AtomicInteger txnCount = new AtomicInteger(0);
}

The tree itself is defined as follows:

public class DataTree {
    private final RateLogger RATE_LOGGER = new RateLogger(LOG, 15 * 60 * 1000);

    /**
     * This map provides a fast lookup to the datanodes. The tree is the
     * source of truth and is where all the locking occurs
     */
    // All nodes in the tree, keyed by full path
    private final NodeHashMap nodes;
    // Watches on data changes
    private IWatchManager dataWatches;
    // Watches on child changes
    private IWatchManager childWatches;
    ...
    /**
     * This hashtable lists the paths of the ephemeral nodes of a session.
     */
    // Paths of ephemeral nodes, keyed by the session ID that owns them
    private final Map<Long, HashSet<String>> ephemerals = new ConcurrentHashMap<Long, HashSet<String>>();
    // zxid (with digest) of the last processed transaction
    private volatile ZxidDigest lastProcessedZxidDigest;

}
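
The DataTree fields above suggest a flat map from full path to node, with each node tracking the names of its children, plus a per-session index of ephemeral paths. Below is a minimal sketch of that layout; MiniDataTreeSketch and all of its members are hypothetical names, and watches, ACLs, and zxid bookkeeping are left out.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical miniature of the DataTree layout (illustration only). */
final class MiniDataTreeSketch {
    static final class Node {
        byte[] data;
        final Set<String> children = new HashSet<>();

        Node(byte[] data) {
            this.data = data;
        }
    }

    // Full path -> node, giving constant-time lookup without walking the tree.
    private final Map<String, Node> nodes = new ConcurrentHashMap<>();
    // Session ID -> paths of the ephemeral nodes owned by that session.
    private final Map<Long, Set<String>> ephemerals = new ConcurrentHashMap<>();

    MiniDataTreeSketch() {
        nodes.put("/", new Node(new byte[0]));
    }

    /** Creates a node. In this sketch, sessionId == 0 means the node is persistent. */
    synchronized void create(String path, byte[] data, long sessionId) {
        int slash = path.lastIndexOf('/');
        String parentPath = slash == 0 ? "/" : path.substring(0, slash);
        String name = path.substring(slash + 1);
        Node parent = nodes.get(parentPath);
        if (parent == null) {
            throw new IllegalArgumentException("No parent node: " + parentPath);
        }
        if (nodes.containsKey(path)) {
            throw new IllegalArgumentException("Node exists: " + path);
        }
        nodes.put(path, new Node(data));
        parent.children.add(name);
        if (sessionId != 0) {
            ephemerals.computeIfAbsent(sessionId, id -> new HashSet<>()).add(path);
        }
    }

    /** Returns a copy of the child names of the given path. */
    synchronized Set<String> getChildren(String path) {
        Node node = nodes.get(path);
        if (node == null) {
            throw new IllegalArgumentException("No node: " + path);
        }
        return new HashSet<>(node.children);
    }

    /** Returns a copy of the ephemeral paths owned by the given session. */
    synchronized Set<String> ephemeralsOf(long sessionId) {
        return new HashSet<>(ephemerals.getOrDefault(sessionId, new HashSet<>()));
    }

    public static void main(String[] args) {
        MiniDataTreeSketch tree = new MiniDataTreeSketch();
        tree.create("/app", new byte[0], 0L);
        tree.create("/app/lock", new byte[0], 42L);
        System.out.println(tree.getChildren("/app") + " " + tree.ephemeralsOf(42L));
    }
}
```

Keeping the ephemeral index by session makes it cheap to find every node that must be deleted when a session closes.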

4. QuorumPeer#start() loads the database and starts the connection factory

public synchronized void start() {
    if (!getView().containsKey(myid)) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }

    // Restore from disk and load the database by calling FileTxnSnapLog#restore(); analyzed in item 6 below
    loadDataBase();
    // Call NIOServerCnxnFactory#start() to start the connection factory
    startServerCnxnFactory();
    try {
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    // Start the election: create a Vote for this server's own myid (zxid 0 and epoch 0 by default), then enter QuorumPeer#createElectionAlgorithm(); analyzed in item 7 below
    startLeaderElection();
    startJvmPauseMonitor();
    // Start the thread, which runs QuorumPeer#run(); see item 5 below
    super.start();
}

5. ZooKeeper's main election thread.
Enter the QuorumPeer#run() method.

public void run() {
    // Rename the thread to the form QuorumPeer[myid=xx]xxx
    updateThreadName();

    LOG.debug("Starting quorum peer");
    try {
        jmxQuorumBean = new QuorumBean(this);
        MBeanRegistry.getInstance().register(jmxQuorumBean, null);
        for (QuorumServer s : getView().values()) {
            // Iterate over all servers listed in zoo.cfg
            ZKMBeanInfo p;
            if (getId() == s.id) {
                // Register the local server
                p = jmxLocalPeerBean = new LocalPeerBean(this);
                try {
                    MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                } catch (Exception e) {
                    LOG.warn("Failed to register with JMX", e);
                    jmxLocalPeerBean = null;
                }
            } else {
                // Register a remote server
                RemotePeerBean rBean = new RemotePeerBean(this, s);
                try {
                    MBeanRegistry.getInstance().register(rBean, jmxQuorumBean);
                    jmxRemotePeerBean.put(s.id, rBean);
                } catch (Exception e) {
                    LOG.warn("Failed to register with JMX", e);
                }
            }
        }
    } catch (Exception e) {
      ...
    }

    try {
        /*
         * Main loop
         */
        // Loop for as long as the peer is running
        while (running) {
            // Get this server's current state
            switch (getPeerState()) {
            case LOOKING:
                // Looking for a leader; a server is in this state right after startup
                ServerMetrics.getMetrics().LOOKING_COUNT.add(1);

                if (Boolean.getBoolean("readonlymode.enabled")) {
                   ...
                } else {
                    try {
                        reconfigFlagClear();
                        if (shuttingDownLE) {
                            shuttingDownLE = false;
                            startLeaderElection();
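                        }
                        /* makeLEStrategy() returns the current election algorithm. lookForLeader() runs the
                           election and decides who the leader is: it enters FastLeaderElection#lookForLeader(),
                           which holds the ZAB election logic and is analyzed in item 11 below. When this call
                           returns, the election result is known. In my debug run, server 3 became the leader. */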
                        setCurrentVote(makeLEStrategy().lookForLeader());
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    }
                }
                break;
            case OBSERVING:
                try {
                    LOG.info("OBSERVING");
                    setObserver(makeObserver(logFactory));
                    observer.observeLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    observer.shutdown();
                    setObserver(null);
                    updateServerState();

                    // Add delay jitter before we switch to LOOKING
                    // state to reduce the load of ObserverMaster
                    if (isRunning()) {
                        Observer.waitForObserverElectionDelay();
                    }
                }
                break;
            case FOLLOWING:
                try {
                    LOG.info("FOLLOWING");
                    setFollower(makeFollower(logFactory));
                    follower.followLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    follower.shutdown();
                    setFollower(null);
                    updateServerState();
                }
                break;
            case LEADING:
                LOG.info("LEADING");
                try {
                    // After exchanging votes with server 1, server 3 was chosen as leader, so the loop enters this branch and server 3 sets itself up as leader
                    setLeader(makeLeader(logFactory));
                    // Server 3 starts leading; analyzed in item 10 below
                    leader.lead();
                    setLeader(null);
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    if (leader != null) {
                        leader.shutdown("Forcing shutdown");
                        setLeader(null);
                    }
                    updateServerState();
                }
                break;
            }
        }
    } finally {
        LOG.warn("QuorumPeer main thread exited");
        MBeanRegistry instance = MBeanRegistry.getInstance();
        instance.unregister(jmxQuorumBean);
        instance.unregister(jmxLocalPeerBean);

        for (RemotePeerBean remotePeerBean : jmxRemotePeerBean.values()) {
            instance.unregister(remotePeerBean);
        }

        jmxQuorumBean = null;
        jmxLocalPeerBean = null;
        jmxRemotePeerBean = null;
    }
}

6. Restore ZooKeeper data from files
Enter FileTxnSnapLog#restore().

public long restore(DataTree dt, Map<Long, Integer> sessions, PlayBackListener listener) throws IOException {
    long snapLoadingStartTime = Time.currentElapsedTime();
    // Load the most recent snapshot from the snapshot files under data/version-2
    long deserializeResult = snapLog.deserialize(dt, sessions);
    ServerMetrics.getMetrics().STARTUP_SNAP_LOAD_TIME.add(Time.currentElapsedTime() - snapLoadingStartTime);
    FileTxnLog txnLog = new FileTxnLog(dataDir);
    boolean trustEmptyDB;
    // The "initialize" marker file
    File initFile = new File(dataDir.getParent(), "initialize");
    if (Files.deleteIfExists(initFile.toPath())) {
        LOG.info("Initialize file found, an empty database will not block voting participation");
        trustEmptyDB = true;
    } else {
        trustEmptyDB = autoCreateDB;
    }
    ...
    // No snapshot was found
    if (-1L == deserializeResult) {
        /* this means that we couldn't find any snapshot, so we need to
         * initialize an empty database (reported in ZOOKEEPER-2325) */
        if (txnLog.getLastLoggedZxid() != -1) {
            // ZOOKEEPER-3056: provides an escape hatch for users upgrading
            // from old versions of zookeeper (3.4.x, pre 3.5.3).
            if (!trustEmptySnapshot) {
                throw new IOException(EMPTY_SNAPSHOT_WARNING + "Something is broken!");
            } else {
                LOG.warn("{}This should only be allowed during upgrading.", EMPTY_SNAPSHOT_WARNING);
                return finalizer.run();
            }
        }

        if (trustEmptyDB) {
            /* TODO: (br33d) we should either put a ConcurrentHashMap on restore()
             *       or use Map on save() */
            // Write a new, empty database snapshot
            save(dt, (ConcurrentHashMap<Long, Integer>) sessions, false);

            /* return a zxid of 0, since we know the database is empty */
            return 0L;
        } else {
           ...
            dt.lastProcessedZxid = -1L;
            return -1L;
        }
    }

    return finalizer.run();
}

7. Create the election threads
Enter QuorumPeer#createElectionAlgorithm(). The code is as follows:

protected Election createElectionAlgorithm(int electionAlgorithm) {
    Election le = null;

    //TODO: use a factory rather than a switch
    switch (electionAlgorithm) {
    case 1:
        throw new UnsupportedOperationException("Election Algorithm 1 is not supported.");
    case 2:
        throw new UnsupportedOperationException("Election Algorithm 2 is not supported.");
    case 3:
        // 3 is FastLeaderElection; the other two are no longer supported. Create the election connection manager
        QuorumCnxManager qcm = createCnxnManager();
        QuorumCnxManager oldQcm = qcmRef.getAndSet(qcm);
        if (oldQcm != null) {
            LOG.warn("Clobbering already-set QuorumCnxManager (restarting leader election?)");
            oldQcm.halt();
        }
        QuorumCnxManager.Listener listener = qcm.listener;
        if (listener != null) {
            // Start a new thread that listens for election messages from other servers
            listener.start();
            FastLeaderElection fle = new FastLeaderElection(this, qcm);
            // Start the worker threads; FastLeaderElection.Messenger.WorkerReceiver#run() handles votes from other servers. The election begins.
            fle.start();
            le = fle;
        } else {
            LOG.error("Null listener when initializing cnx manager");
        }
        break;
    default:
        assert false;
    }
    return le;
}

8. Handle votes from other servers
Enter the FastLeaderElection.Messenger.WorkerReceiver#run() method.

public void run() {

        Message response;
        while (!stop) {
            // Sleeps on receive
            try {
                // Take a message from the receive queue
                response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
                if (response == null) {
                    continue;
                }
                ...
                // Holder for the election-related information
                Notification n = new Notification();
                int rstate = response.buffer.getInt();
                // Leader proposed by the sending server
                long rleader = response.buffer.getLong();
                // Latest zxid reported in the sending server's vote
                long rzxid = response.buffer.getLong();
                // Election epoch of the sending server
                long relectionEpoch = response.buffer.getLong();
                long rpeerepoch;

                int version = 0x0;
                ...

                QuorumVerifier rqv = null;
                ...
                /*
                 * If it is from a non-voting server (such as an observer or
                 * a non-voting follower), respond right away.
                 */
                if (!validVoter(response.sid)) {
                    // The sender is not a voting member (for example an observer), so reply right away with our current vote
                    Vote current = self.getCurrentVote();
                    QuorumVerifier qv = self.getQuorumVerifier();
                    ToSend notmsg = new ToSend(
                        ToSend.mType.notification,
                        current.getId(),
                        current.getZxid(),
                        logicalclock.get(),
                        self.getPeerState(),
                        response.sid,
                        current.getPeerEpoch(),
                        qv.toString().getBytes());

                    sendqueue.offer(notmsg);
                } else {
                  	...
                    QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
                    switch (rstate) {
                    case 0:
                        // With zk1 already running, starting the debugged server as server 3 lands here; the state is LOOKING
                        ackstate = QuorumPeer.ServerState.LOOKING;
                        break;
                    case 1:
                        ackstate = QuorumPeer.ServerState.FOLLOWING;
                        break;
                    case 2:
                        ackstate = QuorumPeer.ServerState.LEADING;
                        break;
                    case 3:
                        ackstate = QuorumPeer.ServerState.OBSERVING;
                        break;
                    default:
                        continue;
                    }
                    // Fill in the election message; n is a Notification and acts as a DTO
                    n.leader = rleader;
                    n.zxid = rzxid;
                    n.electionEpoch = relectionEpoch;
                    n.state = ackstate;
                    n.sid = response.sid;
                    n.peerEpoch = rpeerepoch;
                    n.version = version;
                    n.qv = rqv;                   
                   
                    /*
                     * If this server is looking, then send proposed leader
                     */

                    if (self.getPeerState() == QuorumPeer.ServerState.LOOKING) {
                        // Put the vote on the queue; the election code running inside the QuorumPeer#run() loop takes votes off it asynchronously
                        recvqueue.offer(n);
						...
                        }
                    } else {
                        /*
                         * If this server is not looking, but the one that sent the ack
                         * is looking, then send back what it believes to be the leader.
                         */
                        Vote current = self.getCurrentVote();
                        if (ackstate == QuorumPeer.ServerState.LOOKING) {
                            if (self.leader != null) {
                                if (leadingVoteSet != null) {
                                    self.leader.setLeadingVoteSet(leadingVoteSet);
                                    leadingVoteSet = null;
                                }
                                self.leader.reportLookingSid(response.sid);
                            }                          

                            QuorumVerifier qv = self.getQuorumVerifier();
                            ToSend notmsg = new ToSend(
                                ToSend.mType.notification,
                                current.getId(),
                                current.getZxid(),
                                current.getElectionEpoch(),
                                self.getPeerState(),
                                response.sid,
                                current.getPeerEpoch(),
                                qv.toString().getBytes());
                            sendqueue.offer(notmsg);
                        }
                    }
                }
            } catch (InterruptedException e) {
               ...
            }
        }
        ...
    }

}
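
The field reads at the top of WorkerReceiver#run() can be reproduced with a plain ByteBuffer. The sketch below covers only the four fields whose reads appear in the excerpt (state, leader, zxid, election epoch), in that order; the peer epoch, version, and quorum verifier parts are elided in the excerpt and are therefore left out here. NotificationDecodeSketch and its members are hypothetical names.

```java
import java.nio.ByteBuffer;

/** Hypothetical decoder for the leading fields of an election message (illustration only). */
final class NotificationDecodeSketch {
    static final class Decoded {
        final int state;
        final long leader;
        final long zxid;
        final long electionEpoch;

        Decoded(int state, long leader, long zxid, long electionEpoch) {
            this.state = state;
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
        }
    }

    /** Writes the four fields in the order in which the receiver reads them. */
    static ByteBuffer encode(int state, long leader, long zxid, long electionEpoch) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 8 + 8);
        buf.putInt(state);
        buf.putLong(leader);
        buf.putLong(zxid);
        buf.putLong(electionEpoch);
        buf.flip();
        return buf;
    }

    /** Mirrors the getInt()/getLong() sequence shown in the excerpt. */
    static Decoded decode(ByteBuffer buf) {
        int state = buf.getInt();
        long leader = buf.getLong();
        long zxid = buf.getLong();
        long electionEpoch = buf.getLong();
        return new Decoded(state, leader, zxid, electionEpoch);
    }

    /** Maps the numeric state to its name as in the switch above; null for unknown values. */
    static String stateName(int rstate) {
        switch (rstate) {
        case 0:
            return "LOOKING";
        case 1:
            return "FOLLOWING";
        case 2:
            return "LEADING";
        case 3:
            return "OBSERVING";
        default:
            return null;
        }
    }

    public static void main(String[] args) {
        Decoded d = decode(encode(0, 3L, 0x100000005L, 1L));
        System.out.println(stateName(d.state) + " leader=" + d.leader
            + " zxid=0x" + Long.toHexString(d.zxid) + " electionEpoch=" + d.electionEpoch);
    }
}
```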

9. The Vote class. The code is as follows:

public class Vote {
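    // Message format version
    private final int version;
    // Server ID (myid) of the proposed leader
    private final long id;
    // zxid of the proposed leader
    private final long zxid;
    // Election epoch (the logical clock of the election round)
    private final long electionEpoch;
    // Epoch of the proposed leader
    private final long peerEpoch;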
}
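
With these fields, two votes can be ordered. The sketch below shows the comparison that FastLeaderElection is generally described as using: the higher peer epoch wins, then the higher zxid, then the higher server ID. It is a simplified illustration rather than the project's code: VoteOrderSketch and newVoteWins are hypothetical names, and the quorum-weight check is omitted.

```java
/** Hypothetical helper showing how two votes can be ordered (illustration only). */
final class VoteOrderSketch {
    /**
     * Returns true when the incoming vote should replace the current one:
     * compare peer epoch first, then zxid, then server ID.
     */
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    public static void main(String[] args) {
        // Fresh cluster: both servers report epoch 0 and zxid 0, so the larger myid wins.
        System.out.println(newVoteWins(3, 0, 0, 1, 0, 0));
    }
}
```

This ordering is consistent with the debug run described earlier: on a fresh cluster every server starts with the same epoch and zxid, so the tie is broken by myid and server 3 beats server 1.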

10. The leader starts leading
Called from item 5 above; enter Leader#lead().

void lead() throws IOException, InterruptedException {
	...
    try {
        // Set the current ZAB state to DISCOVERY (the discovery phase)
        self.setZabState(QuorumPeer.ZabState.DISCOVERY);
        self.tick.set(0);
        // Set the latest zxid and call ZooKeeperServer#takeSnapshot(boolean) to take a clean snapshot; no transactions are coming in at this point
        zk.loadData();

        leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid());

        // Start thread that waits for connection requests from
        // new followers.
        // Create a thread that waits for connection requests from followers; each accepted connection is served in LearnerHandler#run(). Every server other than the leader is a learner
        cnxAcceptor = new LearnerCnxAcceptor();
        cnxAcceptor.start();

        // Get the epoch to propose
        long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());

        zk.setZxid(ZxidUtils.makeZxid(epoch, 0));

        synchronized (this) {
            lastProposed = zk.getZxid();
        }

        newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(), null, null);

        if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) {
            LOG.info("NEWLEADER proposal has Zxid of {}", Long.toHexString(newLeaderProposal.packet.getZxid()));
        }

        QuorumVerifier lastSeenQV = self.getLastSeenQuorumVerifier();
        QuorumVerifier curQV = self.getQuorumVerifier();
        if (curQV.getVersion() == 0 && curQV.getVersion() == lastSeenQV.getVersion()) {
            try {
                QuorumVerifier newQV = self.configFromString(curQV.toString());
                newQV.setVersion(zk.getZxid());
                self.setLastSeenQuorumVerifier(newQV, true);
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        newLeaderProposal.addQuorumVerifier(self.getQuorumVerifier());
        if (self.getLastSeenQuorumVerifier().getVersion() > self.getQuorumVerifier().getVersion()) {
            newLeaderProposal.addQuorumVerifier(self.getLastSeenQuorumVerifier());
        }

        // We have to get at least a majority of servers in sync with
        // us. We do this by waiting for the NEWLEADER packet to get
        // acknowledged

        waitForEpochAck(self.getId(), leaderStateSummary);
        self.setCurrentEpoch(epoch);
        self.setLeaderAddressAndId(self.getQuorumAddress(), self.getId());
        self.setZabState(QuorumPeer.ZabState.SYNCHRONIZATION);

        try {
            waitForNewLeaderAck(self.getId(), zk.getZxid());
        } catch (InterruptedException e) {
            shutdown("Waiting for a quorum of followers, only synced with sids: [ "
                     + newLeaderProposal.ackSetsToString()
                     + " ]");
            HashSet<Long> followerSet = new HashSet<Long>();

            for (LearnerHandler f : getLearners()) {
                if (self.getQuorumVerifier().getVotingMembers().containsKey(f.getSid())) {
                    followerSet.add(f.getSid());
                }
            }
            boolean initTicksShouldBeIncreased = true;
            for (Proposal.QuorumVerifierAcksetPair qvAckset : newLeaderProposal.qvAcksetPairs) {
                if (!qvAckset.getQuorumVerifier().containsQuorum(followerSet)) {
                    initTicksShouldBeIncreased = false;
                    break;
                }
            }
            if (initTicksShouldBeIncreased) {
                LOG.warn("Enough followers present. Perhaps the initTicks need to be increased.");
            }
            return;
        }

        startZkServer();

        String initialZxid = System.getProperty("zookeeper.testingonly.initialZxid");
        if (initialZxid != null) {
            long zxid = Long.parseLong(initialZxid);
            zk.setZxid((zk.getZxid() & 0xffffffff00000000L) | zxid);
        }

        if (!System.getProperty("zookeeper.leaderServes", "yes").equals("no")) {
            self.setZooKeeperServer(zk);
        }

        self.setZabState(QuorumPeer.ZabState.BROADCAST);
        self.adminServer.setZooKeeperServer(zk);

    
        boolean tickSkip = true;
        // If not null then shutdown this leader
        String shutdownMessage = null;

        while (true) {
            synchronized (this) {
                long start = Time.currentElapsedTime();
                long cur = start;
                long end = start + self.tickTime / 2;
                while (cur < end) {
                    wait(end - cur);
                    cur = Time.currentElapsedTime();
                }

                if (!tickSkip) {
                    self.tick.incrementAndGet();
                }

                // We use an instance of SyncedLearnerTracker to
                // track synced learners to make sure we still have a
                // quorum of current (and potentially next pending) view.
                SyncedLearnerTracker syncedAckSet = new SyncedLearnerTracker();
                syncedAckSet.addQuorumVerifier(self.getQuorumVerifier());
                if (self.getLastSeenQuorumVerifier() != null
                    && self.getLastSeenQuorumVerifier().getVersion() > self.getQuorumVerifier().getVersion()) {
                    syncedAckSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
                }

                syncedAckSet.addAck(self.getId());

                for (LearnerHandler f : getLearners()) {
                    if (f.synced()) {
                        syncedAckSet.addAck(f.getSid());
                    }
                }

                // check leader running status
                if (!this.isRunning()) {
                    // set shutdown flag
                    shutdownMessage = "Unexpected internal error";
                    break;
                }

                if (!tickSkip && !syncedAckSet.hasAllQuorums()) {
                    // Lost quorum of last committed and/or last proposed
                    // config, set shutdown flag
                    shutdownMessage = "Not sufficient followers synced, only synced with sids: [ "
                                      + syncedAckSet.ackSetsToString()
                                      + " ]";
                    break;
                }
                tickSkip = !tickSkip;
            }
            for (LearnerHandler f : getLearners()) {
                f.ping();
            }
        }
        if (shutdownMessage != null) {
            shutdown(shutdownMessage);
            // leader goes in looking state
        }
    } finally {
        zk.unregisterJMX(this);
    }
}
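
lead() calls ZxidUtils.makeZxid(epoch, 0) and later masks the zxid with 0xffffffffL and 0xffffffff00000000L, which reflects a 64-bit zxid whose high 32 bits hold the epoch and whose low 32 bits hold a counter. The sketch below illustrates that packing; ZxidLayoutSketch and its method names are hypothetical and only mirror the idea.

```java
/** Hypothetical illustration of the epoch/counter layout of a zxid. */
final class ZxidLayoutSketch {
    /** Packs the epoch into the high 32 bits and the counter into the low 32 bits. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    /** Extracts the epoch (high 32 bits). */
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    /** Extracts the counter (low 32 bits). */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        // A new leader starts its epoch with counter 0, as in makeZxid(epoch, 0) above.
        long first = makeZxid(3, 0);
        System.out.println("0x" + Long.toHexString(first)
            + " epoch=" + epochOf(first) + " counter=" + counterOf(first));
    }
}
```

Because the epoch sits in the high bits, any zxid from a newer epoch compares greater than every zxid from an older one, whatever the counters are.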

11. The ZAB election process.
Called from item 5 above; enter the FastLeaderElection#lookForLeader() method. The code is as follows:

public Vote lookForLeader() throws InterruptedException {
    try {
        // Bean used for monitoring and statistics
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        ...
    }

    self.start_fle = Time.currentElapsedTime();
    try {
        /*
         * The votes from the current leader election are stored in recvset. In other words, a vote v is in recvset
         * if v.electionEpoch == logicalclock. The current participant uses recvset to deduce on whether a majority
         * of participants has voted for it.
         */
        // As the comment above says, this holds the votes received in the current election round
        Map<Long, Vote> recvset = new HashMap<Long, Vote>();

        /*
         * The votes from previous leader elections, as well as the votes from the current leader election are
         * stored in outofelection. Note that notifications in a LOOKING state are not stored in outofelection.
         * Only FOLLOWING or LEADING notifications are stored in outofelection. The current participant could use
         * outofelection to learn which participant is the leader if it arrives late (i.e., higher logicalclock than
         * the electionEpoch of the received notifications) in a leader election.
         */
        // Votes from peers that are already FOLLOWING or LEADING (previous and current elections)
        Map<Long, Vote> outofelection = new HashMap<Long, Vote>();

        int notTimeout = minNotificationInterval;

        synchronized (this) {
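            // Increment the logical clock (the election epoch)
            logicalclock.incrementAndGet();
            // Fill in the proposal fields: the proposed leader id (at startup each server votes for itself), the latest zxid, and the current epoch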
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        LOG.info(
            "New election. My id = {}, proposed zxid=0x{}",
            self.getId(),
            Long.toHexString(proposedZxid));
        // Send our own vote to the other servers; each server is called a peer
        sendNotifications();

        SyncedLearnerTracker voteSet;

        /*
         * Loop in which we exchange notifications until we find a leader
         */
        // Loop that receives votes from the other servers and evaluates them until a leader is elected
        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            if (n == null) {
                if (manager.haveDelivered()) {
                    sendNotifications();
                } else {
                    manager.connectAll();
                }

                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout * 2;
                notTimeout = Math.min(tmpTimeOut, maxNotificationInterval);
                LOG.info("Notification time out: {}", notTimeout);
            } else if (validVoter(n.sid) && validVoter(n.leader)) {
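                // The if condition above checks that the sender's sid and the proposed leader's sid are both in the server list configured in zoo.cfg
                // While debugging, server1 was started first and the server being debugged ran as server3; server2 was not started, so the message received here comes from server1
                // The message type (Notification) is described in section 12 below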
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                switch (n.state) {
                case LOOKING:
                    if (getInitLastLoggedZxid() == -1) {
                        LOG.debug("Ignoring notification as our zxid is -1");
                        break;
                    }
                    if (n.zxid == -1) {
                        LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
                        break;
                    }
                    // If notification > current, replace and send messages out
                    // Entered when the sender's election epoch > this server's own election epoch. While debugging, server1's was 2 and ours was 1, so this branch was taken.
                    if (n.electionEpoch > logicalclock.get()) {
                        logicalclock.set(n.electionEpoch);
                        recvset.clear();
                        // ZAB vote-comparison logic: a single compound condition, analyzed in section 13 below
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
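                            // While debugging, server1 and server3 had the same epoch (peerEpoch) and zxid, but server3's sid is larger, so server3 stays the leader candidate and execution enters here
                            // Reset the proposal to our own vote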
                            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                        }
                        // Send the resulting vote to the other servers
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                            LOG.debug(
                                "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
                                Long.toHexString(n.electionEpoch),
                                Long.toHexString(logicalclock.get()));
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    LOG.debug(
                        "Adding vote: from={}, proposed leader={}, proposed zxid=0x{}, proposed election epoch=0x{}",
                        n.sid,
                        n.leader,
                        Long.toHexString(n.zxid),
                        Long.toHexString(n.electionEpoch));

                    // don't care about the version if it's in LOOKING state
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));

                    if (voteSet.hasAllQuorums()) {

                        // Verify if there is any change in the proposed leader
                        while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        if (n == null) {
                            setPeerState(proposedLeader, voteSet);
                            Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: {}", n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    if (n.electionEpoch == logicalclock.get()) {
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        if (voteSet.hasAllQuorums() && checkLeader(recvset, n.leader, n.electionEpoch)) {
                            setPeerState(n.leader, voteSet);
                            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify that
                     * a majority are following the same leader.
                     *
                     * Note that the outofelection map also stores votes from the current leader election.
                     * See ZOOKEEPER-1732 for more information.
                     */
                    outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                    voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));

                    if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                        synchronized (this) {
                            logicalclock.set(n.electionEpoch);
                            setPeerState(n.leader, voteSet);
                        }
                        Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecoginized: {} (n.state), {}(n.sid)", n.state, n.sid);
                    break;
                }
            } else {
               ...
            }
        }
        return null;
    } finally {
        ...
    }
}
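
The "Exponential backoff" branch in lookForLeader() is easy to miss, so here is a minimal sketch of it in isolation. It only reproduces the doubling-with-a-cap arithmetic applied to notTimeout when recvqueue.poll() returns null. The class name, method names, and the two interval constants are assumptions for illustration; the real values come from fields of FastLeaderElection.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the poll-timeout doubling used while waiting for notifications. */
class NotificationBackoffSketch {
    // Assumed bounds for illustration only; not read from ZooKeeper itself.
    static final int MIN_INTERVAL_MS = 200;
    static final int MAX_INTERVAL_MS = 60_000;

    /** Returns the next timeout: double the current one, capped at the maximum. */
    static int nextTimeout(int current) {
        return Math.min(current * 2, MAX_INTERVAL_MS);
    }

    /** Lists the timeouts that would be used for the given number of consecutive empty polls. */
    static List<Integer> schedule(int emptyPolls) {
        List<Integer> timeouts = new ArrayList<>();
        int timeout = MIN_INTERVAL_MS;
        for (int i = 0; i < emptyPolls; i++) {
            timeouts.add(timeout);
            timeout = nextTimeout(timeout);
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // Print the timeout used for each of ten consecutive empty polls.
        System.out.println(schedule(10));
    }
}
```

With these assumed bounds the timeout stops growing once it reaches the cap, so a peer that hears nothing keeps resending its vote or reconnecting at a bounded interval instead of waiting ever longer.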

12. The vote (notification) type sent by other servers

public static class Notification {
    /*
     * Format version, introduced in 3.4.6
     */
    public static final int CURRENTVERSION = 0x2;
    // Fixed value 2
    int version;
    /*
     * Proposed leader
     */
    // Right after startup server1 proposes itself, so this is 1
    long leader;
    /*
     * zxid of the proposed leader
     */
    // server1 had already created nodes, so this carries a transaction id
    long zxid;
    /*
     * Epoch
     */
    // Election round; server1 sent 2
    long electionEpoch;
    /*
     * current state of sender
     */
    // Current state of the sending server; server1 also sent LOOKING (still looking for a leader)
    QuorumPeer.ServerState state;
    /*
     * Address of sender
     */
    // sid of the sending server, 1 here
    long sid;
    QuorumVerifier qv;
    /*
     * epoch of the proposed leader
     */
    // Epoch of the proposed leader; server1 sent 1
    long peerEpoch;
}
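
In step 11 each received Notification is stored in recvset keyed by the sender's sid, and the election ends once enough of those votes agree with the current proposal. The following is a minimal sketch of that tally, not the SyncedLearnerTracker or QuorumVerifier code itself. It assumes a simple-majority rule, reduces each vote to the proposed leader id only (the real comparison also looks at zxid and epochs), and assumes the peer's own vote is recorded in the map as well. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a simple-majority tally over votes keyed by voter sid. */
class MajorityTallySketch {
    /**
     * Returns true when strictly more than half of the ensemble backs the given leader.
     * votesBySid maps a voter's sid to the leader id that voter proposed.
     */
    static boolean hasMajority(Map<Long, Long> votesBySid, long leader, int ensembleSize) {
        long supporters = votesBySid.values().stream()
                .filter(proposed -> proposed == leader)
                .count();
        return supporters > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensembleSize = 3;                      // three servers configured, as in the debugging setup
        Map<Long, Long> recvset = new HashMap<>();

        recvset.put(3L, 3L);                       // server3 votes for itself
        System.out.println(hasMajority(recvset, 3L, ensembleSize));

        recvset.put(1L, 3L);                       // server1 switches its vote to server3
        System.out.println(hasMajority(recvset, 3L, ensembleSize));
    }
}
```

Under these assumptions, two agreeing votes out of three are enough, which is consistent with the debugging session above electing a leader while server2 was never started.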

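13. The ZAB logic for comparing votes, called from step 11 above
Enter FastLeaderElection#totalOrderPredicate(), which returns true if the new vote is larger. While debugging, server1 and server3 had the same epoch and zxid, but server3 (the server being debugged) has the larger sid, so server3 is chosen as leader.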

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {

    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }

    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */
    /* 
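    (1). If the new vote's epoch > the current vote's epoch, return true; otherwise go to (2).
     (2). If the epochs are equal and the new vote's zxid is greater than the current zxid, return true; otherwise go to (3).
     (3). If the zxids are also equal and the new vote's sid is greater than the current sid, return true. The sid is the number in the server.1, server.2, ... entries of zoo.cfg.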
     */
    return ((newEpoch > curEpoch)
            || ((newEpoch == curEpoch)
                && ((newZxid > curZxid)
                    || ((newZxid == curZxid)
                        && (newId > curId)))));
}
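
To make the three-level comparison concrete, here is a small standalone sketch that replays the debugging scenario. It restates the epoch, zxid, sid ordering as early returns and deliberately leaves out the getWeight() check, which needs a QuorumVerifier. The class name, method name, and the zxid values are assumptions chosen for illustration.

```java
/** Sketch of the (epoch, zxid, sid) ordering used to compare two votes. */
class VoteOrderSketch {
    /** Returns true when the new vote should replace the current one. */
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // (1) higher epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // (2) same epoch: higher zxid wins
        }
        return newId > curId;             // (3) same epoch and zxid: higher sid wins
    }

    public static void main(String[] args) {
        long zxid = 0x100L;               // hypothetical zxid shared by both servers

        // server3 receives server1's vote: same epoch and zxid, smaller sid.
        System.out.println(newVoteWins(1, zxid, 1, 3, zxid, 1));      // prints false

        // server1 receives server3's vote: same epoch and zxid, larger sid.
        System.out.println(newVoteWins(3, zxid, 1, 1, zxid, 1));      // prints true

        // A newer zxid outranks a larger sid.
        System.out.println(newVoteWins(1, zxid + 1, 1, 3, zxid, 1));  // prints true
    }
}
```

The first call corresponds to the else branch taken by server3 in step 11: server1's vote does not win, so server3 keeps proposing itself.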


II. Standalone mode: server startup from main()

 

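The main logic is: parse zoo.cfg, initialize the data manager and the ServerCnxnFactory network manager, restore local data, and initialize the session manager, the JMX service, and so on.
(I). Call ManagedUtil.registerLog4jMBeans() to register the Log4J MBeans.
(II). Call ServerConfig.parse() to parse the zoo.cfg configuration file. This enters QuorumPeerConfig.parse(), which loads zoo.cfg as a Properties object and stores the configured values in fields of the QuorumPeerConfig class. Internally, parse() calls setupQuorumPeerConfig() to resolve the serverId (the number configured in the myid file) and to decide whether this machine is a PARTICIPANT or an OBSERVER; the default is PARTICIPANT.
(III). Call ZooKeeperServerMain.runFromConfig(ServerConfig).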
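1. Initialize the FileTxnSnapLog (transaction log and snapshot storage) from the dataDir and dataLogDir directories.
2. Initialize the ZooKeeperServer.
3. Initialize and start the AdminServer management service on port 8080, which is served by Jetty. HTTP requests are handled by the class JettyAdminServer.CommandServlet.
4. Create the ServerCnxnFactory, which manages ZooKeeper connections. The implementation class is NIOServerCnxnFactory. Call NIOServerCnxnFactory.startup() to start it.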
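(1). Inside the factory, create the SelectorThread threads, which wait for network connections. When a connection comes in, it is registered in NIOServerCnxnFactory.cnxnExpiryQueue, the queue that tracks connection expiry.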
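(2). Create the ZKDatabase in-memory database, that is, the ZNode file system. The ZKDatabase constructor calls createDataTree() to create the node tree, in which every node is a DataNode. It calls WatchManagerFactory.createWatchManager() to create the WatchManager instances that track child-change and data-change watches and assigns them to fields of DataTree. ZKDatabase.loadDataBase() then enters FileTxnSnapLog.restore() to recover the node tree from disk. restore() iterates over the snapshot files under data/version-2/, reads them through jute's BinaryInputArchive, and calls FileSnap.deserialize() to load them into the DataTree object, which is mostly file-offset parsing. FileTxnSnapLog.fastForwardFromEdits() is called to obtain the latest transaction id, and FileTxnSnapLog.save() saves the current snapshot.
(3). Call ZooKeeperServer.createSessionTracker() to create the session manager.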
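(4). Call ZooKeeperServer.setupRequestProcessors() to build the request-processing chain of responsibility. The processors are constructed in the order FinalRequestProcessor, SyncRequestProcessor, PrepRequestProcessor, so a request flows the other way: PrepRequestProcessor first, FinalRequestProcessor last. The chain is decentralized: each processor stores a reference to the next one. PrepRequestProcessor and SyncRequestProcessor work asynchronously: each puts incoming requests on its own internal queue, and a dedicated thread takes them off one by one in a while loop and processes them. FinalRequestProcessor has no queue of its own and runs on the thread of the processor that calls it.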
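(5). Call ZooKeeperServer.registerJMX() to register the JMX beans.
(6). Call ZooKeeperServer.registerMetrics() to register metrics. It calls DefaultMetricsProvider.DefaultMetricsContext.registerGauge() to register a method reference for each metric, using the :: syntax introduced in Java 8.
5. Create the ContainerManager.
6. Call shutdownLatch.await() to wait until ZooKeeper shuts down.
7. If the server is started through QuorumPeerMain, the default role is LearnerType.PARTICIPANT. QuorumPeer.start() calls QuorumPeer.startLeaderElection() to begin the election; for details see "V. Leader election/ZAB". It then enters the QuorumPeer.run() loop, which runs until the process exits and executes different logic depending on the peer's role.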
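
The request-processor chain from step 4(4) can be sketched as follows. This is a simplified illustration of the pattern, not ZooKeeper's RequestProcessor classes: requests are plain strings, the stage names only echo the real processors, and the stop marker is a hypothetical poison pill added so the example terminates. It shows the two points made above: the chain is wired last stage first, and a queue-backed stage only enqueues in process() while its own thread forwards to the next stage.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch of a chain of responsibility with queue-backed stages. */
class ProcessorChainSketch {
    interface Processor {
        void process(String request);
    }

    /** Last stage: handles the request on the caller's thread and records it. */
    static class FinalStage implements Processor {
        final List<String> handled = new CopyOnWriteArrayList<>();

        public void process(String request) {
            handled.add(request);
        }
    }

    /** Queue-backed stage: process() only enqueues; a worker thread forwards to the next stage. */
    static class QueuedStage implements Processor {
        private static final String STOP = "__stop__"; // hypothetical poison pill
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final Thread worker;

        QueuedStage(String label, Processor next) {
            worker = new Thread(() -> {
                try {
                    // Take requests one by one until the stop marker arrives.
                    for (String r = queue.take(); !r.equals(STOP); r = queue.take()) {
                        next.process(label + ">" + r);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, label);
            worker.start();
        }

        public void process(String request) {
            queue.add(request);
        }

        /** Enqueues the stop marker and waits for the worker to drain the queue. */
        void shutdown() {
            queue.add(STOP);
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Wires final, then sync, then prep; sends the requests through prep; returns what the final stage saw. */
    static List<String> run(String... requests) {
        FinalStage fin = new FinalStage();
        QueuedStage sync = new QueuedStage("sync", fin);
        QueuedStage prep = new QueuedStage("prep", sync);
        for (String r : requests) {
            prep.process(r);
        }
        prep.shutdown(); // drain prep first so every request has reached sync
        sync.shutdown();
        return fin.handled;
    }

    public static void main(String[] args) {
        System.out.println(run("create /a", "delete /a"));
    }
}
```

Because each stage has a single worker and a FIFO queue, requests reach the final stage in the order they entered the first one, which is the property a transaction pipeline needs.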
