Section 1: References
1. References
https://blog.csdn.net/zhangyuan19880606/article/details/51508294
https://blog.csdn.net/chinaCsdnV2/article/details/81049686
https://www.cnblogs.com/jxwch/p/6433310.html
http://godmoon.wicp.net/blog/index.php/post_421.html
https://github.com/apache/zookeeper
<從PAXOS到ZOOKEEPER分佈式一致性原理與實踐>
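2. Build notes:
The version built for debugging is the 3.6.0 tag.
Use https for the ufpr.dl.sourceforge.net address.
Only the build.xml file needs to be changed.
Older versions build with ant; newer versions all use maven, and the following dependencies need to be added: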
<dependency>
<groupId>com.codahale.metrics</groupId>
<artifactId>metrics-core</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.xerial.snappy</groupId>
<artifactId>snappy-java</artifactId>
<version>1.1.7</version>
</dependency>
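3. Startup arguments
/Users/feivirus/Documents/project/eclipse/zookeeper/conf/zoo.cfg
If the admin server's port 8080 is already in use, set admin.serverPort=8081.
You can copy the startup arguments from another running zk server process and paste them into the IDEA debug run configuration.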
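4. Client connection command: ./zkCli.sh -server 127.0.0.1:2181
Command to check whether a node is leader or follower: ./zkServer.sh status /xxx/zoo.cfg
5. Questions:
(1). How does zookeeper voting work? Which fields does a vote contain, and how are votes compared?
(2). How are zookeeper transactions implemented?
(3). How are the leader's transaction requests (creating a node, deleting a node, modifying data) synchronized to the followers, and when?
(4). What is zookeeper's processing flow for a client request? Which processors are involved?
(5). How does a zk Proposal relate to a transaction?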
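Section 2: Architecture
I. Use cases
1. Data publish/subscribe
2. Load balancing
3. Naming service
4. Distributed coordination/notification
5. Cluster management
6. Master election
7. Distributed locks
8. Distributed queues
II. CAP theorem
III. Paxos algorithm
IV. ZAB protocol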
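Section 4: Source code details
The entry point for standalone mode is org.apache.zookeeper.server.ZooKeeperServerMain.main(). The entry point for cluster mode is org.apache.zookeeper.server.quorum.QuorumPeerMain.main().
I. Server main-method startup in cluster mode
The startup flow is as follows:
(I). Enter QuorumPeerMain#main(), which calls QuorumPeerMain#initializeAndRun(). The code is as follows: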
protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
//create the zk configuration object; its fields hold the values of the zoo.cfg settings
QuorumPeerConfig config = new QuorumPeerConfig();
if (args.length == 1) {
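/* Parse the process argument, which is the path of zoo.cfg. QuorumPeerConfig#parseProperties() is called to fill the QuorumPeerConfig fields, e.g. the client port, the data directory, the dataLog directory, the election algorithm (default 3), and the zk servers in the ensemble (3 in my debug setup). The code is analyzed in 1 below. */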
config.parse(args[0]);
}
// Start and schedule the purge task
//periodically purge old transaction logs and snapshot files
DatadirCleanupManager purgeMgr = new DatadirCleanupManager(
config.getDataDir(),
config.getDataLogDir(),
config.getSnapRetainCount(),
config.getPurgeInterval());
purgeMgr.start();
if (args.length == 1 && config.isDistributed()) {
//start zookeeper in cluster mode; analyzed in 2 below
runFromConfig(config);
} else {
LOG.warn("Either no config or no quorum defined in config, running in standalone mode");
// there is only server in the quorum -- run as standalone
ZooKeeperServerMain.main(args);
}
}
1. Parse the zoo.cfg file and read the configuration
public void parseProperties(Properties zkProp) throws IOException, ConfigException {
...
VerifyingFileFactory vff = new VerifyingFileFactory.Builder(LOG).warnForRelativePath().build();
//iterate over the entries of zoo.cfg
for (Entry<Object, Object> entry : zkProp.entrySet()) {
String key = entry.getKey().toString().trim();
String value = entry.getValue().toString().trim();
if (key.equals("dataDir")) {
//location of the data directory
dataDir = vff.create(value);
} else if (key.equals("dataLogDir")) {
//location of the dataLog (transaction log) directory
dataLogDir = vff.create(value);
} else if (key.equals("clientPort")) {
//port that clients connect to
clientPort = Integer.parseInt(value);
} ...
} else if (key.equals("clientPortAddress")) {
clientPortAddress = value.trim();
} ...
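} else if (key.equals("tickTime")) {
//length of one tick (the basic heartbeat time unit) in milliseconds
tickTime = Integer.parseInt(value);
} else if (key.equals("maxClientCnxns")) {
//maximum number of concurrent connections allowed from a single client IP
maxClientCnxns = Integer.parseInt(value);
} else if (key.equals("minSessionTimeout")) {
//minimum session timeout
minSessionTimeout = Integer.parseInt(value);
} else if (key.equals("maxSessionTimeout")) {
//maximum session timeout
maxSessionTimeout = Integer.parseInt(value);
} else if (key.equals("initLimit")) {
//number of ticks a follower may take to connect and sync to the leader initially
initLimit = Integer.parseInt(value);
} else if (key.equals("syncLimit")) {
//number of ticks a follower may lag behind the leader before it is dropped (not a connection count)
syncLimit = Integer.parseInt(value);
} ...
} else if (key.equals("electionAlg")) {
//election algorithm; the field defaults to 3, which is the only value accepted
electionAlg = Integer.parseInt(value);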
if (electionAlg != 3) {
throw new ConfigException("Invalid electionAlg value. Only 3 is supported.");
}
} ...
} else if (key.equals("peerType")) {
if (value.toLowerCase().equals("observer")) {
peerType = LearnerType.OBSERVER;
} else if (value.toLowerCase().equals("participant")) {
peerType = LearnerType.PARTICIPANT;
} else {
throw new ConfigException("Unrecognised peertype: " + value);
}
} ...
} else if (key.equals("standaloneEnabled")) {
if (value.toLowerCase().equals("true")) {
setStandaloneEnabled(true);
} else if (value.toLowerCase().equals("false")) {
setStandaloneEnabled(false);
} else {
throw new ConfigException("Invalid option "
+ value
+ " for standalone mode. Choose 'true' or 'false.'");
}
} ...
} else if (key.equals("quorum.cnxn.threads.size")) {
quorumCnxnThreadsSize = Integer.parseInt(value);
} ...
} else {
System.setProperty("zookeeper." + key, value);
}
}
...
...
try {
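//check that the metrics provider class can be loaded (initialize is false, so nothing is instantiated here)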
Class.forName(metricsProviderClassName, false, Thread.currentThread().getContextClassLoader());
} catch (ClassNotFoundException error) {
throw new IllegalArgumentException("metrics provider class was not found", error);
}
// backward compatibility - dynamic configuration in the same file as
// static configuration params see writeDynamicConfig()
if (dynamicConfigFileStr == null) {
/* reads the myid file and sets serverId */
setupQuorumPeerConfig(zkProp, true);
if (isDistributed() && isReconfigEnabled()) {
// we don't backup static config for standalone mode.
// we also don't backup if reconfig feature is disabled.
backupOldConfig();
}
}
}
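The key-by-key parsing above can be sketched in isolation. This is a minimal sketch, not the real QuorumPeerConfig: the class ZooCfgSketch, its fields, the tickTime default and the `other` map are assumptions made for illustration (the real code calls System.setProperty for unknown keys; the sketch only records them).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical, simplified stand-in for QuorumPeerConfig (illustration only).
class ZooCfgSketch {
    String dataDir;
    int clientPort;
    int tickTime = 3000;   // assumed default for this sketch
    int electionAlg = 3;   // only 3 is accepted, as in the excerpt above
    // Unknown keys, stored under the "zookeeper." prefix instead of System.setProperty.
    final Map<String, String> other = new HashMap<>();

    static ZooCfgSketch parse(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ZooCfgSketch cfg = new ZooCfgSketch();
        for (Map.Entry<Object, Object> entry : props.entrySet()) {
            String key = entry.getKey().toString().trim();
            String value = entry.getValue().toString().trim();
            if (key.equals("dataDir")) {
                cfg.dataDir = value;
            } else if (key.equals("clientPort")) {
                cfg.clientPort = Integer.parseInt(value);
            } else if (key.equals("tickTime")) {
                cfg.tickTime = Integer.parseInt(value);
            } else if (key.equals("electionAlg")) {
                cfg.electionAlg = Integer.parseInt(value);
                if (cfg.electionAlg != 3) {
                    throw new IllegalArgumentException("Only electionAlg 3 is supported");
                }
            } else {
                cfg.other.put("zookeeper." + key, value);
            }
        }
        return cfg;
    }

    public static void main(String[] args) {
        ZooCfgSketch cfg = parse("dataDir=/tmp/zk\nclientPort=2181\nadmin.serverPort=8081\n");
        System.out.println(cfg.dataDir + " " + cfg.clientPort + " " + cfg.other);
    }
}
```

A key such as admin.serverPort is not matched by any branch, so it falls through to the last else, which is how the port change from Section 1 takes effect.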
2. Start zookeeper.
Enter QuorumPeerMain#runFromConfig().
public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
try {
//register the log4j MBeans with JMX
ManagedUtil.registerLog4jMBeans();
} catch (JMException e) {
LOG.warn("Unable to register log4j JMX control", e);
}
MetricsProvider metricsProvider;
try {
//start the metrics provider used for performance monitoring (DefaultMetricsProvider here)
metricsProvider = MetricsProviderBootstrap.startMetricsProvider(
config.getMetricsProviderClassName(),
config.getMetricsProviderConfiguration());
} catch (MetricsProviderLifeCycleException error) {
throw new IOException("Cannot boot MetricsProvider " + config.getMetricsProviderClassName(), error);
}
try {
ServerMetrics.metricsProviderInitialized(metricsProvider);
ServerCnxnFactory cnxnFactory = null;
ServerCnxnFactory secureCnxnFactory = null;
if (config.getClientPortAddress() != null) {
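//create the connection factory that accepts incoming connections (NIOServerCnxnFactory, based on NIO)
cnxnFactory = ServerCnxnFactory.createFactory();
//configure the factory with the client address and connection limits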
cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), config.getClientPortListenBacklog(), false);
}
...
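//create the QuorumPeer object, which drives the election logic
quorumPeer = getQuorumPeer();
//attach the transaction log/snapshot manager, which handles saving and restoring the files under the data directories
quorumPeer.setTxnFactory(new FileTxnSnapLog(config.getDataLogDir(), config.getDataDir()));
...
//set the election algorithm, myid, timeouts and other values read from zoo.cfg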
quorumPeer.setElectionType(config.getElectionAlg());
quorumPeer.setMyid(config.getServerId());
...
//create the in-memory database that stores zookeeper's tree-shaped file system; the tree code is analyzed in 3 below.
quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
...
/* Initialize the configuration in the in-memory database: mainly adds the /zookeeper/config node and writes each server.x entry from zoo.cfg into it, as follows:
get /zookeeper/config
server.1=127.0.0.1:2888:3888:participant
server.2=127.0.0.1:2889:3889:participant
server.3=127.0.0.1:2890:3890:participant
version=0
*/
quorumPeer.initConfigInZKDatabase();
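//set the connection factory
quorumPeer.setCnxnFactory(cnxnFactory);
//the default server role is the enum value LearnerType#PARTICIPANT
quorumPeer.setLearnerType(config.getPeerType());
...
quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
//initialize the credential-based quorum authentication server; with the configuration used here it is an empty implementation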
quorumPeer.initialize();
if (config.jvmPauseMonitorToRun) {
quorumPeer.setJvmPauseMonitor(new JvmPauseMonitor(config));
}
//enter QuorumPeer#start(), an override that loads the database and starts the connection factory before the thread itself runs; analyzed in 4 below
quorumPeer.start();
ZKAuditProvider.addZKStartStopAuditLog();
quorumPeer.join();
} catch (InterruptedException e) {
...
}
}
3. The tree-shaped file system in zookeeper's memory
It is defined in the ZKDatabase class. The code is as follows:
public class ZKDatabase {
//the in-memory tree, rooted at the "/" path
protected DataTree dataTree;
protected ConcurrentHashMap<Long, Integer> sessionsWithTimeouts;
//transaction log and snapshot files
protected FileTxnSnapLog snapLog;
protected long minCommittedLog, maxCommittedLog;
/**
* Default value is to use snapshot if txnlog size exceeds 1/3 the size of snapshot
*/
public static final String SNAPSHOT_SIZE_FACTOR = "zookeeper.snapshotSizeFactor";
public static final double DEFAULT_SNAPSHOT_SIZE_FACTOR = 0.33;
private double snapshotSizeFactor;
public static final String COMMIT_LOG_COUNT = "zookeeper.commitLogCount";
public static final int DEFAULT_COMMIT_LOG_COUNT = 500;
public int commitLogCount;
protected static int commitLogBuffer = 700;
protected Queue<Proposal> committedLog = new ArrayDeque<>();
protected ReentrantReadWriteLock logLock = new ReentrantReadWriteLock();
/**
* Number of txn since last snapshot;
*/
private AtomicInteger txnCount = new AtomicInteger(0);
}
Definition of the tree:
public class DataTree {
private final RateLogger RATE_LOGGER = new RateLogger(LOG, 15 * 60 * 1000);
/**
* This map provides a fast lookup to the datanodes. The tree is the
* source of truth and is where all the locking occurs
*/
//maps every path to its DataNode
private final NodeHashMap nodes;
//watches on data changes
private IWatchManager dataWatches;
//watches on child changes
private IWatchManager childWatches;
...
/**
* This hashtable lists the paths of the ephemeral nodes of a session.
*/
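//paths of the ephemeral nodes owned by each session, keyed by session id
private final Map<Long, HashSet<String>> ephemerals = new ConcurrentHashMap<Long, HashSet<String>>();
//zxid (with digest) of the last processed transaction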
private volatile ZxidDigest lastProcessedZxidDigest;
}
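The two maps above (path to node, session id to ephemeral paths) can be sketched as a tiny tree. This is an illustrative sketch only: DataTreeSketch, Node, create(), getChildren() and getEphemerals() are hypothetical names, and watches, ACLs, stats and zxid bookkeeping are left out.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical miniature of the path -> node map and the per-session ephemeral index.
class DataTreeSketch {
    static final class Node {
        byte[] data;
        final Set<String> children = new HashSet<>();
        Node(byte[] data) { this.data = data; }
    }

    // Flat lookup table: full path -> node, as in DataTree.nodes.
    private final Map<String, Node> nodes = new ConcurrentHashMap<>();
    // Session id -> paths of the ephemeral nodes created by that session.
    private final Map<Long, Set<String>> ephemerals = new ConcurrentHashMap<>();

    DataTreeSketch() {
        nodes.put("/", new Node(new byte[0]));
    }

    // ephemeralOwner == 0 means a persistent node (an assumption of this sketch).
    void create(String path, byte[] data, long ephemeralOwner) {
        int slash = path.lastIndexOf('/');
        String parentPath = slash == 0 ? "/" : path.substring(0, slash);
        String childName = path.substring(slash + 1);
        Node parent = nodes.get(parentPath);
        if (parent == null) {
            throw new IllegalArgumentException("no parent node: " + parentPath);
        }
        if (nodes.containsKey(path)) {
            throw new IllegalArgumentException("node exists: " + path);
        }
        synchronized (parent) {
            parent.children.add(childName);
        }
        nodes.put(path, new Node(data));
        if (ephemeralOwner != 0) {
            ephemerals.computeIfAbsent(ephemeralOwner, k -> ConcurrentHashMap.newKeySet()).add(path);
        }
    }

    Set<String> getChildren(String path) {
        Node n = nodes.get(path);
        if (n == null) {
            throw new IllegalArgumentException("no node: " + path);
        }
        synchronized (n) {
            return new HashSet<>(n.children);
        }
    }

    Set<String> getEphemerals(long sessionId) {
        return ephemerals.getOrDefault(sessionId, Collections.emptySet());
    }

    public static void main(String[] args) {
        DataTreeSketch tree = new DataTreeSketch();
        tree.create("/app", new byte[0], 0);
        tree.create("/app/lock", new byte[0], 42L);
        System.out.println(tree.getChildren("/app") + " " + tree.getEphemerals(42L));
    }
}
```

Keeping the session index separate from the tree means that closing a session only needs the session id to find every node it must delete.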
4. QuorumPeer#start() loads the database and starts the connection factory
public synchronized void start() {
if (!getView().containsKey(myid)) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
//restore from files and load the database; calls FileTxnSnapLog#restore(), analyzed in 6 below
loadDataBase();
//call NIOServerCnxnFactory#start() to start the connection factory
startServerCnxnFactory();
try {
adminServer.start();
} catch (AdminServerException e) {
LOG.warn("Problem starting AdminServer", e);
System.out.println(e);
}
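//start the election: create this server's initial Vote from its myid, zxid and epoch (both 0 by default on a fresh server), then enter QuorumPeer#createElectionAlgorithm(), analyzed in 7 below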
startLeaderElection();
startJvmPauseMonitor();
//start the thread, which enters QuorumPeer#run(); see 5 below.
super.start();
}
5. The main zookeeper election thread.
Enter the QuorumPeer#run() method.
public void run() {
//rename the thread to the form QuorumPeer[myid=xx]xxx
updateThreadName();
LOG.debug("Starting quorum peer");
try {
jmxQuorumBean = new QuorumBean(this);
MBeanRegistry.getInstance().register(jmxQuorumBean, null);
for (QuorumServer s : getView().values()) {
//iterate over all servers listed in zoo.cfg
ZKMBeanInfo p;
if (getId() == s.id) {
//register the local server
p = jmxLocalPeerBean = new LocalPeerBean(this);
try {
MBeanRegistry.getInstance().register(p, jmxQuorumBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
jmxLocalPeerBean = null;
}
} else {
//register the other servers
RemotePeerBean rBean = new RemotePeerBean(this, s);
try {
MBeanRegistry.getInstance().register(rBean, jmxQuorumBean);
jmxRemotePeerBean.put(s.id, rBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
}
}
}
} catch (Exception e) {
...
}
try {
/*
* Main loop
*/
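//loop until the peer stops running
while (running) {
//dispatch on this server's current state
switch (getPeerState()) {
case LOOKING:
//looking for a leader; a server is in this state right after startup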
ServerMetrics.getMetrics().LOOKING_COUNT.add(1);
if (Boolean.getBoolean("readonlymode.enabled")) {
...
} else {
try {
reconfigFlagClear();
if (shuttingDownLE) {
shuttingDownLE = false;
startLeaderElection();
}
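/* makeLEStrategy() returns the current election algorithm. lookForLeader() runs the election and decides who the leader is; it enters FastLeaderElection#lookForLeader(), which is where the zab election logic lives, analyzed in 11 below.
Once this call returns, the election result is known. In my debug session server3 became leader. */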
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
}
break;
case OBSERVING:
try {
LOG.info("OBSERVING");
setObserver(makeObserver(logFactory));
observer.observeLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
} finally {
observer.shutdown();
setObserver(null);
updateServerState();
// Add delay jitter before we switch to LOOKING
// state to reduce the load of ObserverMaster
if (isRunning()) {
Observer.waitForObserverElectionDelay();
}
}
break;
case FOLLOWING:
try {
LOG.info("FOLLOWING");
setFollower(makeFollower(logFactory));
follower.followLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
} finally {
follower.shutdown();
setFollower(null);
updateServerState();
}
break;
case LEADING:
LOG.info("LEADING");
try {
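//the votes exchanged with server1 decided that server3 is the leader, so the loop enters this branch and sets this server up as leader
setLeader(makeLeader(logFactory));
//server3 starts leading; analyzed in 10 below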
leader.lead();
setLeader(null);
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
} finally {
if (leader != null) {
leader.shutdown("Forcing shutdown");
setLeader(null);
}
updateServerState();
}
break;
}
}
} finally {
LOG.warn("QuorumPeer main thread exited");
MBeanRegistry instance = MBeanRegistry.getInstance();
instance.unregister(jmxQuorumBean);
instance.unregister(jmxLocalPeerBean);
for (RemotePeerBean remotePeerBean : jmxRemotePeerBean.values()) {
instance.unregister(remotePeerBean);
}
jmxQuorumBean = null;
jmxLocalPeerBean = null;
jmxRemotePeerBean = null;
}
}
6. Restoring zookeeper data from files
Enter FileTxnSnapLog#restore().
public long restore(DataTree dt, Map<Long, Integer> sessions, PlayBackListener listener) throws IOException {
long snapLoadingStartTime = Time.currentElapsedTime();
//restore the data tree and sessions from the snapshot files under data/version-2
long deserializeResult = snapLog.deserialize(dt, sessions);
ServerMetrics.getMetrics().STARTUP_SNAP_LOAD_TIME.add(Time.currentElapsedTime() - snapLoadingStartTime);
FileTxnLog txnLog = new FileTxnLog(dataDir);
boolean trustEmptyDB;
//the "initialize" marker file
File initFile = new File(dataDir.getParent(), "initialize");
if (Files.deleteIfExists(initFile.toPath())) {
LOG.info("Initialize file found, an empty database will not block voting participation");
trustEmptyDB = true;
} else {
trustEmptyDB = autoCreateDB;
}
...
//no snapshot was found
if (-1L == deserializeResult) {
/* this means that we couldn't find any snapshot, so we need to
* initialize an empty database (reported in ZOOKEEPER-2325) */
if (txnLog.getLastLoggedZxid() != -1) {
// ZOOKEEPER-3056: provides an escape hatch for users upgrading
// from old versions of zookeeper (3.4.x, pre 3.5.3).
if (!trustEmptySnapshot) {
throw new IOException(EMPTY_SNAPSHOT_WARNING + "Something is broken!");
} else {
LOG.warn("{}This should only be allowed during upgrading.", EMPTY_SNAPSHOT_WARNING);
return finalizer.run();
}
}
if (trustEmptyDB) {
/* TODO: (br33d) we should either put a ConcurrentHashMap on restore()
* or use Map on save() */
//create a new database file
save(dt, (ConcurrentHashMap<Long, Integer>) sessions, false);
/* return a zxid of 0, since we know the database is empty */
return 0L;
} else {
...
dt.lastProcessedZxid = -1L;
return -1L;
}
}
return finalizer.run();
}
7. Creating the election threads
Enter QuorumPeer#createElectionAlgorithm(). The code is as follows:
protected Election createElectionAlgorithm(int electionAlgorithm) {
Election le = null;
//TODO: use a factory rather than a switch
switch (electionAlgorithm) {
case 1:
throw new UnsupportedOperationException("Election Algorithm 1 is not supported.");
case 2:
throw new UnsupportedOperationException("Election Algorithm 2 is not supported.");
case 3:
//3 is FastLeaderElection; the other two are no longer supported. Create the election connection manager
QuorumCnxManager qcm = createCnxnManager();
QuorumCnxManager oldQcm = qcmRef.getAndSet(qcm);
if (oldQcm != null) {
LOG.warn("Clobbering already-set QuorumCnxManager (restarting leader election?)");
oldQcm.halt();
}
QuorumCnxManager.Listener listener = qcm.listener;
if (listener != null) {
//start a new thread that listens for election connections from the other servers
listener.start();
FastLeaderElection fle = new FastLeaderElection(this, qcm);
//start the messenger threads; FastLeaderElection.Messenger.WorkerReceiver#run() handles votes from the other servers. The election begins.
fle.start();
le = fle;
} else {
LOG.error("Null listener when initializing cnx manager");
}
break;
default:
assert false;
}
return le;
}
8. Handling votes from other servers
Enter the FastLeaderElection.Messenger.WorkerReceiver#run() method.
public void run() {
Message response;
while (!stop) {
// Sleeps on receive
try {
//take a message from the receive queue
response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
if (response == null) {
continue;
}
...
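//holder for the election-related information
Notification n = new Notification();
int rstate = response.buffer.getInt();
//leader proposed by the sending server
long rleader = response.buffer.getLong();
//zxid carried by the sending server's vote
long rzxid = response.buffer.getLong();
//election epoch of the sending server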
long relectionEpoch = response.buffer.getLong();
long rpeerepoch;
int version = 0x0;
...
QuorumVerifier rqv = null;
...
/*
* If it is from a non-voting server (such as an observer or
* a non-voting follower), respond right away.
*/
if (!validVoter(response.sid)) {
//the sender is not a voting member (e.g. an observer), so reply right away with this server's current vote
Vote current = self.getCurrentVote();
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(
ToSend.mType.notification,
current.getId(),
current.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
current.getPeerEpoch(),
qv.toString().getBytes());
sendqueue.offer(notmsg);
} else {
...
QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
switch (rstate) {
case 0:
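//after starting one server (zk1) and then starting the debugged server as server 3, execution reaches here; the sender is in the LOOKING state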
ackstate = QuorumPeer.ServerState.LOOKING;
break;
case 1:
ackstate = QuorumPeer.ServerState.FOLLOWING;
break;
case 2:
ackstate = QuorumPeer.ServerState.LEADING;
break;
case 3:
ackstate = QuorumPeer.ServerState.OBSERVING;
break;
default:
continue;
}
//fill in the election message; n is a Notification, which serves as a DTO
n.leader = rleader;
n.zxid = rzxid;
n.electionEpoch = relectionEpoch;
n.state = ackstate;
n.sid = response.sid;
n.peerEpoch = rpeerepoch;
n.version = version;
n.qv = rqv;
/*
* If this server is looking, then send proposed leader
*/
if (self.getPeerState() == QuorumPeer.ServerState.LOOKING) {
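//queue the vote; FastLeaderElection#lookForLeader(), driven by the QuorumPeer#run() loop, takes it off the queue asynchronously and processes it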
recvqueue.offer(n);
...
}
} else {
/*
* If this server is not looking, but the one that sent the ack
* is looking, then send back what it believes to be the leader.
*/
Vote current = self.getCurrentVote();
if (ackstate == QuorumPeer.ServerState.LOOKING) {
if (self.leader != null) {
if (leadingVoteSet != null) {
self.leader.setLeadingVoteSet(leadingVoteSet);
leadingVoteSet = null;
}
self.leader.reportLookingSid(response.sid);
}
QuorumVerifier qv = self.getQuorumVerifier();
ToSend notmsg = new ToSend(
ToSend.mType.notification,
current.getId(),
current.getZxid(),
current.getElectionEpoch(),
self.getPeerState(),
response.sid,
current.getPeerEpoch(),
qv.toString().getBytes());
sendqueue.offer(notmsg);
}
}
}
} catch (InterruptedException e) {
...
}
}
...
}
}
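The first reads in WorkerReceiver#run() (one int for the state, then three longs) can be tried out on a plain ByteBuffer. The sketch below is an assumption-laden illustration: NotificationCodecSketch, Decoded, encode() and decode() are hypothetical names, and only the four fields read explicitly in the excerpt are handled (the peer epoch, version and config payload that the real message also carries are left out).

```java
import java.nio.ByteBuffer;

// Hypothetical round-trip of the first four notification fields shown in the excerpt.
class NotificationCodecSketch {
    // Declared in the same order as the switch in the excerpt: 0, 1, 2, 3.
    enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    static final class Decoded {
        final ServerState state;
        final long leader;
        final long zxid;
        final long electionEpoch;

        Decoded(ServerState state, long leader, long zxid, long electionEpoch) {
            this.state = state;
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
        }
    }

    static ByteBuffer encode(ServerState state, long leader, long zxid, long electionEpoch) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 3 * 8);
        buf.putInt(state.ordinal());
        buf.putLong(leader);
        buf.putLong(zxid);
        buf.putLong(electionEpoch);
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    static Decoded decode(ByteBuffer buf) {
        int rstate = buf.getInt();
        long rleader = buf.getLong();
        long rzxid = buf.getLong();
        long relectionEpoch = buf.getLong();
        ServerState state;
        switch (rstate) {
            case 0: state = ServerState.LOOKING; break;
            case 1: state = ServerState.FOLLOWING; break;
            case 2: state = ServerState.LEADING; break;
            case 3: state = ServerState.OBSERVING; break;
            default: throw new IllegalArgumentException("unknown state " + rstate);
        }
        return new Decoded(state, rleader, rzxid, relectionEpoch);
    }

    public static void main(String[] args) {
        Decoded d = decode(encode(ServerState.LOOKING, 1L, 0x100000000L, 2L));
        System.out.println(d.state + " leader=" + d.leader + " epoch=" + d.electionEpoch);
    }
}
```

The real receiver skips a message with an unknown state (`continue`); the sketch throws instead so that the mistake is visible.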
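9. The Vote class. The code is as follows:
public class Vote {
//format version
private final int version;
//sid of the proposed leader
private final long id;
//zxid of the proposed leader
private final long zxid;
//election epoch of the vote
private final long electionEpoch;
//epoch of the proposed leader
private final long peerEpoch;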
}
10. The leader starts leading
Called from 5 above; enter Leader#lead()
void lead() throws IOException, InterruptedException {
...
try {
//set the Zab state to DISCOVERY, the phase in which the newly elected leader learns the epochs of its followers
self.setZabState(QuorumPeer.ZabState.DISCOVERY);
self.tick.set(0);
//load the data and set the latest zxid; this calls ZooKeeperServer#takeSnapshot(boolean) to take a clean snapshot, since no transactions are coming in yet
zk.loadData();
leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid());
// Start thread that waits for connection requests from
// new followers.
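//start a thread that waits for connection requests from followers; each connection is then served by a thread running LearnerHandler#run(). Every server other than the leader is a learner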
cnxAcceptor = new LearnerCnxAcceptor();
cnxAcceptor.start();
//obtain the epoch to use for new proposals
long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
synchronized (this) {
lastProposed = zk.getZxid();
}
newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(), null, null);
if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) {
LOG.info("NEWLEADER proposal has Zxid of {}", Long.toHexString(newLeaderProposal.packet.getZxid()));
}
QuorumVerifier lastSeenQV = self.getLastSeenQuorumVerifier();
QuorumVerifier curQV = self.getQuorumVerifier();
if (curQV.getVersion() == 0 && curQV.getVersion() == lastSeenQV.getVersion()) {
try {
QuorumVerifier newQV = self.configFromString(curQV.toString());
newQV.setVersion(zk.getZxid());
self.setLastSeenQuorumVerifier(newQV, true);
} catch (Exception e) {
throw new IOException(e);
}
}
newLeaderProposal.addQuorumVerifier(self.getQuorumVerifier());
if (self.getLastSeenQuorumVerifier().getVersion() > self.getQuorumVerifier().getVersion()) {
newLeaderProposal.addQuorumVerifier(self.getLastSeenQuorumVerifier());
}
// We have to get at least a majority of servers in sync with
// us. We do this by waiting for the NEWLEADER packet to get
// acknowledged
waitForEpochAck(self.getId(), leaderStateSummary);
self.setCurrentEpoch(epoch);
self.setLeaderAddressAndId(self.getQuorumAddress(), self.getId());
self.setZabState(QuorumPeer.ZabState.SYNCHRONIZATION);
try {
waitForNewLeaderAck(self.getId(), zk.getZxid());
} catch (InterruptedException e) {
shutdown("Waiting for a quorum of followers, only synced with sids: [ "
+ newLeaderProposal.ackSetsToString()
+ " ]");
HashSet<Long> followerSet = new HashSet<Long>();
for (LearnerHandler f : getLearners()) {
if (self.getQuorumVerifier().getVotingMembers().containsKey(f.getSid())) {
followerSet.add(f.getSid());
}
}
boolean initTicksShouldBeIncreased = true;
for (Proposal.QuorumVerifierAcksetPair qvAckset : newLeaderProposal.qvAcksetPairs) {
if (!qvAckset.getQuorumVerifier().containsQuorum(followerSet)) {
initTicksShouldBeIncreased = false;
break;
}
}
if (initTicksShouldBeIncreased) {
LOG.warn("Enough followers present. Perhaps the initTicks need to be increased.");
}
return;
}
startZkServer();
String initialZxid = System.getProperty("zookeeper.testingonly.initialZxid");
if (initialZxid != null) {
long zxid = Long.parseLong(initialZxid);
zk.setZxid((zk.getZxid() & 0xffffffff00000000L) | zxid);
}
if (!System.getProperty("zookeeper.leaderServes", "yes").equals("no")) {
self.setZooKeeperServer(zk);
}
self.setZabState(QuorumPeer.ZabState.BROADCAST);
self.adminServer.setZooKeeperServer(zk);
boolean tickSkip = true;
// If not null then shutdown this leader
String shutdownMessage = null;
while (true) {
synchronized (this) {
long start = Time.currentElapsedTime();
long cur = start;
long end = start + self.tickTime / 2;
while (cur < end) {
wait(end - cur);
cur = Time.currentElapsedTime();
}
if (!tickSkip) {
self.tick.incrementAndGet();
}
// We use an instance of SyncedLearnerTracker to
// track synced learners to make sure we still have a
// quorum of current (and potentially next pending) view.
SyncedLearnerTracker syncedAckSet = new SyncedLearnerTracker();
syncedAckSet.addQuorumVerifier(self.getQuorumVerifier());
if (self.getLastSeenQuorumVerifier() != null
&& self.getLastSeenQuorumVerifier().getVersion() > self.getQuorumVerifier().getVersion()) {
syncedAckSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
}
syncedAckSet.addAck(self.getId());
for (LearnerHandler f : getLearners()) {
if (f.synced()) {
syncedAckSet.addAck(f.getSid());
}
}
// check leader running status
if (!this.isRunning()) {
// set shutdown flag
shutdownMessage = "Unexpected internal error";
break;
}
if (!tickSkip && !syncedAckSet.hasAllQuorums()) {
// Lost quorum of last committed and/or last proposed
// config, set shutdown flag
shutdownMessage = "Not sufficient followers synced, only synced with sids: [ "
+ syncedAckSet.ackSetsToString()
+ " ]";
break;
}
tickSkip = !tickSkip;
}
for (LearnerHandler f : getLearners()) {
f.ping();
}
}
if (shutdownMessage != null) {
shutdown(shutdownMessage);
// leader goes in looking state
}
} finally {
zk.unregisterJMX(this);
}
}
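lead() builds the first zxid of the new epoch with ZxidUtils.makeZxid(epoch, 0) and masks with 0xffffffffL and 0xffffffff00000000L. The sketch below shows the 64-bit layout those masks imply: the epoch in the high 32 bits and a per-epoch counter in the low 32 bits. ZxidSketch and its method names are hypothetical; it is an illustration, not the ZxidUtils source.

```java
// Hypothetical helper showing the epoch/counter layout of a zxid.
class ZxidSketch {
    // High 32 bits: epoch. Low 32 bits: counter within that epoch.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long first = makeZxid(2, 0);   // first zxid of epoch 2, as set by zk.setZxid(...)
        long fifth = makeZxid(2, 5);   // fifth proposal of epoch 2
        System.out.println(Long.toHexString(first) + " " + Long.toHexString(fifth));
    }
}
```

Because the epoch sits in the high bits, comparing two zxids as plain longs orders them by epoch first and by counter second, which is what the vote comparison in 13 relies on.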
11. The zab election process.
Called from 5 above. Enter the FastLeaderElection#lookForLeader() method. The code is as follows:
public Vote lookForLeader() throws InterruptedException {
try {
//bean used for monitoring and statistics
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
...
}
self.start_fle = Time.currentElapsedTime();
try {
/*
* The votes from the current leader election are stored in recvset. In other words, a vote v is in recvset
* if v.electionEpoch == logicalclock. The current participant uses recvset to deduce on whether a majority
* of participants has voted for it.
*/
//as the English comment above says, this stores the votes of the current election round
Map<Long, Vote> recvset = new HashMap<Long, Vote>();
/*
* The votes from previous leader elections, as well as the votes from the current leader election are
* stored in outofelection. Note that notifications in a LOOKING state are not stored in outofelection.
* Only FOLLOWING or LEADING notifications are stored in outofelection. The current participant could use
* outofelection to learn which participant is the leader if it arrives late (i.e., higher logicalclock than
* the electionEpoch of the received notifications) in a leader election.
*/
//votes from servers that are already FOLLOWING or LEADING, from previous elections as well as the current one
Map<Long, Vote> outofelection = new HashMap<Long, Vote>();
int notTimeout = minNotificationInterval;
synchronized (this) {
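//advance the logical clock (the election round)
logicalclock.incrementAndGet();
//fill in the proposal fields: the id of the proposed leader (a server votes for itself first at startup), the latest zxid and the latest epoch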
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info(
"New election. My id = {}, proposed zxid=0x{}",
self.getId(),
Long.toHexString(proposedZxid));
//send this server's vote to the other servers; each server is called a peer
sendNotifications();
SyncedLearnerTracker voteSet;
/*
* Loop in which we exchange notifications until we find a leader
*/
//loop: receive votes from the other servers and evaluate them until a leader has been elected
while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if (n == null) {
if (manager.haveDelivered()) {
sendNotifications();
} else {
manager.connectAll();
}
/*
* Exponential backoff
*/
int tmpTimeOut = notTimeout * 2;
notTimeout = Math.min(tmpTimeOut, maxNotificationInterval);
LOG.info("Notification time out: {}", notTimeout);
} else if (validVoter(n.sid) && validVoter(n.leader)) {
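//the condition above checks that the sender's sid and its proposed leader are both in the server id list configured in zoo.cfg
//when debugging I started server1 first and ran the debugged server as server3, without starting server2, so the message received here is from server1
//the message is of type Notification; see 12 below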
/*
* Only proceed if the vote comes from a replica in the current or next
* voting view for a replica in the current or next voting view.
*/
switch (n.state) {
case LOOKING:
if (getInitLastLoggedZxid() == -1) {
LOG.debug("Ignoring notification as our zxid is -1");
break;
}
if (n.zxid == -1) {
LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
break;
}
// If notification > current, replace and send messages out
//enter here when the sender's election epoch is greater than this server's. When debugging, server1's was 2 and mine was 1, so this branch was taken.
if (n.electionEpoch > logicalclock.get()) {
logicalclock.set(n.electionEpoch);
recvset.clear();
//the vote comparison of the zab election; it is simply one boolean expression, analyzed in 13 below
if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
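//when debugging, server1 and server3 had the same epoch and zxid, but server3 has the larger sid and becomes leader, so this branch was taken
//update this server's own vote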
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
//send this server's current vote to the other servers.
sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {
LOG.debug(
"Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
Long.toHexString(n.electionEpoch),
Long.toHexString(logicalclock.get()));
break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
LOG.debug(
"Adding vote: from={}, proposed leader={}, proposed zxid=0x{}, proposed election epoch=0x{}",
n.sid,
n.leader,
Long.toHexString(n.zxid),
Long.toHexString(n.electionEpoch));
// don't care about the version if it's in LOOKING state
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
if (voteSet.hasAllQuorums()) {
// Verify if there is any change in the proposed leader
while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
if (n == null) {
setPeerState(proposedLeader, voteSet);
Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
case OBSERVING:
LOG.debug("Notification from observer: {}", n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
if (n.electionEpoch == logicalclock.get()) {
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
if (voteSet.hasAllQuorums() && checkLeader(recvset, n.leader, n.electionEpoch)) {
setPeerState(n.leader, voteSet);
Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify that
* a majority are following the same leader.
*
* Note that the outofelection map also stores votes from the current leader election.
* See ZOOKEEPER-1732 for more information.
*/
outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
synchronized (this) {
logicalclock.set(n.electionEpoch);
setPeerState(n.leader, voteSet);
}
Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecoginized: {} (n.state), {}(n.sid)", n.state, n.sid);
break;
}
} else {
...
}
}
return null;
} finally {
...
}
}
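The election loop ends once voteSet.hasAllQuorums() holds for the proposed leader. A minimal sketch of that check under a simple-majority assumption follows; QuorumSketch and hasMajority() are hypothetical names, and the sketch compares only the proposed leader id, whereas the tracker in the excerpt is built from whole Vote objects and supports other quorum verifiers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simple-majority check over the votes received so far.
class QuorumSketch {
    // recvset maps the sid of each voter to the leader id it proposes.
    static boolean hasMajority(Map<Long, Long> recvset, long proposedLeader, int ensembleSize) {
        long acks = recvset.values().stream()
                .filter(leader -> leader == proposedLeader)
                .count();
        // Strictly more than half of the voting members must agree.
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // Debug scenario from the notes: 3 servers configured, server2 not started.
        Map<Long, Long> recvset = new HashMap<>();
        recvset.put(1L, 3L); // server1 now proposes server3
        recvset.put(3L, 3L); // server3 proposes itself
        System.out.println(hasMajority(recvset, 3L, 3));
    }
}
```

With three configured servers, two agreeing votes are enough, which is why the election in the debug session completes even though server2 was never started.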
12. The vote message type sent by other servers
public static class Notification {
/*
* Format version, introduced in 3.4.6
*/
public static final int CURRENTVERSION = 0x2;
//fixed value 2
int version;
/*
* Proposed leader
*/
//at startup server1 proposes itself, so this is 1
long leader;
/*
* zxid of the proposed leader
*/
//server1 created nodes after starting, so this carries a zxid value
long zxid;
/*
* Epoch
*/
//election round; server1 sent 2
long electionEpoch;
/*
* current state of sender
*/
//current state of the sending server; server1 also sent LOOKING (still looking for a leader)
QuorumPeer.ServerState state;
/*
* Address of sender
*/
//id of the sending server; 1 here
long sid;
QuorumVerifier qv;
/*
* epoch of the proposed leader
*/
//epoch of the proposed leader; server1 sent 1
long peerEpoch;
}
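13. How the zab election compares votes; called from 11 above
Enter FastLeaderElection#totalOrderPredicate(), which returns true if the new vote is larger. In my debug session server1 and server3 had equal epochs and zxids, but the debugged server3 has the larger sid, so server3 was chosen as leader.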
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
if (self.getQuorumVerifier().getWeight(newId) == 0) {
return false;
}
/*
* We return true if one of the following three cases hold:
* 1- New epoch is higher
* 2- New epoch is the same as current epoch, but new zxid is higher
* 3- New epoch is the same as current epoch, new zxid is the same
* as current zxid, but server id is higher.
*/
/*
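(1). If the new vote's epoch > the current vote's epoch, return true; otherwise go to (2)
(2). If the epochs are equal and the new vote's zxid is greater than the current vote's zxid, return true; otherwise go to (3)
(3). If the zxids are also equal and the new vote's sid is greater than the current vote's sid, return true. The sid is the number in server.1, server.2, etc. configured in zoo.cfg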
*/
return ((newEpoch > curEpoch)
|| ((newEpoch == curEpoch)
&& ((newZxid > curZxid)
|| ((newZxid == curZxid)
&& (newId > curId)))));
}
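The comparison can be run on its own to check the debug scenario. The sketch below copies the boolean expression but drops the quorum-weight check, since that needs a live QuorumVerifier; VoteOrderSketch and isBetter() are hypothetical names.

```java
// Hypothetical standalone version of the vote ordering (weight check omitted).
class VoteOrderSketch {
    // Returns true when the new vote should replace the current one.
    static boolean isBetter(long newId, long newZxid, long newEpoch,
                            long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
            || ((newEpoch == curEpoch)
                && ((newZxid > curZxid)
                    || ((newZxid == curZxid) && (newId > curId))));
    }

    public static void main(String[] args) {
        // Equal epoch and zxid: the larger sid (server3) wins over server1.
        System.out.println(isBetter(3, 0, 1, 1, 0, 1));
        // A higher zxid beats a larger sid.
        System.out.println(isBetter(1, 5, 1, 3, 0, 1));
        // A higher epoch beats both zxid and sid.
        System.out.println(isBetter(1, 0, 2, 3, 9, 1));
    }
}
```

The ordering is lexicographic on (epoch, zxid, sid), so the server with the most recent committed history is preferred and the sid only breaks ties.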
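II. Server main-method startup in standalone mode
The main logic is: parse the zoo.cfg file, initialize the data manager and the ServerCnxnFactory network manager, restore local data, and initialize the session manager, the JMX service, and so on.
(I). Call ManagedUtil.registerLog4jMBeans() to initialize Log4J (register it with JMX).
(II). Call ServerConfig.parse() to parse the zoo.cfg configuration file. This enters QuorumPeerConfig.parse(), which loads zoo.cfg as a properties file and stores the values in the fields of the QuorumPeerConfig class. Internally, parse() calls setupQuorumPeerConfig() to read serverId (the number configured in the myid file) and to decide whether this machine is a PARTICIPANT or an OBSERVER; the default is PARTICIPANT.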
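(III). Call ZooKeeperServerMain.runFromConfig(ServerConfig)
1. Initialize the FileTxnSnapLog (transaction log and snapshots) from the dataDir and dataLogDir directories.
2. Initialize the ZooKeeperServer.
3. Initialize and start the AdminServer management service on port 8080, served by a Jetty server. The class that handles HTTP requests is JettyAdminServer.CommandServlet.
4. Create the ServerCnxnFactory, ZooKeeper's connection factory. The implementation class is NIOServerCnxnFactory. Call NIOServerCnxnFactory.startup() to start it.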
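(1). Inside the connection factory, create the SelectorThread worker threads, which wait for incoming network connections. When a connection comes in, it is tracked in the NIOServerCnxnFactory.cnxnExpiryQueue queue.
(2). Create the ZKDatabase in-memory database, i.e. the ZNode file system. The ZKDatabase constructor calls createDataTree() to create the node tree; each node is a DataNode. WatchManagerFactory.createWatchManager() creates the WatchManager instances that track child changes and data changes and assigns them to fields of DataTree. ZKDatabase.loadDataBase() then enters FileTxnSnapLog.restore() to restore the node tree from files. restore() walks the snapshot files under data/version-2/, reads them through the jute BinaryInputArchive, and calls FileSnap.deserialize() to load them into the DataTree object, which is mostly file-offset handling. It calls FileTxnSnapLog.fastForwardFromEdits() to obtain the zxid, and FileTxnSnapLog.save() to save the current snapshot.
(3). Call ZooKeeperServer.createSessionTracker() to create the session manager.
(4). Call ZooKeeperServer.setupRequestProcessors() to build the chain of responsibility that handles requests. The processors are created in the order FinalRequestProcessor, SyncRequestProcessor, PrepRequestProcessor, so a request passes through them in the reverse order: PrepRequestProcessor first, FinalRequestProcessor last. The chain has no central manager; each processor holds a reference to the next one. Processing is asynchronous: a request is put on the processor's internal queue, and a dedicated thread takes requests off the queue one by one in a while loop and handles them.
(5). Call ZooKeeperServer.registerJMX() to register the JMX beans.
(6). Call ZooKeeperServer.registerMetrics() to register the statistics. DefaultMetricsProvider.DefaultMetricsContext.registerGauge() registers a method reference for each metric, using the :: syntax introduced in Java 8.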
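5. Create the ContainerManager.
6. Call shutdownLatch.await() to wait for zookeeper to shut down.
7. When started through QuorumPeerMain, the default role is LearnerType.PARTICIPANT. QuorumPeer.start() calls QuorumPeer.startLeaderElection() to begin the election; for the election itself see "V. Leader election/zab". It then enters the QuorumPeer.run() loop, which runs until the process exits and executes different logic depending on the server's role.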
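The request-processor chain from step 4 (4) can be sketched with a queue and a worker thread per stage. This is a simplified illustration, not the ZooKeeper classes: ProcessorChainSketch, QueueingProcessor and runDemo() are hypothetical names, requests are plain strings, and the last stage runs synchronously on the caller's thread as a simplification of this sketch.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical miniature of a chain where each stage only knows the next one.
class ProcessorChainSketch {
    interface RequestProcessor {
        void processRequest(String request);
    }

    // Asynchronous stage: enqueue on the caller's thread, handle on its own thread.
    static final class QueueingProcessor extends Thread implements RequestProcessor {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final RequestProcessor next;
        private final String tag;

        QueueingProcessor(String tag, RequestProcessor next) {
            super(tag);
            this.tag = tag;
            this.next = next;
            setDaemon(true); // do not keep the JVM alive for the demo
        }

        @Override
        public void processRequest(String request) {
            queue.add(request);
        }

        @Override
        public void run() {
            try {
                while (true) {
                    String request = queue.take();
                    // Mark the request so the path through the chain is visible.
                    next.processRequest(request + ">" + tag);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    static List<String> runDemo(String... requests) {
        List<String> done = new CopyOnWriteArrayList<>();
        CountDownLatch latch = new CountDownLatch(requests.length);
        // Built back to front, like the creation order described above.
        RequestProcessor finalProcessor = r -> { done.add(r); latch.countDown(); };
        QueueingProcessor sync = new QueueingProcessor("sync", finalProcessor);
        QueueingProcessor prep = new QueueingProcessor("prep", sync);
        sync.start();
        prep.start();
        for (String r : requests) {
            prep.processRequest(r);
        }
        try {
            if (!latch.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for the chain");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return done;
    }

    public static void main(String[] args) {
        System.out.println(runDemo("create /a", "delete /a"));
    }
}
```

Because each stage has a single worker thread and a FIFO queue, requests leave the chain in the order they entered it, which is the property a transaction pipeline needs.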