I once ran Hadoop's fsck from a Hadoop client and it failed with a socket error — it simply couldn't connect. A colleague pointed out, with some amusement, that I hadn't configured the HTTP address: fsck, it turns out, is a tool that works over HTTP. Fair enough — since I clearly didn't know that, let's dig into how fsck is actually implemented.
elif [ "$COMMAND" = "fsck" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSck
Looking at the hadoop launcher script again, the fsck tool's entry point is org.apache.hadoop.hdfs.tools.DFSck:

public static void main(String[] args) throws Exception {
  // -files option is also used by GenericOptionsParser
  // Make sure that is not the first argument for fsck
  int res = -1;
  if ((args.length == 0) || ("-files".equals(args[0])))
    printUsage();
  else
    res = ToolRunner.run(new DFSck(new Configuration()), args);
  System.exit(res);
}
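The main method above just hands the real work to ToolRunner. The Tool/ToolRunner contract that DFSck relies on can be approximated with this self-contained analog (the names mirror Hadoop's org.apache.hadoop.util classes, but this is only a sketch, not the real API):

```java
// Simplified analog of Hadoop's Tool/ToolRunner contract: a tool receives the
// command-line args and returns an exit code from run(), which the runner
// passes back so main() can feed it to System.exit, as DFSck.main does.
public class ToolPattern {
    interface Tool {
        int run(String[] args) throws Exception;
    }

    // Hadoop's real ToolRunner also parses generic options (-conf, -D, ...)
    // before delegating; this sketch skips that step.
    static int runTool(Tool tool, String[] args) throws Exception {
        return tool.run(args);
    }

    public static void main(String[] args) throws Exception {
        // Like DFSck: empty args is a usage error (-1), otherwise success (0)
        Tool fsckLike = a -> a.length == 0 ? -1 : 0;
        System.out.println(runTool(fsckLike, new String[]{"/"}));  // 0
        System.out.println(runTool(fsckLike, new String[]{}));     // -1
    }
}
```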
From main you can see that fsck is implemented by DFSck, which implements the Tool interface and overrides the run method. From its constructor you can tell it needs the HDFS configuration (which tells it where the NameNode is) plus the arguments the user passed in. Enough preamble — let's step inside run().
String proto = "http://";
if (UserGroupInformation.isSecurityEnabled()) {
  System.setProperty("https.cipherSuites",
      Krb5AndCertsSslSocketConnector.KRB5_CIPHER_SUITES.get(0));
  proto = "https://";
} // choose http or https depending on whether security is enabled -- so it really does go over the web
final StringBuffer url = new StringBuffer(proto);
url.append(NameNode.getInfoServer(getConf())).append("/fsck?ugi=")
    .append(ugi.getShortUserName()).append("&path=");
// The "/fsck" path shows that the NameNode implements fsck as a servlet;
// this command-line tool merely submits a request to that endpoint.
String dir = "/";
// find top-level dir first
for (int idx = 0; idx < args.length; idx++) {
  if (!args[idx].startsWith("-")) { dir = args[idx]; break; }
}
url.append(URLEncoder.encode(dir, "UTF-8"));
for (int idx = 0; idx < args.length; idx++) {
  if (args[idx].equals("-move")) { url.append("&move=1"); }
  else if (args[idx].equals("-delete")) { url.append("&delete=1"); }
  else if (args[idx].equals("-files")) { url.append("&files=1"); }
  else if (args[idx].equals("-openforwrite")) { url.append("&openforwrite=1"); }
  else if (args[idx].equals("-blocks")) { url.append("&blocks=1"); }
  else if (args[idx].equals("-locations")) { url.append("&locations=1"); }
  else if (args[idx].equals("-racks")) { url.append("&racks=1"); }
}
URL path = new URL(url.toString());
SecurityUtil.fetchServiceTicket(path);
URLConnection connection = path.openConnection();
InputStream stream = connection.getInputStream();
The code above shows that this tool class is quite simple: it calls the NameNode's fsck servlet interface by encoding the user's arguments into a URL, submits the request, and prints the response to the screen. No wonder fsck couldn't run without the HTTP address configured! A look at getInfoServer makes this clear.
public static String getInfoServer(Configuration conf) {
  String http = UserGroupInformation.isSecurityEnabled() ? "dfs.https.address"
      : "dfs.http.address";
  return NetUtils.getServerAddress(conf, "dfs.info.bindAddress",
      "dfs.info.port", http);
}
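Putting the client side together: the URL that DFSck assembles can be sketched with plain JDK code. The info-server address and user name below are hypothetical stand-ins for what getInfoServer(conf) and ugi.getShortUserName() would return:

```java
import java.net.URLEncoder;

// Sketch of DFSck's URL assembly (non-secure case): base address + /fsck
// servlet path + ugi + encoded target path + one query flag per option.
public class FsckUrlSketch {
    static String buildFsckUrl(String infoServer, String user, String dir,
                               boolean blocks, boolean locations) throws Exception {
        StringBuilder url = new StringBuilder("http://");
        url.append(infoServer)
           .append("/fsck?ugi=").append(user)
           .append("&path=").append(URLEncoder.encode(dir, "UTF-8"));
        if (blocks)    url.append("&blocks=1");    // mirrors the -blocks flag
        if (locations) url.append("&locations=1"); // mirrors the -locations flag
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        // "nn-host:50070" is a made-up info-server address for illustration
        System.out.println(buildFsckUrl("nn-host:50070", "hadoop", "/tmp", true, false));
        // prints: http://nn-host:50070/fsck?ugi=hadoop&path=%2Ftmp&blocks=1
    }
}
```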
This NameNode info server is the same Jetty HTTP server that serves the web UI pages — in fact, it is the only HTTP server the NameNode starts. OK, so the question of what fsck really does now becomes a question of reading the fsck servlet.
To see which servlets the NameNode provides, look at the NameNode startup code and find where the HTTP server is brought up. There you can see that the servlet for fsck is called FsckServlet:
httpServer.addInternalServlet("fsck", "/fsck", FsckServlet.class, true);
httpServer.addInternalServlet("getimage", "/getimage",
    GetImageServlet.class, true);
httpServer.addInternalServlet("listPaths", "/listPaths/*",
    ListPathsServlet.class, false);
httpServer.addInternalServlet("data", "/data/*",
    FileDataServlet.class, false);
httpServer.addInternalServlet("checksum", "/fileChecksum/*",
    FileChecksumServlets.RedirectServlet.class, false);
httpServer.addInternalServlet("contentSummary", "/contentSummary/*",
    ContentSummaryServlet.class, false);
This also shows that HDFS provides no plugin mechanism for servlets here, unlike azkaban. Let's continue and see what FsckServlet does:
public Object run() throws Exception {
  final NameNode nn = (NameNode) context.getAttribute("name.node");
  final int totalDatanodes =
      nn.namesystem.getNumberOfDatanodes(DatanodeReportType.LIVE);
  final short minReplication = nn.namesystem.getMinReplication();
  new NamenodeFsck(conf, nn, nn.getNetworkTopology(), pmap, out,
      totalDatanodes, minReplication, remoteAddress).fsck();
  return null;
}
Also very simple: grab the NameNode object, ask how many DataNodes are currently alive, and get the system's minimum replication factor — all of which is managed by the FSNamesystem object. These, together with the parsed request parameters and the caller's address, become constructor arguments for NamenodeFsck, and once constructed its fsck method is invoked. So NamenodeFsck is the class dedicated to doing fsck.
Before reading further code, let's run an fsck ourselves and keep the output at hand, so we can match the returned report against what NamenodeFsck does.
OK, call that output Figure 1. Set it aside for now — it's the familiar fsck report — and let's look at NamenodeFsck's fsck method:
public void fsck() {
  final long startTime = System.currentTimeMillis();
  try {
    String msg = "FSCK started by " + UserGroupInformation.getCurrentUser()
        + " from " + remoteAddress + " for path " + path + " at " + new Date();
    LOG.info(msg); // the user's fsck operation is recorded in the NameNode log
    out.println(msg); // and this exact line does appear in Figure 1
    namenode.getNamesystem().logFsckEvent(path, remoteAddress); // audit log
    Result res = new Result(conf);
    // Resolve the path to its inode; how a path name is resolved to an inode
    // was covered earlier: http://blog.csdn.net/tracymkgld/article/details/17553173
    // HdfsFileStatus is just a wrapper around the inode's attributes.
    final HdfsFileStatus file = namenode.getFileInfo(path);
    if (file != null) {
      check(path, file, res); // check() accumulates its findings into Result
      out.println(res);
      out.println(" Number of data-nodes:\t\t" + totalDatanodes); // live DataNodes
      out.println(" Number of racks:\t\t" + networktopology.getNumOfRacks()); // racks
      // so fsck can report the live DataNode count and the rack count
      out.println("FSCK ended at " + new Date() + " in "
          + (System.currentTimeMillis() - startTime + " milliseconds"));
      // DFSck client scans for the string HEALTHY/CORRUPT to check the status
      // of file system and return appropriate code. Changing the output string
      // might break testcases. Also note this must be the last line
      // of the report.
      if (res.isHealthy()) {
        out.print("\n\nThe filesystem under path '" + path + "' " + HEALTHY_STATUS);
      } else {
        out.print("\n\nThe filesystem under path '" + path + "' " + CORRUPT_STATUS);
      }
    } else {
      out.print("\n\nPath '" + path + "' " + NONEXISTENT_STATUS);
    }
  } catch (Exception e) {
    String errMsg = "Fsck on path '" + path + "' " + FAILURE_STATUS;
    LOG.warn(errMsg, e);
    out.println("FSCK ended at " + new Date() + " in "
        + (System.currentTimeMillis() - startTime + " milliseconds"));
    out.println(e.getMessage());
    out.print("\n\n" + errMsg);
  } finally {
    out.close();
  }
}
Let's see what HdfsFileStatus carries:
private static HdfsFileStatus createFileStatus(byte[] path, INode node) {
  // length is zero for directories
  return new HdfsFileStatus(
      node.isDirectory() ? 0 : ((INodeFile) node).computeContentSummary().getLength(),
      node.isDirectory(),
      node.isDirectory() ? 0 : ((INodeFile) node).getReplication(),
      node.isDirectory() ? 0 : ((INodeFile) node).getPreferredBlockSize(),
      node.getModificationTime(),
      node.getAccessTime(),
      node.getFsPermission(),
      node.getUserName(),
      node.getGroupName(),
      path);
}
Alright, now for check(). It's rather long, but the principle is simple: determine how many blocks the file has, each block's location information, whether any block is corrupt, and so on. All of this comes from the block reports the DataNodes send to the NameNode, so retrieving it is trivial for the NameNode — no need to go into detail here. We'll sort out the block-report machinery another time.
private void check(String parent, HdfsFileStatus file, Result res) throws IOException {
  // Reassemble the full path name you asked to fsck (it was converted to
  // bytes earlier for the inode lookup)
  String path = file.getFullName(parent);
  boolean isOpen = false;
  if (file.isDir()) { // the target of the check is a directory
    byte[] lastReturnedName = HdfsFileStatus.EMPTY_NAME;
    DirectoryListing thisListing;
    if (showFiles) {
      out.println(path + " <dir>");
    }
    res.totalDirs++;
    do {
      assert lastReturnedName != null;
      thisListing = namenode.getListing(path, lastReturnedName);
      if (thisListing == null) {
        return;
      }
      HdfsFileStatus[] files = thisListing.getPartialListing();
      for (int i = 0; i < files.length; i++) {
        check(path, files[i], res); // recurse into each child
      }
      lastReturnedName = thisListing.getLastName();
    } while (thisListing.hasMore());
    return;
  }
  long fileLen = file.getLen();
  // Get block locations without updating the file access time
  // and without block access tokens
  LocatedBlocks blocks = namenode.getNamesystem().getBlockLocations(path, 0,
      fileLen, false, false);
  if (blocks == null) { // the file is deleted
    return;
  }
  isOpen = blocks.isUnderConstruction();
  if (isOpen && !showOpenFiles) {
    // We collect these stats about open files to report with default options
    res.totalOpenFilesSize += fileLen;
    res.totalOpenFilesBlocks += blocks.locatedBlockCount();
    res.totalOpenFiles++;
    return;
  }
  res.totalFiles++;
  res.totalSize += fileLen;
  res.totalBlocks += blocks.locatedBlockCount();
  if (showOpenFiles && isOpen) {
    out.print(path + " " + fileLen + " bytes, " +
        blocks.locatedBlockCount() + " block(s), OPENFORWRITE: ");
  } else if (showFiles) {
    out.print(path + " " + fileLen + " bytes, " +
        blocks.locatedBlockCount() + " block(s): ");
  } else {
    out.print('.');
  }
  if (res.totalFiles % 100 == 0) { out.println(); out.flush(); }
  int missing = 0;
  int corrupt = 0;
  long missize = 0;
  int underReplicatedPerFile = 0;
  int misReplicatedPerFile = 0;
  StringBuffer report = new StringBuffer();
  int i = 0;
  for (LocatedBlock lBlk : blocks.getLocatedBlocks()) {
    Block block = lBlk.getBlock();
    boolean isCorrupt = lBlk.isCorrupt();
    String blkName = block.toString();
    DatanodeInfo[] locs = lBlk.getLocations();
    res.totalReplicas += locs.length;
    short targetFileReplication = file.getReplication();
    if (locs.length > targetFileReplication) {
      res.excessiveReplicas += (locs.length - targetFileReplication);
      res.numOverReplicatedBlocks += 1;
    }
    // Check if block is Corrupt
    if (isCorrupt) {
      corrupt++;
      res.corruptBlocks++;
      out.print("\n" + path + ": CORRUPT block " + block.getBlockName() + "\n");
    }
    if (locs.length >= minReplication)
      res.numMinReplicatedBlocks++;
    if (locs.length < targetFileReplication && locs.length > 0) {
      res.missingReplicas += (targetFileReplication - locs.length);
      res.numUnderReplicatedBlocks += 1;
      underReplicatedPerFile++;
      if (!showFiles) {
        out.print("\n" + path + ": ");
      }
      out.println(" Under replicated " + block +
          ". Target Replicas is " +
          targetFileReplication + " but found " +
          locs.length + " replica(s).");
    }
    // verify block placement policy
    int missingRacks = ReplicationTargetChooser.verifyBlockPlacement(
        lBlk, targetFileReplication, networktopology);
    if (missingRacks > 0) {
      res.numMisReplicatedBlocks++;
      misReplicatedPerFile++;
      if (!showFiles) {
        if (underReplicatedPerFile == 0)
          out.println();
        out.print(path + ": ");
      }
      out.println(" Replica placement policy is violated for " +
          block +
          ". Block should be additionally replicated on " +
          missingRacks + " more rack(s).");
    }
    report.append(i + ". " + blkName + " len=" + block.getNumBytes());
    if (locs.length == 0) {
      report.append(" MISSING!");
      res.addMissing(block.toString(), block.getNumBytes());
      missing++;
      missize += block.getNumBytes();
    } else {
      report.append(" repl=" + locs.length);
      if (showLocations || showRacks) {
        StringBuffer sb = new StringBuffer("[");
        for (int j = 0; j < locs.length; j++) {
          if (j > 0) { sb.append(", "); }
          if (showRacks)
            sb.append(NodeBase.getPath(locs[j]));
          else
            sb.append(locs[j]);
        }
        sb.append(']');
        report.append(" " + sb.toString());
      }
    }
    report.append('\n');
    i++;
  }
  if ((missing > 0) || (corrupt > 0)) {
    if (!showFiles && (missing > 0)) {
      out.print("\n" + path + ": MISSING " + missing
          + " blocks of total size " + missize + " B.");
    }
    res.corruptFiles++;
    switch (fixing) {
      case FIXING_NONE:
        break;
      case FIXING_MOVE:
        if (!isOpen)
          lostFoundMove(parent, file, blocks);
        break;
      case FIXING_DELETE:
        if (!isOpen)
          namenode.delete(path, true);
    }
  }
  if (showFiles) {
    if (missing > 0) {
      out.print(" MISSING " + missing + " blocks of total size " + missize + " B\n");
    } else if (underReplicatedPerFile == 0 && misReplicatedPerFile == 0) {
      out.print(" OK\n");
    }
    if (showBlocks) {
      out.print(report.toString() + "\n");
    }
  }
}
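The per-block bookkeeping inside that loop boils down to comparing each block's live replica count against the file's target replication. Here is a self-contained model of just that accounting — the field names echo NamenodeFsck's Result, but this is an illustrative sketch, not HDFS code:

```java
// For each block: 0 replicas -> missing; fewer than target -> under-replicated
// (also count how many replicas are short); more than target -> over-replicated
// (also count the surplus). This mirrors how check() fills its Result fields.
public class ReplicaAudit {
    int missingBlocks, underReplicated, overReplicated;
    int missingReplicas, excessReplicas;

    void audit(int[] replicaCounts, int targetReplication) {
        for (int replicas : replicaCounts) {
            if (replicas == 0) {
                missingBlocks++;
            } else if (replicas < targetReplication) {
                underReplicated++;
                missingReplicas += targetReplication - replicas;
            } else if (replicas > targetReplication) {
                overReplicated++;
                excessReplicas += replicas - targetReplication;
            }
        }
    }

    public static void main(String[] args) {
        ReplicaAudit a = new ReplicaAudit();
        // one healthy block, one under-replicated, one missing, one over-replicated
        a.audit(new int[]{3, 1, 0, 4}, 3);
        System.out.println("under=" + a.underReplicated + " missing=" + a.missingBlocks
            + " over=" + a.overReplicated);  // under=1 missing=1 over=1
    }
}
```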
Summary:
fsck is an interface exposed by the NameNode itself and invoked as a servlet. The access method doesn't matter as long as a request reaches that endpoint: Hadoop's shell command submits it from Java via a tool class, but you can just as well assemble the URL yourself in a browser, for example:
Such a URL is equivalent to: hadoop fsck /tmp/hadoop/wordcountjavain -files -blocks -locations -racks (the -racks flag can be omitted; rack information appears even without it).
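To make the "any HTTP client will do" point concrete, here is a runnable toy: a local com.sun.net.httpserver stub standing in for the NameNode's FsckServlet, fetched with the same URLConnection pattern DFSck uses. The port, parameters, and report text are invented for illustration:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetSocketAddress;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class FsckOverHttpDemo {
    // The same fetch pattern DFSck uses: open the URL, read the text response.
    static String fetch(String urlStr) throws IOException {
        URLConnection conn = new URL(urlStr).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Toy stand-in for the NameNode info server's /fsck servlet.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/fsck", ex -> {
            byte[] body = "The filesystem under path '/' is HEALTHY"
                .getBytes(StandardCharsets.UTF_8);
            ex.sendResponseHeaders(200, body.length);
            ex.getResponseBody().write(body);
            ex.close();
        });
        server.start();
        String report = fetch("http://localhost:" + server.getAddress().getPort()
            + "/fsck?ugi=hadoop&path=%2F");
        System.out.print(report);  // the final HEALTHY line, like a real fsck report
        server.stop(0);
    }
}
```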
In essence, fsck reads from the namespace managed by the NameNode's FSNamesystem (via FSDirectory), together with the information the DataNodes have block-reported: inode attributes, block lists, replication counts, number of blocks, corruption state — all of it is read straight out of the NameNode's memory. The NameNode does not go find the corresponding blocks on the DataNodes or ask them to verify anything. Because the results are this "ready-made" in-memory state, what fsck tells you can be stale: only some time after you change a file will fsck reflect it accurately. For example, if I swap a block file on a DataNode for a corrupt one, the NameNode hasn't received the next block report yet, so fsck will not immediately show that the block is bad.
Deletions, however, do show up in fsck right away. As covered earlier, whether via trash or -skipTrash, a delete only marks the file for removal (a ledger is kept, and a cleanup thread periodically issues RPCs telling the corresponding DataNodes to delete the blocks), but it immediately updates the NameNode's in-memory state and namespace — including the block-report-derived block information, which is also modified by the delete.