Nutch Workflow: Fetch

1. Overview

Fetch takes URLs from the fetch list, fetches them, and parses them; along the way it produces the crawl_parse, crawl_fetch, parse_data, and parse_text directories. This article walks through the overall Fetch flow, focusing on how each of these directories is produced and what it contains. The producer/consumer model inside Fetch is not covered here.
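For orientation, a segment directory after a parse-enabled fetch typically looks roughly like the listing below (the timestamped segment name is just an example; crawl_generate is written earlier by the Generator, and content is only kept when content storing is enabled):

segments/20240101123000/
    crawl_generate/   entries selected for fetching
    crawl_fetch/      fetch status of each URL (CrawlDatum)
    content/          raw fetched content (Content)
    crawl_parse/      outlinks, signatures and parse metadata used to update the crawldb
    parse_data/       metadata and outlinks extracted from each page (ParseData)
    parse_text/       plain text extracted from each page (ParseText)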

2. Walkthrough

The fetch() method of the Fetcher class sets up the job that performs the fetch operation. Among its settings,

job.setOutputFormat(GeneralChannelFetcherOutputFormat.class);

is the important one: it controls how each of the output directories discussed below is produced. (GeneralChannelFetcherOutputFormat is a class modified from the stock Nutch source.)

The actual fetching is implemented in the run() method of the FetchThread class.

 

ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
ProtocolStatus status = output.getStatus();
Content content = output.getContent();

These lines fetch the raw page for the URL and place the result in a Content object.
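The same protocol layer can also be exercised on its own outside the fetcher, which makes it easy to see what ends up in the Content object. Below is a minimal sketch, assuming a stock Nutch 1.x classpath; the class name and the test URL are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.ProtocolStatus;
import org.apache.nutch.util.NutchConfiguration;

public class FetchOneUrl {
  public static void main(String[] args) throws Exception {
    String url = "http://example.com/";          // hypothetical test URL
    Configuration conf = NutchConfiguration.create();

    // Pick the protocol plugin (protocol-http, protocol-httpclient, ...) for this URL.
    Protocol protocol = new ProtocolFactory(conf).getProtocol(url);

    // Same call the fetcher thread makes, here with an empty CrawlDatum.
    ProtocolOutput output = protocol.getProtocolOutput(new Text(url), new CrawlDatum());
    ProtocolStatus status = output.getStatus();
    Content content = output.getContent();

    System.out.println("status = " + status);
    if (content != null) {
      System.out.println("type   = " + content.getContentType());
      System.out.println("bytes  = " + content.getContent().length);
    }
  }
}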

 

Next, based on the status code, the corresponding action is taken.

switch (status.getCode()) {

case ProtocolStatus.SUCCESS:        // got a page
  pstatus = output(fit.url, fit.datum, content, status,
      CrawlDatum.STATUS_FETCH_SUCCESS);

When the status is SUCCESS, the output() method is called first. output() is another important method, so let's look at it next.

private ParseStatus output(Text key, CrawlDatum datum,
                           Content content, ProtocolStatus pstatus, int status) {

    datum.setStatus(status);
    datum.setFetchTime(System.currentTimeMillis());
    if (pstatus != null) datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);
    // The lines above record the fetch state on the value, e.g. the fetch time.

    ParseResult parseResult = null;
    if (content != null) {
      Metadata metadata = content.getMetadata();
      // add segment to metadata
      metadata.set(Nutch.SEGMENT_NAME_KEY, segmentName);
      // add score to content metadata so that ParseSegment can pick it up.
      try {
        scfilters.passScoreBeforeParsing(key, datum, content);
      } catch (Exception e) {
        if (LOG.isWarnEnabled()) {
          e.printStackTrace(LogUtil.getWarnStream(LOG));
          LOG.warn("Couldn't pass score, url " + key + " (" + e + ")");
        }
      }
      /* Note: Fetcher will only follow meta-redirects coming from the
       * original URL. */
      if (parsing && status == CrawlDatum.STATUS_FETCH_SUCCESS) {
        try {
          parseResult = this.parseUtil.parse(content); // parse the fetched page source
        } catch (Exception e) {
          LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
        }

        if (parseResult == null) {
          byte[] signature =
            SignatureFactory.getSignature(getConf()).calculate(content,
                new ParseStatus().getEmptyParse(conf));
          datum.setSignature(signature);
        }
      }

      /* Store status code in content So we can read this value during
       * parsing (as a separate job) and decide to parse or not.
       */
      content.getMetadata().add(Nutch.FETCH_STATUS_KEY, Integer.toString(status));
    }

    // From here on, the class configured via setOutputFormat comes into play.
    try {
      output.collect(key, new NutchWritable(datum));
      if (content != null && storingContent)
        output.collect(key, new NutchWritable(content));
      if (parseResult != null) {
        for (Entry<Text, Parse> entry : parseResult) {
          Text url = entry.getKey();
          Parse parse = entry.getValue();
          ParseStatus parseStatus = parse.getData().getStatus();

          if (!parseStatus.isSuccess()) {
            LOG.warn("Error parsing: " + key + ": " + parseStatus);
            parse = parseStatus.getEmptyParse(getConf());
          }

          // Calculate page signature. For non-parsing fetchers this will
          // be done in ParseSegment
          byte[] signature =
            SignatureFactory.getSignature(getConf()).calculate(content, parse);
          // Ensure segment name and score are in parseData metadata
          parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,
              segmentName);
          parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
              StringUtil.toHexString(signature));
          // Pass fetch time to content meta
          parse.getData().getContentMeta().set(Nutch.FETCH_TIME_KEY,
              Long.toString(datum.getFetchTime()));
          if (url.equals(key))
            datum.setSignature(signature);
          try {
            scfilters.passScoreAfterParsing(url, content, parse);
          } catch (Exception e) {
            if (LOG.isWarnEnabled()) {
              e.printStackTrace(LogUtil.getWarnStream(LOG));
              LOG.warn("Couldn't pass score, url " + key + " (" + e + ")");
            }
          }
          output.collect(url, new NutchWritable(
              new ParseImpl(new ParseText(parse.getText()),
                            parse.getData(), parse.isCanonical())));
        }
      }
    } catch (IOException e) {
      if (LOG.isFatalEnabled()) {
        e.printStackTrace(LogUtil.getFatalStream(LOG));
        LOG.fatal("fetcher caught:" + e.toString());
      }
    }

 

The output.collect() calls are where the segment files actually get written. Let's see what GeneralChannelFetcherOutputFormat does.

 

public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
                                    final JobConf job,
                                    final String name,
                                    final Progressable progress) throws IOException {

    Path out = FileOutputFormat.getOutputPath(job);
    final Path fetch =
      new Path(new Path(out, CrawlDatum.FETCH_DIR_NAME),
          name); /* crawl_fetch is a MapFile of <key, datum> entries,
                  * i.e. each URL together with its fetch status. */
    final Path content =
      new Path(new Path(out, Content.DIR_NAME), name);

    final CompressionType compType =
      SequenceFileOutputFormat.getOutputCompressionType(job);

    final MapFile.Writer fetchOut =
      new MapFile.Writer(job, fs, fetch.toString(), Text.class, CrawlDatum.class,
          compType, progress);

    return new RecordWriter<Text, NutchWritable>() {
        private MapFile.Writer contentOut;
        private RecordWriter<Text, Parse> parseOut;

        {
          if (GeneralChannelFetcher.isStoringContent(job)) {
            contentOut = new MapFile.Writer(job, fs, content.toString(),
                                            Text.class, Content.class,
                                            compType, progress);
          }

          if (GeneralChannelFetcher.isParsing(job)) {
            parseOut = new GeneralChannelParseOutputFormat().getRecordWriter(fs, job, name, progress);
          }
        }

        public void write(Text key, NutchWritable value)
          throws IOException {

          Writable w = value.get();

          if (w instanceof CrawlDatum)
            fetchOut.append(key, w);
          else if (w instanceof Content)
            contentOut.append(key, w);
          else if (w instanceof Parse)
            parseOut.write(key, (Parse)w);
        }

        public void close(Reporter reporter) throws IOException {
          fetchOut.close();
          if (contentOut != null) {
            contentOut.close();
          }
          if (parseOut != null) {
            parseOut.close(reporter);
          }
        }

      };

  }
}

 

As the write() method shows, each record is routed to a different directory depending on the type of object wrapped inside the NutchWritable.

if (w instanceof CrawlDatum)
  fetchOut.append(key, w);
else if (w instanceof Content)
  contentOut.append(key, w);
else if (w instanceof Parse)
  parseOut.write(key, (Parse)w);

From this dispatch we can see that crawl_fetch holds the value side, i.e. the CrawlDatum with the URL's fetch status, while content holds the raw page source. The remaining files in the segment are produced by another class, GeneralChannelParseOutputFormat.
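Because crawl_fetch (and content) are MapFiles, they can be inspected directly. Below is a minimal sketch against the old Hadoop MapFile API this code is written for; the class name and the segment/part path are hypothetical. The stock bin/nutch readseg -dump command gives the same information in bulk.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class DumpCrawlFetch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path: <segment>/crawl_fetch/<partition written by one task>
    Path part = new Path("crawl/segments/20240101123000/crawl_fetch/part-00000");

    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    Text key = new Text();               // the URL
    CrawlDatum value = new CrawlDatum(); // its fetch status, fetch time, metadata
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}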

Now let's take a look at GeneralChannelParseOutputFormat.

public RecordWriter<Text, Parse> getRecordWriter(FileSystem fs, JobConf job,
                                    String name, Progressable progress) throws IOException {

    this.filters = new URLFilters(job);
    this.normalizers = new URLNormalizers(job, URLNormalizers.SCOPE_OUTLINK);
    this.scfilters = new ScoringFilters(job);
    final int interval = job.getInt("db.fetch.interval.default", 2592000);
    final boolean ignoreExternalLinks = job.getBoolean("db.ignore.external.links", false);
    int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
    final int maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE
                                                     : maxOutlinksPerPage;
    final CompressionType compType = SequenceFileOutputFormat.getOutputCompressionType(job);
    Path out = FileOutputFormat.getOutputPath(job);

    Path text = new Path(new Path(out, ParseText.DIR_NAME), name);
    Path data = new Path(new Path(out, ParseData.DIR_NAME), name);
    Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name);

    final String[] parseMDtoCrawlDB = job.get("db.parsemeta.to.crawldb", "").split(" *,*");

    final MapFile.Writer textOut =
      new MapFile.Writer(job, fs, text.toString(), Text.class, ParseText.class,
          CompressionType.RECORD, progress);

    final MapFile.Writer dataOut =
      new MapFile.Writer(job, fs, data.toString(), Text.class, ParseData.class,
          compType, progress);

    final SequenceFile.Writer crawlOut =
      SequenceFile.createWriter(fs, job, crawl, Text.class, CrawlDatum.class,
          compType, progress);

    return new RecordWriter<Text, Parse>() {

        public void write(Text key, Parse parse)
          throws IOException {
          String[] secondleveldoamin = new String[]{"org","com","edu","net","ac","gov"};
          String fromUrl = key.toString();
          String fromHost = null;
          String toHost = null;
          String fromdomain = null;
          String todomain = null;
          textOut.append(key, new ParseText(parse.getText()));

          ParseData parseData = parse.getData();
          // recover the signature prepared by Fetcher or ParseSegment
          String sig = parseData.getContentMeta().get(Nutch.SIGNATURE_KEY);
          if (sig != null) {
            byte[] signature = StringUtil.fromHexString(sig);
            if (signature != null) {
              // append a CrawlDatum with a signature
              CrawlDatum d = new CrawlDatum(CrawlDatum.STATUS_SIGNATURE, 0);
              d.setSignature(signature);
              crawlOut.append(key, d);
            }
          }

          // see if the parse metadata contain things that we'd like
          // to pass to the metadata of the crawlDB entry
          CrawlDatum parseMDCrawlDatum = null;
          for (String mdname : parseMDtoCrawlDB) {
            String mdvalue = parse.getData().getParseMeta().get(mdname);
            if (mdvalue != null) {
              if (parseMDCrawlDatum == null) parseMDCrawlDatum = new CrawlDatum(
                  CrawlDatum.STATUS_PARSE_META, 0);
              parseMDCrawlDatum.getMetaData().put(new Text(mdname),
                  new Text(mdvalue));
            }
          }
          if (parseMDCrawlDatum != null) crawlOut.append(key, parseMDCrawlDatum);

          try {
            ParseStatus pstatus = parseData.getStatus();
            if (pstatus != null && pstatus.isSuccess() &&
                pstatus.getMinorCode() == ParseStatus.SUCCESS_REDIRECT) {
              String newUrl = pstatus.getMessage();
              int refreshTime = Integer.valueOf(pstatus.getArgs()[1]);
              try {
                newUrl = normalizers.normalize(newUrl,
                    URLNormalizers.SCOPE_FETCHER);
              } catch (MalformedURLException mfue) {
                newUrl = null;
              }
              if (newUrl != null) newUrl = filters.filter(newUrl);
              String url = key.toString();
              if (newUrl != null && !newUrl.equals(url)) {
                String reprUrl =
                  URLUtil.chooseRepr(url, newUrl,
                                     refreshTime < Fetcher.PERM_REFRESH_TIME);
                CrawlDatum newDatum = new CrawlDatum();
                newDatum.setStatus(CrawlDatum.STATUS_LINKED);
                if (reprUrl != null && !reprUrl.equals(newUrl)) {
                  newDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY,
                                             new Text(reprUrl));
                }
                crawlOut.append(new Text(newUrl), newDatum);
              }
            }
          } catch (URLFilterException e) {
            // ignore
          }

          // collect outlinks for subsequent db update
          Outlink[] links = parseData.getOutlinks();
          int outlinksToStore = Math.min(maxOutlinks, links.length);
          if (ignoreExternalLinks) {
            try { /* Modified here: outlinks are filtered so that only links with the
                   * same domain are kept; stock Nutch keeps links with the same host. */
              fromHost = new URL(fromUrl).getHost().toLowerCase();
              String[] fromHosts = fromHost.split("\\.");
              int i = fromHosts.length - 1;
              if (fromHosts[i].equals("cn")) {
                for (int j = 0; j < secondleveldoamin.length; j++) {
                  if (fromHosts[i-1].equals(secondleveldoamin[j])) {
                    fromdomain = fromHosts[i-2];
                    break;
                  } else
                    continue;
                }
                if (fromdomain == null)
                  fromdomain = fromHosts[i-1];
              }
              if (fromHosts[i].equals("org") || fromHosts[i].equals("com")
                  || fromHosts[i].equals("net"))
                fromdomain = fromHosts[i-1];
            } catch (MalformedURLException e) {
              fromHost = null;
            }
          } else {
            fromHost = null;
          }

          int validCount = 0;
          CrawlDatum adjust = null;
          List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
          List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
          for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
            String toUrl = links[i].getToUrl();
            // ignore links to self (or anchors within the page)
            if (fromUrl.equals(toUrl)) {
              continue;
            }
            if (ignoreExternalLinks) {
              try {
                toHost = new URL(toUrl).getHost().toLowerCase();
                String[] toHosts = toHost.split("\\.");
                int k = toHosts.length - 1;
                if (toHosts[k].equals("cn")) {
                  for (int j = 0; j < secondleveldoamin.length; j++) {
                    if (toHosts[k-1].equals(secondleveldoamin[j])) {
                      todomain = toHosts[k-2];
                      break;
                    } else
                      continue;
                  }
                  if (todomain == null)
                    todomain = toHosts[k-1];
                }
                if (toHosts[k].equals("org") || toHosts[k].equals("com")
                    || toHosts[k].equals("net"))
                  todomain = toHosts[k-1];

              } catch (MalformedURLException e) {
                toHost = null;
              }
              if (todomain == null || !todomain.equals(fromdomain)) { // external links
                continue; // skip it
              }
//              if (toHost == null || !toHost.equals(fromHost)) {
//                continue;
//              }
            }
            try {
              toUrl = normalizers.normalize(toUrl,
                          URLNormalizers.SCOPE_OUTLINK); // normalize the url
              toUrl = filters.filter(toUrl);   // filter the url
              if (toUrl == null) {
                continue;
              }
            } catch (Exception e) {
              continue;
            }
            CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
            Text targetUrl = new Text(toUrl);
            try {
              scfilters.initialScore(targetUrl, target);
            } catch (ScoringFilterException e) {
              LOG.warn("Cannot filter init score for url " + key +
                       ", using default: " + e.getMessage());
              target.setScore(0.0f);
            }

            targets.add(new SimpleEntry(targetUrl, target));
            outlinkList.add(links[i]);
            validCount++;
          }
          try {
            // compute score contributions and adjustment to the original score
            adjust = scfilters.distributeScoreToOutlinks((Text)key, parseData,
                      targets, null, links.length);
          } catch (ScoringFilterException e) {
            LOG.warn("Cannot distribute score from " + key + ": " + e.getMessage());
          }
          for (Entry<Text, CrawlDatum> target : targets) {
            crawlOut.append(target.getKey(), target.getValue());
          }
          if (adjust != null) crawlOut.append(key, adjust);

          Outlink[] filteredLinks = outlinkList.toArray(new Outlink[outlinkList.size()]);
          parseData = new ParseData(parseData.getStatus(), parseData.getTitle(),
                                    filteredLinks, parseData.getContentMeta(),
                                    parseData.getParseMeta());
          dataOut.append(key, parseData);
          if (!parse.isCanonical()) {
            CrawlDatum datum = new CrawlDatum();
            datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
            String timeString = parse.getData().getContentMeta().get(Nutch.FETCH_TIME_KEY);
            try {
              datum.setFetchTime(Long.parseLong(timeString));
            } catch (Exception e) {
              LOG.warn("Can't read fetch time for: " + key);
              datum.setFetchTime(System.currentTimeMillis());
            }
            crawlOut.append(key, datum);
          }
        }

        public void close(Reporter reporter) throws IOException {
          textOut.close();
          dataOut.close();
          crawlOut.close();
        }

      };

  }

}

 

This class does roughly the following: it produces the crawl_parse, parse_text, and parse_data directories. parse_text holds the text extracted from each page. The most important part of crawl_parse is the outlink information extracted from the ParseData and written out as CrawlDatum records; these outlinks are marked with CrawlDatum.STATUS_LINKED.

crawl_parse also contains a few other kinds of entries, but if all you want are the outlinks, selecting the records flagged as LINKED is enough, as the sketch below shows.
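A minimal sketch of that idea: scan the crawl_parse SequenceFile and keep only records whose status is CrawlDatum.STATUS_LINKED. The class name and the segment/part path are hypothetical; the old Hadoop SequenceFile API matches the code above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class DumpOutlinks {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path: crawl_parse is a SequenceFile, one part file per task.
    Path part = new Path("crawl/segments/20240101123000/crawl_parse/part-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      // crawl_parse also holds signature (STATUS_SIGNATURE) and parse metadata
      // (STATUS_PARSE_META) entries; only the LINKED records are outlinks.
      if (datum.getStatus() == CrawlDatum.STATUS_LINKED) {
        System.out.println(url);
      }
    }
    reader.close();
  }
}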

There is also a parameter you can set here, ignoreExternalLinks. This boolean controls whether external outlinks are kept. Outlinks are what later update the contents of the crawldb; you can additionally set db.update.additions.allowed to control whether those outlinks are actually added to the crawldb.

When ignoreExternalLinks is set to true, you can change the selection rule to keep exactly the outlinks you want. Stock Nutch keeps outlinks on the same host; the code above keeps outlinks on the same domain.
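Note that the hand-rolled second-level-domain table above only covers .cn plus a few generic TLDs. Nutch already ships a suffix-aware helper, URLUtil.getDomainName, backed by its domain-suffixes definitions; assuming that utility is acceptable here, the same-domain test could be written roughly like this (a sketch of an alternative, not the modified code shown above):

import java.net.MalformedURLException;
import java.net.URL;
import org.apache.nutch.util.URLUtil;

public class DomainFilter {
  /** Returns true when both URLs share the same registered domain name. */
  public static boolean sameDomain(String fromUrl, String toUrl) {
    try {
      String fromDomain = URLUtil.getDomainName(new URL(fromUrl)).toLowerCase();
      String toDomain = URLUtil.getDomainName(new URL(toUrl)).toLowerCase();
      return fromDomain.equals(toDomain);
    } catch (MalformedURLException e) {
      return false; // treat unparsable URLs as external, i.e. skip them
    }
  }
}

This keeps the behaviour described above (same domain rather than same host) while delegating the suffix handling to Nutch's domain-suffix list.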

Once the outlinks have been selected, a new ParseData is constructed with the filtered outlink array as an argument, and that is what ends up in the parse_data directory.
