A Detailed Look at Lucene-Based Indexing in Nutch

1. The indexing workflow

1.1. The Nutch-related parts of crawl

1.1.1. Paths needed to build the index, and the paths of the generated index

Path linkDb = new Path(dir + "/linkdb");

Path segments = new Path(dir + "/segments");

Path indexes = new Path(dir + "/indexes");

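All of these paths are required to build the index and are configured in the main() method of Crawl: linkdb and segments are read during indexing, and indexes is where the Indexer writes its output. The path of the final merged index also has to be configured, as follows: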

Path index = new Path(dir + "/index");
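
To make the layout easier to picture, here is a minimal sketch that rebuilds the same directory structure with java.nio paths rather than Hadoop's Path class. The class name CrawlLayout is made up for illustration, and the crawldb location is an assumption, since its definition is not part of the excerpt above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: reproduces the crawl directory layout with java.nio paths
// instead of org.apache.hadoop.fs.Path, so it runs without Hadoop.
class CrawlLayout {
    final Path crawlDb;   // assumed to be <dir>/crawldb (not shown in the excerpt)
    final Path linkDb;    // link database
    final Path segments;  // holds one timestamp-named sub-directory per round
    final Path indexes;   // output of the Indexer, merged later
    final Path index;     // final merged index

    CrawlLayout(String dir) {
        Path root = Paths.get(dir);
        crawlDb = root.resolve("crawldb");
        linkDb = root.resolve("linkdb");
        segments = root.resolve("segments");
        indexes = root.resolve("indexes");
        index = root.resolve("index");
    }

    public static void main(String[] args) {
        CrawlLayout layout = new CrawlLayout("out");
        System.out.println(layout.segments);
        System.out.println(layout.indexes);
        System.out.println(layout.index);
    }
}
```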

1.1.2. How Nutch chooses the indexing backend

Near the start of the crawl code there are the following lines:

String indexerName = "lucene";

String solrUrl = null;

boolean isSolrIndex = StringUtils.equalsIgnoreCase(indexerName, "solr");

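Later in the code, the values of these variables decide which indexing backend is used, Lucene or Solr. In other words, you can switch backends by changing indexerName and solrUrl.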
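
The selection logic can be sketched without any Nutch or commons-lang dependency. The enum and method names below are hypothetical; the only behavior carried over from the excerpt is the case-insensitive comparison against "solr", with everything else falling back to Lucene.

```java
// Sketch only: mirrors the isSolrIndex check with plain JDK calls.
class IndexerChoice {
    enum Backend { LUCENE, SOLR }

    // Any name other than "solr" (ignoring case), including null, selects Lucene.
    static Backend choose(String indexerName) {
        return "solr".equalsIgnoreCase(indexerName) ? Backend.SOLR : Backend.LUCENE;
    }

    public static void main(String[] args) {
        System.out.println(choose("lucene"));
        System.out.println(choose("SOLR"));
    }
}
```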

1.1.3. Where indexing starts

Indexing begins with the following code.

FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));

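// listStatus is a method of Hadoop's FileSystem (see the Hadoop API for details). Here it collects the paths of the timestamp-named directories under segments.

if (isSolrIndex) { // As explained above, this check selects the indexing backend. This article covers Lucene indexing, so the Solr branch is not discussed further.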

SolrIndexer indexer = new SolrIndexer(conf);

indexer.indexSolr(solrUrl, crawlDb, linkDb,

Arrays.asList(HadoopFSUtil.getPaths(fstats)));

}

else {

// Lucene indexing starts here.

DeleteDuplicates dedup = new DeleteDuplicates(conf);

if(indexes != null) {

// Delete old indexes

if (fs.exists(indexes)) {

LOG.info("Deleting old indexes: " + indexes);

fs.delete(indexes, true);

}

// Delete old index

if (fs.exists(index)) {

LOG.info("Deleting old merged index: " + index);

fs.delete(index, true);

}

} // closes if (indexes != null)

Indexer indexer = new Indexer(conf);

indexer.index(indexes, crawlDb, linkDb,

Arrays.asList(HadoopFSUtil.getPaths(fstats)));

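/* This call shows which data is needed to build the index: the contents of crawlDb, linkDb, and segments. This is only a first look; later sections explain which sub-directories of each are used and what they are for, with the focus on segments. */

// HadoopFSUtil.getPaths(fstats) returns the segment paths (for example E:\out\segments\20120221153925) as a Path[],

// and Arrays.asList wraps that array in the List<Path> expected by the index method of Indexer: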

// public void index(Path luceneDir, Path crawlDb,

// Path linkDb, List<Path> segments)

IndexMerger merger = new IndexMerger(conf);

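/* The previous step produced the indexes directory. The code below deduplicates indexes and merges it into the final index. This article concentrates on the indexing step above; the merge step is left for readers to explore if they need it. */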

if(indexes != null) {

dedup.dedup(new Path[] { indexes });

fstats = fs.listStatus(indexes, HadoopFSUtil.getPassDirectoriesFilter(fs));

merger.merge(HadoopFSUtil.getPaths(fstats), index, tmpDir);

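}

} // closes the else (Lucene) branch

} else { // this else belongs to an enclosing if that lies outside this excerpt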

LOG.warn("No URLs to fetch - check your seed list and URL filters.");

}

if (LOG.isInfoEnabled()) { LOG.info("crawl finished: " + dir); }

}
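
The Lucene branch follows an index, dedup, merge sequence. As a rough illustration of the dedup step only, the sketch below keeps one document per content digest and prefers the higher boost. This is a deliberate simplification under stated assumptions: the real DeleteDuplicates runs as MapReduce jobs over the Lucene indexes, and the Doc class and the tie-breaking rule here are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: in-memory deduplication by content digest.
class DedupSketch {
    static final class Doc {
        final String url;
        final String digest; // content signature, like the "digest" field
        final float boost;   // score, like the "boost" field

        Doc(String url, String digest, float boost) {
            this.url = url;
            this.digest = digest;
            this.boost = boost;
        }
    }

    // Keep exactly one document per digest: the one with the highest boost.
    static List<Doc> dedupByDigest(List<Doc> docs) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc kept = best.get(d.digest);
            if (kept == null || d.boost > kept.boost) {
                best.put(d.digest, d);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("http://example.com/a", "d1", 1.0f));
        docs.add(new Doc("http://example.com/a?copy", "d1", 2.0f));
        docs.add(new Doc("http://example.com/b", "d2", 1.0f));
        for (Doc d : dedupByDigest(docs)) {
            System.out.println(d.url);
        }
    }
}
```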

1.1.4. The index method of Indexer

public void index(Path luceneDir, Path crawlDb,

Path linkDb, List<Path> segments)

/* This method runs a single job that builds indexes. To follow this distributed code you first need to understand MapReduce. */

throws IOException {

LOG.info("Indexer: starting");

final JobConf job = new NutchJob(getConf());

job.setJobName("index-lucene " + luceneDir);

// The most important call in this method; explained in 1.1.5.

IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);

FileOutputFormat.setOutputPath(job, luceneDir);

// The calls below deal with index management: how each field is stored and indexed.

LuceneWriter.addFieldOptions("segment", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, job);

LuceneWriter.addFieldOptions("digest", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, job);

LuceneWriter.addFieldOptions("boost", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, job);

NutchIndexWriterFactory.addClassToConf(job, LuceneWriter.class);

JobClient.runJob(job);

LOG.info("Indexer: done");

}
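
The three addFieldOptions calls register segment, digest, and boost as stored but not indexed: their values can be read back from a hit, but they cannot be searched. The sketch below models that distinction with a hypothetical registry class; it is not the LuceneWriter API, and the "content" field in main is an invented contrast case.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a tiny registry that models stored vs. indexed fields.
class FieldOptionsSketch {
    enum Store { YES, NO }
    enum Index { NO, TOKENIZED, UNTOKENIZED }

    private final Map<String, Store> store = new HashMap<>();
    private final Map<String, Index> index = new HashMap<>();

    void addFieldOptions(String field, Store s, Index i) {
        store.put(field, s);
        index.put(field, i);
    }

    // A field value can be returned with a hit only if it is stored.
    boolean isRetrievable(String field) {
        return store.get(field) == Store.YES;
    }

    // A field can be queried only if it is indexed in some form.
    boolean isSearchable(String field) {
        Index i = index.get(field);
        return i != null && i != Index.NO;
    }

    public static void main(String[] args) {
        FieldOptionsSketch options = new FieldOptionsSketch();
        options.addFieldOptions("segment", Store.YES, Index.NO);
        options.addFieldOptions("digest", Store.YES, Index.NO);
        options.addFieldOptions("boost", Store.YES, Index.NO);
        // Hypothetical field, added only to show the opposite combination.
        options.addFieldOptions("content", Store.NO, Index.TOKENIZED);

        System.out.println(options.isRetrievable("digest"));
        System.out.println(options.isSearchable("digest"));
        System.out.println(options.isSearchable("content"));
    }
}
```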

1.1.5. The initMRJob method

public static void initMRJob(Path crawlDb, Path linkDb,

Collection<Path> segments,

JobConf job) {

LOG.info("IndexerMapReduce: crawldb: " + crawlDb);

LOG.info("IndexerMapReduce: linkdb: " + linkDb);

for (final Path segment : segments) {

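/* Add the sub-directories of each segment as input paths of the job. All of them live under segments/xxxxxxx:

parse_text

parse_data

crawl_fetch

crawl_parse

They contain the following:

parse_text: the text extracted from each page, for example the body of a blog post.

parse_data: metadata parsed from each page, such as the title and outlinks, stored as ParseData objects.

crawl_fetch: the fetch status of each URL, stored as CrawlDatum objects. It is needed when updating the status information in crawldb.

crawl_parse: the outlink URLs found during parsing, stored as CrawlDatum objects and used to update crawldb. */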

LOG.info("IndexerMapReduces: adding segment: " + segment);

FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME));

FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME));

FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));

FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));

}

FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));

FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));

job.setInputFormat(SequenceFileInputFormat.class);

// The most important part is IndexerMapReduce.class, explained in 1.1.6.

job.setMapperClass(IndexerMapReduce.class);

job.setReducerClass(IndexerMapReduce.class);

job.setOutputFormat(IndexerOutputFormat.class);

job.setOutputKeyClass(Text.class);

job.setMapOutputValueClass(NutchWritable.class);

job.setOutputValueClass(NutchWritable.class);

}
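
To see what the job actually reads, the sketch below lists the input directories that initMRJob registers, using plain strings instead of Hadoop paths. The four segment sub-directory names come from the comment above; the "current" sub-directory for crawldb and linkdb is an assumption about the values of CrawlDb.CURRENT_NAME and LinkDb.CURRENT_NAME.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: enumerates the job's input directories as strings.
class IndexerInputs {
    static final String[] SEGMENT_PARTS = {
        "crawl_fetch", "crawl_parse", "parse_data", "parse_text"
    };

    static List<String> inputPaths(String crawlDb, String linkDb, List<String> segments) {
        List<String> inputs = new ArrayList<>();
        for (String segment : segments) {
            for (String part : SEGMENT_PARTS) {
                inputs.add(segment + "/" + part);
            }
        }
        // Assumed: both databases keep their live data under "current".
        inputs.add(crawlDb + "/current");
        inputs.add(linkDb + "/current");
        return inputs;
    }

    public static void main(String[] args) {
        List<String> segments = Arrays.asList("out/segments/20120221153925");
        for (String path : inputPaths("out/crawldb", "out/linkdb", segments)) {
            System.out.println(path);
        }
    }
}
```

Because every one of these directories is keyed by URL, the reducer later receives all records for the same URL together.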

1.1.6. The core distributed class

public class IndexerMapReduce extends Configured

implements Mapper<Text, Writable, Text, NutchWritable>,

Reducer<Text, NutchWritable, Text, NutchDocument> {

public static final Log LOG = LogFactory.getLog(IndexerMapReduce.class);

private IndexingFilters filters;

private ScoringFilters scfilters;

public void configure(JobConf job) {

setConf(job);

this.filters = new IndexingFilters(getConf());

/* Initialize Nutch's indexing plugins. This is the core of index management in Nutch: all Nutch indexing is done through the plugin mechanism. */

this.scfilters = new ScoringFilters(getConf());

}

public void map(Text key, Writable value,

OutputCollector<Text, NutchWritable> output, Reporter reporter) throws IOException {

output.collect(key, new NutchWritable(value));

}

/* The reduce method is the most important part. */

public void reduce(Text key, Iterator<NutchWritable> values,

OutputCollector<Text, NutchDocument> output, Reporter reporter)

throws IOException {

Inlinks inlinks = null;

CrawlDatum dbDatum = null;

CrawlDatum fetchDatum = null;

ParseData parseData = null;

ParseText parseText = null;

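/* Note that values is an Iterator, and that several input paths were added with FileInputFormat.addInputPath, so the values for one key can be of several different types. The loop below sorts each value by its type. A CrawlDatum is classified a second time by its status: the CrawlDatum class defines a set of static final constants, several for each stage (db, fetch, and so on), and each constant identifies one status. */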

while (values.hasNext()) {

final Writable value = values.next().get(); // unwrap

if (value instanceof Inlinks) {

inlinks = (Inlinks)value;

} else if (value instanceof CrawlDatum) {

final CrawlDatum datum = (CrawlDatum)value;

if (CrawlDatum.hasDbStatus(datum))

dbDatum = datum;

else if (CrawlDatum.hasFetchStatus(datum)) {

// don't index unmodified (empty) pages

if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)

fetchDatum = datum;

} else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||

CrawlDatum.STATUS_SIGNATURE == datum.getStatus() ||

CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {

continue;

} else {

throw new RuntimeException("Unexpected status: "+datum.getStatus());

}

} else if (value instanceof ParseData) {

parseData = (ParseData)value;

} else if (value instanceof ParseText) {

parseText = (ParseText)value;

} else if (LOG.isWarnEnabled()) {

LOG.warn("Unrecognized type: "+value.getClass());

}

} // end of while loop

if (fetchDatum == null || dbDatum == null

|| parseText == null || parseData == null) {

return; // only have inlinks

}

if (!parseData.getStatus().isSuccess() ||

fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {

return;

}

//創建索引

NutchDocument doc = new NutchDocument();

final Metadata metadata = parseData.getContentMeta();

// parse_data is represented by the ParseData class; its methods give access to the corresponding content.

// add segment, used to map from merged index back to segment files

doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

// add digest, used by dedup

doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

final Parse parse = new ParseImpl(parseText, parseData);

try {

// extract information from dbDatum and pass it to

// fetchDatum so that indexing filters can use it

final Text url = (Text) dbDatum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);

if (url != null) {

fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);

}

// run indexing filters

doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);

// The indexing plugins are invoked here to populate the document. Plugins are covered in section 2.

} catch (final IndexingException e) {

if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }

return;

}

// skip documents discarded by indexing filters

if (doc == null) return;

float boost = 1.0f;

// run scoring filters

try {

boost = this.scfilters.indexerScore(key, doc, dbDatum,

fetchDatum, parse, inlinks, boost);

} catch (final ScoringFilterException e) {

if (LOG.isWarnEnabled()) {

LOG.warn("Error calculating score " + key + ": " + e);

}

return;

}

// apply boost to all indexed fields.

doc.setScore(boost);

// store boost for use by explain and dedup

doc.add("boost", Float.toString(boost));

output.collect(key, doc);

}
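
The control flow of reduce can be condensed into a small stand-alone sketch: collect the values that share one URL key by type, give up unless all required pieces are present and successful, and only then build a document. All classes below are hypothetical stand-ins for the Nutch Writable types, and the returned string merely stands in for a NutchDocument.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Sketch only: type-based grouping of the values for one URL key.
class ReduceSketch {
    static final class DbDatum {}
    static final class Inlinks {}
    static final class FetchDatum {
        final boolean success;
        FetchDatum(boolean success) { this.success = success; }
    }
    static final class ParseData {
        final boolean success;
        final String title;
        ParseData(boolean success, String title) {
            this.success = success;
            this.title = title;
        }
    }
    static final class ParseText {
        final String text;
        ParseText(String text) { this.text = text; }
    }

    static Optional<String> reduce(String url, List<Object> values) {
        DbDatum db = null;
        FetchDatum fetch = null;
        ParseData data = null;
        ParseText text = null;
        for (Object v : values) { // classify each value by its runtime type
            if (v instanceof DbDatum) db = (DbDatum) v;
            else if (v instanceof FetchDatum) fetch = (FetchDatum) v;
            else if (v instanceof ParseData) data = (ParseData) v;
            else if (v instanceof ParseText) text = (ParseText) v;
        }
        // Skip the URL unless every required piece arrived.
        if (db == null || fetch == null || data == null || text == null) {
            return Optional.empty();
        }
        // Skip pages that failed to fetch or parse.
        if (!fetch.success || !data.success) {
            return Optional.empty();
        }
        return Optional.of(url + " | " + data.title + " | " + text.text);
    }

    public static void main(String[] args) {
        List<Object> complete = Arrays.<Object>asList(new DbDatum(), new FetchDatum(true),
                new ParseData(true, "Home"), new ParseText("hello"), new Inlinks());
        List<Object> inlinksOnly = Arrays.<Object>asList(new Inlinks());
        System.out.println(reduce("http://example.com/", complete).isPresent());
        System.out.println(reduce("http://example.com/", inlinksOnly).isPresent());
    }
}
```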

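2. Index management

2.1. Plugins

All indexing in Nutch is implemented as plugins. A plugin is enabled by adding it to the plugin.includes property in nutch-site.xml, i.e. the property declared with <name>plugin.includes</name>.

Take word segmentation (analysis) as an example.

First, the plugin's package contains a plugin.xml file with the following declaration: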

<plugin

id="analysis-zh"

name="Chinese Analysis Plug-in"

version="1.0.0"

provider-name="net.paoding.analysis">

The plugin entry in nutch-site.xml must match this id.
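
As a rough illustration of that matching, the sketch below treats the plugin.includes value as a regular expression and tests plugin ids against it as a whole. Both the full-match behavior and the example value are assumptions made for this sketch; check the configuration of your Nutch version for the exact semantics.

```java
import java.util.regex.Pattern;

// Sketch only: checks a plugin id against a plugin.includes-style regex.
class PluginIncludes {
    // The whole id must match the expression, not just a part of it.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical plugin.includes value that enables analysis-zh.
        String includes = "protocol-http|parse-(text|html)|index-basic|analysis-zh";
        System.out.println(isIncluded(includes, "analysis-zh"));
        System.out.println(isIncluded(includes, "analysis-fr"));
    }
}
```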

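2.2. Custom plugins

In an indexing plugin, the key step is obtaining the content you want to add to the index fields. For further details on indexing, see the book 《lucene+nutch搜索引擎開發》 (Lucene + Nutch Search Engine Development).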
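
The idea of a filter that contributes fields to a document can be sketched as follows. The DocFilter interface is a hypothetical, much simplified stand-in for a Nutch indexing filter: each filter receives the document built so far, may add fields, and may return null to discard the document, which matches the null check in the reduce method above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a chain of filters that add fields to a document.
class FilterChainSketch {
    interface DocFilter {
        Map<String, String> filter(Map<String, String> doc, String url, String text);
    }

    // Adds the page URL as a field.
    static final DocFilter ADD_URL = (doc, url, text) -> {
        doc.put("url", url);
        return doc;
    };

    // Adds a custom field; this is where a plugin extracts its own content.
    static final DocFilter ADD_LENGTH = (doc, url, text) -> {
        doc.put("length", Integer.toString(text.length()));
        return doc;
    };

    // Discards pages without any text by returning null.
    static final DocFilter DROP_EMPTY = (doc, url, text) -> text.isEmpty() ? null : doc;

    static Map<String, String> run(List<DocFilter> filters, String url, String text) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (DocFilter f : filters) {
            doc = f.filter(doc, url, text);
            if (doc == null) {
                return null; // a filter discarded the document
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        List<DocFilter> chain = Arrays.asList(DROP_EMPTY, ADD_URL, ADD_LENGTH);
        System.out.println(run(chain, "http://example.com/", "hello"));
        System.out.println(run(chain, "http://example.com/empty", ""));
    }
}
```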
