Integrating Nutch 1.3 with Hadoop 0.20.203.0

1. Installing Hadoop

                http://blog.csdn.net/deqingguo/article/details/6907372

2. Downloading and installing Nutch 1.3

                svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/  ~/nutch

                Alternatively, you can download it directly from http://labs.renren.com/apache-mirror//nutch/ ; I used version 1.3.
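                If you go the tarball route instead of svn, a rough sketch of fetching and unpacking it looks like this (the exact archive and directory names below are assumptions; use whatever the mirror actually lists for 1.3):

                # download and unpack the 1.3 release, then move it into ~/nutch
                wget http://labs.renren.com/apache-mirror//nutch/apache-nutch-1.3-src.tar.gz   # archive name assumed
                tar xzf apache-nutch-1.3-src.tar.gz
                mv apache-nutch-1.3 ~/nutch                                                    # extracted directory name assumed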

3. Edit nutch-site.xml under conf/

<configuration>
<property>
  <name>http.agent.name</name>
  <value>HD nutch agent</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  </description>
</property>
 
<property>
  <name>http.robots.agents</name>
  <value>HD nutch agent</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property> 
</configuration>


4. Copy all the files from Hadoop's conf/ directory into Nutch's conf/ directory, for example:
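                A minimal sketch, assuming Hadoop is installed under $HADOOP_HOME and the Nutch checkout lives in ~/nutch (both paths are assumptions from my setup):

                # copy core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves, etc.
                # so that the Nutch jobs run against the same HDFS/MapReduce settings
                cp $HADOOP_HOME/conf/* ~/nutch/conf/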

5. Rebuild Nutch with ant. If ant is not installed, you can get it directly with apt-get install ant.
            Note: without a rebuild, the changes to nutch-site.xml have no effect and the crawl fails with the Nutch error "Fetcher: No agents listed in 'http.agent.name' property".
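            A minimal sketch of the rebuild, assuming the source tree is in ~/nutch:

             sudo apt-get install ant     # only needed if ant is not installed yet
             cd ~/nutch
             ant                          # rebuilds the job jar and refreshes runtime/deploy
                                          # (if the default target does not, try "ant runtime")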
6. Go into runtime/deploy/bin and run the crawl (the seed list must already exist on HDFS; see the sketch after the command):
             ./nutch crawl hdfs://localhost:9000/user/fzuir/urls.txt -dir hdfs://localhost:9000/user/fzuir/crawled -depth 3 -topN 10
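        A sketch of putting the seed list at the HDFS path used above (the local urls.txt with a single made-up seed URL is just an example):

             echo "http://www.example.com/" > urls.txt       # hypothetical seed URL, one URL per line
             hadoop fs -put urls.txt /user/fzuir/urls.txt    # upload to the path passed to the crawl command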
        At this point, the crawl still fails with another error:
             NullPointerException at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs.

       This is caused by a bug in Nutch 1.3. The Nutch website mentions that it is fixed in the 1.4 release, but 1.4 has not been published yet, so following the hints on the website I patched the two Java files below myself and recompiled:

      The first file to modify is src/java/org/apache/nutch/parse/ParseOutputFormat.java:

 public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
-    Path out = FileOutputFormat.getOutputPath(job);
-    if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
-      throw new IOException("Segment already parsed!");
+      Path out = FileOutputFormat.getOutputPath(job);
+      if ((out == null) && (job.getNumReduceTasks() != 0)) {
+          throw new InvalidJobConfException(
+                  "Output directory not set in JobConf.");
+      }
+      if (fs == null) {
+          fs = out.getFileSystem(job);
+      }
+      if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
+          throw new IOException("Segment already parsed!");
   }


    The second file to modify is src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java:

import org.apache.hadoop.io.SequenceFile.CompressionType;
 
 import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.mapred.InvalidJobConfException;
 import org.apache.hadoop.mapred.OutputFormat;
 import org.apache.hadoop.mapred.RecordWriter;
 import org.apache.hadoop.mapred.JobConf;
@@ -46,8 +47,15 @@
 
   public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
     Path out = FileOutputFormat.getOutputPath(job);
+    if ((out == null) && (job.getNumReduceTasks() != 0)) {
+    	throw new InvalidJobConfException(
+    			"Output directory not set in JobConf.");
+    }
+    if (fs == null) {
+    	fs = out.getFileSystem(job);
+    }
     if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME)))
-      throw new IOException("Segment already fetched!");
+    	throw new IOException("Segment already fetched!");
   }
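    After patching both files, the fix only takes effect once Nutch is rebuilt and the crawl is rerun from the fresh runtime/deploy, roughly (same assumed paths as above):

             cd ~/nutch && ant                     # recompile with the patched classes
             hadoop fs -rmr /user/fzuir/crawled    # clean up output left by the failed attempt, if any
             cd runtime/deploy/bin
             ./nutch crawl hdfs://localhost:9000/user/fzuir/urls.txt -dir hdfs://localhost:9000/user/fzuir/crawled -depth 3 -topN 10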


7. Problem solved.