1. Installing Hadoop
http://blog.csdn.net/deqingguo/article/details/6907372
2. Downloading and installing Nutch 1.3
svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ ~/nutch
You can also download it directly from http://labs.renren.com/apache-mirror//nutch/ (I used version 1.3).
3. Edit nutch-site.xml under conf/
<configuration>
<property>
<name>http.agent.name</name>
<value>HD nutch agent</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>HD nutch agent</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
</configuration>
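Since the description says http.agent.name must not be empty, a quick check after editing can save a failed run. Below is a minimal sketch, assuming the <name> and <value> elements sit on separate lines as in the snippet above; the function name check_agent_name is made up for this example and is not part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): print the value configured for
# http.agent.name, or fail if it is missing or empty.
# Assumes <name> and <value> are on separate lines, as in the config above.
check_agent_name() {
    conf_file="$1"
    name=$(awk '
        /<name>http\.agent\.name<\/name>/ { found = 1; next }
        found && /<value>/ {
            # Strip everything up to <value> and from </value> onwards.
            gsub(/.*<value>|<\/value>.*/, "")
            print
            exit
        }' "$conf_file")
    if [ -z "$name" ]; then
        echo "http.agent.name is missing or empty in $conf_file" >&2
        return 1
    fi
    echo "http.agent.name = $name"
}
```

With the checkout location from step 2, it could be called as: check_agent_name ~/nutch/conf/nutch-site.xml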
4. Copy all the files from Hadoop's conf directory into Nutch's conf directory.
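The copy step can be sketched as a small shell function. This is only an illustration: the function name copy_hadoop_conf is made up here, and both directory paths are passed in as arguments because the post does not say where Hadoop is installed.

```shell
# Hypothetical helper: copy every plain file from a Hadoop conf directory
# into a Nutch conf directory. Files with the same name are overwritten.
copy_hadoop_conf() {
    hadoop_conf="$1"
    nutch_conf="$2"
    if [ ! -d "$hadoop_conf" ] || [ ! -d "$nutch_conf" ]; then
        echo "usage: copy_hadoop_conf HADOOP_CONF_DIR NUTCH_CONF_DIR" >&2
        return 1
    fi
    for f in "$hadoop_conf"/*; do
        # Skip subdirectories; only plain files are copied.
        if [ -f "$f" ]; then
            cp -p "$f" "$nutch_conf"/ || return 1
        fi
    done
}
```

For example, assuming HADOOP_HOME points at the Hadoop install and Nutch was checked out to ~/nutch as in step 2: copy_hadoop_conf "$HADOOP_HOME/conf" ~/nutch/conf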
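The problem comes from a bug in Nutch 1.3. The Nutch website says it is fixed in version 1.4, but 1.4 has not been released yet, so I followed the hints on the website, changed two Java files myself, and recompiled:
The first file to change is src/java/org/apache/nutch/parse/ParseOutputFormat.java: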
   public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
-    Path out = FileOutputFormat.getOutputPath(job);
-    if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
-      throw new IOException("Segment already parsed!");
+    Path out = FileOutputFormat.getOutputPath(job);
+    if ((out == null) && (job.getNumReduceTasks() != 0)) {
+      throw new InvalidJobConfException(
+          "Output directory not set in JobConf.");
+    }
+    if (fs == null) {
+      fs = out.getFileSystem(job);
+    }
+    if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
+      throw new IOException("Segment already parsed!");
   }
The second file to change is src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java:
 import org.apache.hadoop.io.SequenceFile.CompressionType;
 import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.mapred.InvalidJobConfException;
 import org.apache.hadoop.mapred.OutputFormat;
 import org.apache.hadoop.mapred.RecordWriter;
 import org.apache.hadoop.mapred.JobConf;
@@ -46,8 +47,15 @@
   public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
     Path out = FileOutputFormat.getOutputPath(job);
+    if ((out == null) && (job.getNumReduceTasks() != 0)) {
+      throw new InvalidJobConfException(
+          "Output directory not set in JobConf.");
+    }
+    if (fs == null) {
+      fs = out.getFileSystem(job);
+    }
     if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME)))
-      throw new IOException("Segment already fetched!");
+      throw new IOException("Segment already fetched!");
   }
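With both files changed, the recompile step mentioned earlier can be sketched as follows. This assumes Apache Ant is installed and on the PATH; the function name rebuild_nutch is made up for this example, and the source directory is passed in as an argument.

```shell
# Hypothetical helper: rebuild Nutch from a source checkout with Ant.
# Assumes the ant command is on the PATH.
rebuild_nutch() {
    nutch_src="$1"
    if [ ! -f "$nutch_src/build.xml" ]; then
        echo "no build.xml found in $nutch_src" >&2
        return 1
    fi
    # Run the build in a subshell so the caller's working directory is unchanged.
    ( cd "$nutch_src" && ant )
}
```

With the checkout location from step 2, the call would be: rebuild_nutch ~/nutch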
7. Problem solved!