Configuring Nutch 1.0 on Windows

1. Software to install
(1) JDK 1.6
(2) Cygwin
(3) Nutch 1.0
(4) Tomcat 6.0
2. Installation
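1. Installing JDK 1.6 is the same as setting up to write Java code in a plain text editor: the environment variables have to be configured.
    PATH, JAVA_HOME and CLASSPATH all need to be set.
    My settings are as follows: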
    JAVA_HOME=C:\Program Files\Java\jdk1.6.0_06
    Path=;%JAVA_HOME%\bin;
    CLASSPATH=.;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar
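    Note: be sure to use JDK 1.6. Nutch 1.0 appears to have been built for Java 1.6 (this is my own guess), because running it under 1.5 fails with a class version mismatch error: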

-----------------------------------------------------------------------------------------------------------------------------------

java.lang.UnsupportedClassVersionError: Bad version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
Exception in thread "main"

-----------------------------------------------------------------------------------------------------------------------------------
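The error above means the class files were compiled for a newer JVM than the one running them (Java 6 writes class-file major version 50, Java 5 only understands up to 49). A minimal sketch for checking a version string before running Nutch follows; the helper name `java_at_least_6` is hypothetical and not part of Nutch or the JDK, and it only handles plain numeric version strings.

```sh
#!/bin/sh
# Hypothetical helper (not part of Nutch or the JDK):
# succeeds when a version string such as "1.6.0_06" denotes Java 6 or newer.
java_at_least_6() {
  major=$(printf '%s' "$1" | cut -d. -f1)
  minor=$(printf '%s' "$1" | cut -d. -f2)
  # Newer JDKs report "9", "11.0.2", ... instead of the old "1.x" scheme.
  if [ "$major" -gt 1 ]; then
    return 0
  fi
  [ "$major" -eq 1 ] && [ "$minor" -ge 6 ]
}

# Demonstration with two literal version strings.
for v in 1.5.0_22 1.6.0_06; do
  if java_at_least_6 "$v"; then
    echo "$v: new enough for Nutch 1.0"
  else
    echo "$v: too old for Nutch 1.0"
  fi
done
```

The version string to pass in is the quoted value that `java -version` prints (it writes to standard error, so redirect with `2>&1` if you want to capture it).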
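3. Installing Cygwin
Cygwin provides a Unix-like environment on Windows, which is what lets you run Nutch's Linux shell scripts.
For the installation procedure, see: http://www.wangchao.net.cn/bbsdetail_1759714.html
I did not actually follow that guide; I simply chose "Install from Internet". Installing Cygwin should not cause any trouble.
4. Installing Tomcat 6.0
Set the TOMCAT_HOME environment variable to C:\Program Files\Apache Software Foundation\Tomcat 6.0
5. Installing Nutch 1.0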
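(1) Download the Nutch package.

(2) Put the package nutch-1.0.tar.gz in the root of the Cygwin installation directory.

          Open the Cygwin shortcut, change to the root directory, and run ls (or dir); you should see nutch-1.0.tar.gz.

(3) Run tar zxvf nutch-1.0.tar.gz to unpack it. This creates a nutch-1.0 folder in the root directory.

(4) Rename that folder: mv nutch-1.0 nutch

(5) In the nutch directory, create a urls directory, then create a file named url (no extension) inside it, and write the URL you want to crawl into that file, for example: http://www.jlu.edu.cn/    (the trailing / must not be omitted)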

(6) Open the file nutch\conf\crawl-urlfilter.txt and find:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
Change it to:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*jlu.edu.cn/
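(7) Edit nutch/conf/nutch-default.xml
Change the corresponding properties in the file to look like the listing below. You can of course adapt the values to your own situation. http.agent.name MUST NOT be empty, so just put something there. http.robots.agents asks you to "put the value of http.agent.name as the first agent name, and keep the default * at the end of the list". http.agent.url is only advertised, so any URL will do. http.agent.email is described as "An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming"; again, any value will do.
Some people say you can make the same changes in nutch/conf/nutch-site.xml instead and Nutch will be configured just as well. I have not tried it, but it should work: as the file names suggest, nutch-default.xml holds the defaults and nutch-site.xml holds site-specific overrides, so editing the site file is the more flexible approach.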

----------------------------------------nutch-default.xml contents---------------------------------------------------------------

<property>
<name>http.agent.name</name>
<value>guoliqiang</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.robots.agents</name>
<value>guoliqiang,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>

<property>
<name>http.robots.403.allow</name>
<value>true</value>
<description>Some servers return HTTP status 403 (Forbidden) if
/robots.txt doesn't exist. This should probably mean that we are
allowed to crawl the site nonetheless. If this is set to false,
then such sites will be treated as forbidden.</description>
</property>

<property>
<name>http.agent.description</name>
<value>jlu</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value>www.baidu.com</value>
<description>A URL to advertise in the User-Agent header. This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value>[email protected]</value>
<description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

-----------------------------------------------------------------------------------------------------------------------------------
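Steps (5) and (6) above can be sketched as a short Cygwin shell script: it writes the seed file and then checks whether each seed URL matches the accept pattern from crawl-urlfilter.txt. This is only a sanity check under stated assumptions: the `NUTCH_HOME` variable and the `matches_filter` helper are made up for this example, and `grep -E` merely approximates the Java regular expressions Nutch itself uses (the two agree for a simple pattern like this one).

```sh
#!/bin/sh
# Assumption: NUTCH_HOME points at the unpacked nutch directory (defaults to ./nutch).
NUTCH_HOME="${NUTCH_HOME:-./nutch}"
SEED_URL='http://www.jlu.edu.cn/'
# Same expression as the edited crawl-urlfilter.txt line, without the leading '+'.
FILTER='^http://([a-z0-9]*\.)*jlu.edu.cn/'

# Step (5): create urls/url (no extension) containing the seed URL.
mkdir -p "$NUTCH_HOME/urls"
printf '%s\n' "$SEED_URL" > "$NUTCH_HOME/urls/url"

# Hypothetical helper: succeeds when the URL matches the accept pattern.
matches_filter() {
  printf '%s\n' "$1" | grep -Eq "$FILTER"
}

# Step (6) sanity check: report every seed URL as accepted or rejected.
while IFS= read -r url; do
  if matches_filter "$url"; then
    echo "accepted: $url"
  else
    echo "rejected: $url"
  fi
done < "$NUTCH_HOME/urls/url"
```

A seed URL reported as rejected would be filtered out before fetching, so the crawl would come back empty.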
6. Crawling with Nutch

Change into the nutch directory:

$ sh ./bin/nutch crawl urls -dir mydir -depth 2 -threads 4 -topN 50 >& ./log.txt

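crawl: tells the nutch script to run the main method of the Crawl class.

urls: the directory that holds the url file listing the URLs to crawl.

-dir mydir: where the crawl output is stored.

-depth 2: the number of crawl rounds, also called the depth, although "rounds" seems the more fitting word to me. For testing I suggest setting it to 1.

-threads 4: the number of concurrent fetcher threads, set to 4 here.
-topN 50: the maximum number of pages fetched in each round.
>& ./log.txt: redirects the output to log.txt; if an error occurs you can find it there.

Note: the mydir directory must not already exist when you start the crawl, otherwise the crawl fails.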
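Because the crawl fails when the output directory already exists, it can help to guard the command. The following is a minimal sketch, not something shipped with Nutch: the `ensure_fresh_dir` helper and the `CRAWL_DIR` variable are assumptions made for this example, and the crawl line is the same command as above with the redirection written in portable `> file 2>&1` form.

```sh
#!/bin/sh
# Hypothetical helper: fail early if the crawl output directory already exists.
ensure_fresh_dir() {
  if [ -e "$1" ]; then
    echo "error: '$1' already exists; remove or rename it before crawling" >&2
    return 1
  fi
  return 0
}

# Assumption: the output directory name can be overridden through CRAWL_DIR.
CRAWL_DIR="${CRAWL_DIR:-mydir}"

# Only attempt the crawl when run from the nutch directory (bin/nutch present)
# and the output directory does not exist yet.
if [ -f ./bin/nutch ] && ensure_fresh_dir "$CRAWL_DIR"; then
  sh ./bin/nutch crawl urls -dir "$CRAWL_DIR" -depth 2 -threads 4 -topN 50 > ./log.txt 2>&1
fi
```

Run from anywhere else, the script does nothing, which makes it safe to try before the real crawl.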

7. Configuring Tomcat

(1) Rename nutch-1.0.war to nutch.war and copy it into the Tomcat 6.0\webapps directory.
(2) Start Tomcat, wait until nutch.war has been unpacked, then open nutch\WEB-INF\classes\nutch-site.xml
and change it to:
<nutch-conf><property><name>searcher.dir</name> <value> C:\cygwin\nutch\mydir </value></property></nutch-conf>
(3) In Tomcat 6.0\webapps\nutch\zh\include create a new file header.jsp. Its content is a copy of header.html, but with
the following line added at the top:
<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%>
In Tomcat 6.0\webapps\nutch\search.jsp, find the line that includes the header and change it to:
<jsp:include page="<%= language + "/include/header.jsp"%>"/>
While you are there, comment out the following JavaScript statement:
function queryfocus() {
//search.query.focus();
}

(4) In Tomcat 6.0\conf\server.xml, find the following element and change it to:
<Connector port="8080" maxThreads="150" minSpareThreads="25" maxSpareThreads="75" enableLookups="false" redirectPort="8443" acceptCount="100" debug="0" connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8" useBodyEncodingForURI="true" />
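(5) Restart Tomcat and visit http://localhost:8080/nutch/ to see the search home page. Search supports Chinese and word segmentation.
Alternatively, put the contents of the nutch directory into webapps/ROOT and access it directly at http://localhost:8080/.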

 

Note: when starting Tomcat you may run into the following error:

-----------------------------------------------------------------------------------------------------------------------------------
2009-04-09 17:09:02,984 INFO NutchBean - creating new bean
2009-04-09 17:09:03,093 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:89)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,125 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:50)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,125 INFO SearchBean - opening merged index in C:/cygwin/nutch/mydir/index
2009-04-09 17:09:03,140 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:70)
at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:58)
at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:51)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,156 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:191)
at org.apache.nutch.searcher.IndexSearcher.getDirectory(IndexSearcher.java:84)
at org.apache.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:71)
at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:58)
at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:51)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,203 INFO PluginRepository - Plugins: looking in: C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\nutch\WEB-INF\classes\plugins
2009-04-09 17:09:03,312 INFO PluginRepository - Plugin Auto-activation mode: [true]
2009-04-09 17:09:03,312 INFO PluginRepository - Registered Plugins:
2009-04-09 17:09:03,312 INFO PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-04-09 17:09:03,312 INFO PluginRepository - Basic Query Filter (query-basic)
2009-04-09 17:09:03,312 INFO PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-04-09 17:09:03,312 INFO PluginRepository - Html Parse Plug-in (parse-html)
2009-04-09 17:09:03,312 INFO PluginRepository - Basic Indexing Filter (index-basic)
2009-04-09 17:09:03,312 INFO PluginRepository - Site Query Filter (query-site)
2009-04-09 17:09:03,312 INFO PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-04-09 17:09:03,312 INFO PluginRepository - HTTP Framework (lib-http)
2009-04-09 17:09:03,312 INFO PluginRepository - Text Parse Plug-in (parse-text)
2009-04-09 17:09:03,312 INFO PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-04-09 17:09:03,312 INFO PluginRepository - Regex URL Filter (urlfilter-regex)
2009-04-09 17:09:03,312 INFO PluginRepository - Http Protocol Plug-in (protocol-http)
2009-04-09 17:09:03,312 INFO PluginRepository - XML Response Writer Plug-in (response-xml)
2009-04-09 17:09:03,312 INFO PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-04-09 17:09:03,312 INFO PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-04-09 17:09:03,312 INFO PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-04-09 17:09:03,312 INFO PluginRepository - Anchor Indexing Filter (index-anchor)
2009-04-09 17:09:03,312 INFO PluginRepository - JavaScript Parser (parse-js)
2009-04-09 17:09:03,312 INFO PluginRepository - URL Query Filter (query-url)
2009-04-09 17:09:03,312 INFO PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2009-04-09 17:09:03,312 INFO PluginRepository - JSON Response Writer Plug-in (response-json)
2009-04-09 17:09:03,312 INFO PluginRepository - Registered Extension-Points:
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-04-09 17:09:03,343 INFO Configuration - found resource common-terms.utf8 at file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/webapps/nutch/WEB-INF/classes/common-terms.utf8
2009-04-09 17:09:03,359 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.FetchedSegments.<init>(FetchedSegments.java:204)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:110)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,375 INFO SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2009-04-09 17:09:03,375 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.crawl.LinkDbReader.init(LinkDbReader.java:59)
at org.apache.nutch.crawl.LinkDbReader.<init>(LinkDbReader.java:55)
at org.apache.nutch.searcher.LinkDbInlinks.<init>(LinkDbInlinks.java:42)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:113)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)

-----------------------------------------------------------------------------------------------------------------------------------
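It means that the whoami command could not be run, because you are on Windows rather than Linux. The fix is to use Cygwin: add C:\cygwin\bin to the PATH environment variable, then restart Tomcat.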
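Whether the Cygwin directory is really in the Windows PATH can be checked with a small sketch like the one below. The helper `win_path_contains` is hypothetical; it compares entries literally and case-insensitively, so an entry written with a trailing backslash would not match. Note also that Tomcat sees the Windows PATH it was started with, not the PATH inside a Cygwin shell, so pass in the value as shown in the Windows environment variable dialog.

```sh
#!/bin/sh
# Hypothetical helper: succeeds if a ';'-separated Windows PATH string
# contains the given directory as one of its entries (case-insensitive).
win_path_contains() {
  printf '%s\n' "$1" | tr ';' '\n' | grep -Fxiq -- "$2"
}

# Example with a literal PATH value copied from the Windows settings dialog.
SAMPLE_PATH='C:\Windows\system32;C:\Program Files\Java\jdk1.6.0_06\bin;C:\cygwin\bin'

if win_path_contains "$SAMPLE_PATH" 'C:\cygwin\bin'; then
  echo "C:\\cygwin\\bin is on the PATH"
else
  echo "C:\\cygwin\\bin is missing from the PATH"
fi
```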

I also tried nutch-0.9 and hit a very troublesome error, so I switched straight to 1.0:

---------------------------------------------------------------------------------------------------------------------------------
2007-06-09 12:37:28,187 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2007-06-09 12:37:28,281 INFO indexer.Indexer - Optimizing index.
2007-06-09 12:37:28,421 INFO indexer.Indexer - Indexer: done
2007-06-09 12:37:28,421 INFO indexer.DeleteDuplicates - Dedup: starting
2007-06-09 12:37:28,453 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: mydir/indexes
2007-06-09 12:37:28,750 WARN mapred.LocalJobRunner - job_hlqfpx
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

-----------------------------------------------------------------------------------------------------------------------------------
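For the cause, see: http://blog.sina.com.cn/s/blog_537c07f6010009t9.html
I have not tried the fix myself, but it is the most authoritative explanation I have seen. The claims that this is a configuration problem are simply nonsense.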
