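Reposted from http://blog.csdn.net/songzhen640/archive/2008/07/16/2662443.aspx
Running Heritrix as a CrawlController (in the background): a code implementation
My understanding may not be entirely correct, but this worked for me.
Inside the job folder you can edit order.xml and also write a seeds.txt; the crawled content ends up in that same folder.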
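The seeds file mentioned above can also be generated from code. The following is a minimal sketch, assuming the usual layout of one seed URL per line and that order.xml points at a file named seeds.txt in the job folder. The class name SeedsWriter, the temporary job folder, and the example URL are all hypothetical and are not part of Heritrix.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class SeedsWriter {

    /**
     * Writes one seed URL per line to seeds.txt inside jobDir
     * (creating the folder if needed) and returns the path of the file.
     */
    public static Path writeSeeds(Path jobDir, List<String> seedUrls) throws IOException {
        Files.createDirectories(jobDir);
        Path seeds = jobDir.resolve("seeds.txt");
        return Files.write(seeds, seedUrls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical job folder and seed URL, for illustration only.
        Path jobDir = Files.createTempDirectory("heritrix-job");
        Path seeds = writeSeeds(jobDir, Arrays.asList("http://example.com/"));
        System.out.println("Wrote " + seeds);
    }
}
```

In a real setup, jobDir would be the folder that already holds order.xml, so that both files sit side by side.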
package main;

import java.io.File;

import javax.management.InvalidAttributeValueException;

import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.framework.exceptions.InitializationException;
import org.archive.crawler.settings.XMLSettingsHandler;

public class RunMain
{
    public static void main(String[] args)
            throws InvalidAttributeValueException, InitializationException
    {
        // Wrap order.xml in a File instance
        File order = new File(
                "F:/web/eclipse/workspace/heritrix/jobs/car/order.xml");

        // Load the crawl order (the job settings) from order.xml
        XMLSettingsHandler xml = new XMLSettingsHandler(order);
        xml.initialize();

        // Create the controller and initialize it with those settings
        CrawlController crawl = new CrawlController();
        crawl.initialize(xml);

        // Start the crawl
        crawl.requestCrawlStart();
    }
}
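Because the listing hard-codes the location of order.xml, a wrong path only shows up once Heritrix tries to load the settings. As a hedged sketch (not part of the original post), the job folder could instead come from the command line and be checked before anything is handed to XMLSettingsHandler. The class JobFolderCheck and its method requireOrderFile are hypothetical helper names, and the code uses only java.io.File.

```java
import java.io.File;

public class JobFolderCheck {

    /** Returns order.xml inside jobDir, or throws if that file does not exist. */
    public static File requireOrderFile(File jobDir) {
        File order = new File(jobDir, "order.xml");
        if (!order.isFile()) {
            throw new IllegalArgumentException(
                    "order.xml not found in " + jobDir.getAbsolutePath());
        }
        return order;
    }

    public static void main(String[] args) {
        // Job folder from the first argument; falls back to the current directory.
        File jobDir = new File(args.length > 0 ? args[0] : ".");
        try {
            File order = requireOrderFile(jobDir);
            // At this point the file could be passed to new XMLSettingsHandler(order).
            System.out.println("Would load crawl order from " + order.getAbsolutePath());
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```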