1.elasticsearch-5.6.12
2.elasticsearch-head (plugin for browsing indices)
3.fscrawler-es5-2.6
For installation and startup, see: https://blog.csdn.net/fulq1234/article/details/96485228
Documentation: https://fscrawler.readthedocs.io/en/fscrawler-2.6/user/rest.html
A Chinese translation of the documentation: https://www.helplib.com/GitHub/article_94667
1.Start FSCrawler with the REST service:
E:\soft\fscrawler-es5-2.6\bin>fscrawler test --loop 0 --rest
10:03:20,487 INFO [f.p.e.c.f.c.v.ElasticsearchClientV5] Elasticsearch Client for version 5.x connected to a node running version 5.6.12
10:03:20,528 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
10:03:21,038 WARN [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi will be ignored.
10:03:21,039 WARN [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi will be ignored.
10:03:21,250 INFO [f.p.e.c.f.r.RestServer] FS crawler Rest service started on [http://127.0.0.1:8080/fscrawler]
Check the service is working with:
curl http://127.0.0.1:8080/fscrawler/
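If the REST service is up, the call above returns a small JSON status document. Per the FSCrawler 2.6 docs it looks roughly like the following (the exact fields and values here are illustrative, not captured output):

```json
{
  "ok" : true,
  "version" : "2.6",
  "elasticsearch" : "5.6.12"
}
```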
You can force a specific document id by adding id=YOUR_ID to the form data:
curl -F "[email protected]" -F "id=my-test" "http://127.0.0.1:8080/fscrawler/_upload"
2.Upload a file
E:\soft\fscrawler-es5-2.6\bin>curl -F "file=@C:\tmp\folderA\subfolderA\ni.pdf" "http://127.0.0.1:8080/fscrawler/_upload"
{"ok":true,"filename":"ni.pdf","url":"http://127.0.0.1:9200/test/doc/bfce7866cb7abdd1232cc7604f60e3"}
E:\soft\fscrawler-es5-2.6\bin>curl -F "file=@C:\Users\lunmei\Desktop\bx\光大永明達爾文.pdf" "http://127.0.0.1:8080/fscrawler/_upload"
{"ok":true,"filename":"錕斤拷錕斤拷錕斤拷錕斤拷錕斤拷錕斤拷.pdf","url":"http://127.0.0.1:9200/test/doc/31749656282e2e9afcc4368d58f9331b"}
After a successful upload, the documents are visible in elasticsearch-head.
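The garbled filename in the second response is classic mojibake: the UTF-8 bytes of Unicode replacement characters (U+FFFD) re-decoded as GBK. Exactly where in the curl/FSCrawler chain the Chinese filename was first lost is an assumption here, but the byte-level mechanism can be sketched in a few lines of Python:

```python
# Two U+FFFD replacement characters encoded as UTF-8 give the bytes
# EF BF BD EF BF BD; decoding those same six bytes as GBK (two bytes
# per character) yields the familiar three-character mojibake string.
bad_bytes = ("\ufffd" * 2).encode("utf-8")   # b'\xef\xbf\xbd\xef\xbf\xbd'
print(bad_bytes.decode("gbk"))               # prints 锟斤拷
```

(The console transcript above shows the traditional-character rendering of the same characters, 錕斤拷.) Uploading from a UTF-8 console, or forcing an explicit id as shown earlier, sidesteps the problem.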
3.Create a new crawler job
E:\soft\fscrawler-es5-2.6\bin>fscrawler lm
13:06:22,092 WARN [f.p.e.c.f.c.FsCrawlerCli] job [lm] does not exist
13:06:22,094 INFO [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
y
13:06:24,744 INFO [f.p.e.c.f.c.FsCrawlerCli] Settings have been created in [C:\Users\lunmei\.fscrawler\lm\_settings.json]. Please review and edit before relaunch
Edit the configuration file:
Setting | Description
---|---
url | Root directory to crawl; required.
includes | Include patterns; applied to file names, not directory names.
filename_as_id | Use the file name as the document _id.
continue_on_error | By default the crawler stops indexing immediately when it hits a permission-denied exception; set this if you want to skip such files and continue with the rest of the directory tree.
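Putting these settings together, the generated C:\Users\lunmei\.fscrawler\lm\_settings.json can be edited into something like the sketch below (layout follows the FSCrawler 2.6 settings format; the include pattern and flag values are illustrative, while the url and update rate match the startup log further down):

```json
{
  "name" : "lm",
  "fs" : {
    "url" : "C:\\tmp\\tmp\\es",
    "update_rate" : "15m",
    "includes" : [ "*.pdf" ],
    "filename_as_id" : true,
    "continue_on_error" : true
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ]
  }
}
```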
Then restart:
E:\soft\fscrawler-es5-2.6\bin>fscrawler lm
13:09:15,268 INFO [f.p.e.c.f.c.v.ElasticsearchClientV5] Elasticsearch Client for version 5.x connected to a node running version 5.6.12
13:09:15,309 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
13:09:15,310 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
13:09:15,324 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [lm] for [C:\tmp\tmp\es] every [15m]
13:09:15,489 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Success.