Migrating Solr Data Online with Python: A Walkthrough

Original

2019-09-08 15:13

This article walks through migrating Solr data online with Python, using example code throughout. It may serve as a reference for your own study or work.

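Preface

While working on a project, I needed to move the data in one live collection into another collection. A search on Baidu turned up many articles, but most of them relied on import-based approaches, and I could not find a method for migrating live data. So I wrote a Python script, which I am sharing here.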

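Approach: the collection holds a lot of data, too much to process in a single operation, so the work is done in batches.

First, query the data in batches of 1000 documents and convert each batch to JSON.

Then send the processed JSON data to the destination collection.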
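The batching idea can be sketched as a list of start offsets, one per request. This is only an illustration: the helper name batch_starts and the total of 2500 documents are made up for the example.

```python
def batch_starts(num_found, rows=1000):
    """Return the `start` offset of every page needed to cover num_found documents."""
    return list(range(0, num_found, rows))


# Hypothetical total of 2500 documents: three requests are needed.
print(batch_starts(2500))  # [0, 1000, 2000]
```

The last page simply returns fewer than 1000 documents, so no special handling is needed for a total that is not a multiple of the batch size.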

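Implementation:

1. Query the data through the HTTP interface

Query with a URL of the following form:

http://host:port/solr/collection_name/select?q=*:*&rows=1000&start=0

where:

collection_name is the name of the collection being queried

rows is the number of rows to return, set to 1000 here

start is the row to start from; the script below loops over the collection by increasing this parameter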
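As a small sketch, the query URL can be assembled from its parts with the standard library. The function name select_url is hypothetical; host, port and collection_name are the same placeholders used above.

```python
from urllib.parse import urlencode


def select_url(base_url, start, rows=1000):
    """Build the select URL for one page of results."""
    # safe="*:" keeps the match-all query readable as q=*:* instead of q=%2A%3A%2A
    query = urlencode({"q": "*:*", "rows": rows, "start": start}, safe="*:")
    return "%s/select?%s" % (base_url, query)


print(select_url("http://host:port/solr/collection_name", 0))
# http://host:port/solr/collection_name/select?q=*:*&rows=1000&start=0
```

Only start changes from one request to the next; q and rows stay fixed.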

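The query returns a JSON response, in which:

Inside response, two keys are needed: numFound (the total number of documents) and docs (the list holding all of the returned documents as JSON objects)

Inside docs, every document carries a _version_ key, which has to be removed before the document is submitted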
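To make the response structure concrete, here is a minimal sketch that parses a response and strips _version_. The sample text is invented for illustration (real responses carry more fields, and the id and title fields are assumptions); extract_docs is a hypothetical helper name.

```python
import json

# Made-up sample response, trimmed to the keys discussed above.
sample = (
    '{"response": {"numFound": 2, "start": 0, "docs": ['
    '{"id": "1", "title": "a", "_version_": 1644000000000000000}, '
    '{"id": "2", "_version_": 1644000000000000001}]}}'
)


def extract_docs(response_text):
    """Return (numFound, docs) with the _version_ key removed from every doc."""
    body = json.loads(response_text)["response"]
    docs = [
        {key: value for key, value in doc.items() if key != "_version_"}
        for doc in body["docs"]
    ]
    return body["numFound"], docs


num_found, docs = extract_docs(sample)
print(num_found, docs)
```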

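2. Submit the data through the HTTP interface

Send a POST request to the update handler of the destination collection:

http://host:port/solr/collection_name/update?wt=json

wt=json asks Solr to return its response as JSON

The request header must be set to {"Content-Type": "application/json"}

Request parameters: overwrite tells Solr to replace a document that already exists in the index, and commit tells it to commit the change right away. (These parameters can also be appended directly to the URL.)

{"overwrite":"true","commit":"true"}

data_dict is one document from docs after processing

Request body: data={"add":{ "doc":data_dict}}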
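Putting these pieces together, one update request could be assembled as follows. This is a sketch that only builds the request parts and sends nothing; build_update_request is a hypothetical helper, and the sample document is made up.

```python
import json


def build_update_request(des_url, data_dict):
    """Return the URL, headers, query parameters and JSON body for one add request."""
    url = "%s/update?wt=json" % des_url
    headers = {"Content-Type": "application/json"}
    params = {"overwrite": "true", "commit": "true"}
    body = json.dumps({"add": {"doc": data_dict}})
    return url, headers, params, body


url, headers, params, body = build_update_request(
    "http://host:port/solr/collection_name", {"id": "1", "title": "a"}
)
print(url)
print(body)
```

With the requests library these parts would be passed along as requests.post(url, data=body, params=params, headers=headers).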

3. The full script:

# coding=utf-8
import json
import threading
import time

import requests


# Send one document to the destination URL des_url. data_dict is a single
# document (a dict) whose _version_ key has already been removed.
def send_data(des_url, data_dict):
    data = {"add": {"doc": data_dict}}
    headers = {"Content-Type": "application/json"}
    # overwrite replaces a document that already exists; commit commits right away
    params = {"overwrite": "true", "commit": "true"}
    url = "%s/update?wt=json" % des_url
    resp = requests.post(url, json=data, params=params, headers=headers)
    if resp.status_code != 200:
        print("Import failed:", data)


# Fetch the data in batches and call send_data to send it to the destination URL
def get_data(des_url, src_url):
    # Starting row
    start = 0
    # First get the total number of documents (rows=0 returns only the count)
    se_data = requests.get("%s/select?q=*:*&rows=0&start=%s" % (src_url, start)).text
    se_dict = json.loads(se_data)
    num_found = int(se_dict["response"]["numFound"])
    # Loop over the collection, 1000 documents per iteration
    while start < num_found:
        # Threads for this batch
        th_li = []
        # Fetch 1000 documents
        se_data = requests.get("%s/select?q=*:*&rows=1000&start=%s" % (src_url, start)).text
        # Parse the response into a dict
        se_dict = json.loads(se_data)
        # Take the docs list out of the response
        s_data = se_dict["response"]["docs"]

        # Remove the _version_ key from each document and create a thread
        # that calls send_data for it
        for i in s_data:
            i.pop("_version_", None)
            th = threading.Thread(target=send_data, args=(des_url, i))
            th_li.append(th)

        # Start every thread first, then wait for all of them. Calling join()
        # right after start() would run the threads one at a time.
        for t in th_li:
            t.start()
        for t in th_li:
            t.join()

        start += 1000
        print(start)


if __name__ == "__main__":
    # Source: address of the collection to read from
    src_url = "http://ip:port/solr/src_connection"
    # Destination: address of the collection to import into
    des_url = "http://ip:port/solr/des_connection"
    start_time = time.time()
    get_data(des_url, src_url)
    end_time = time.time()
    print("Elapsed:", end_time - start_time, "seconds")

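Notes:

1. If the two collections are not on the same network and the data cannot be transferred online, write the documents to a file inside the for loop once their _version_ key has been removed, copy the file to a server on the destination network, and then read the file in a loop to upload it. The file can be written as shown below (adapt this to your own preferences), but after reading it back, every line has to be converted into a dict before uploading: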

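import json

with open("solr.json", "a+", encoding="utf-8") as file:
    for i in s_data:
        i.pop("_version_", None)
        # One JSON object per line, so json.loads can restore each dict later
        file.write(json.dumps(i, ensure_ascii=False) + "\n")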
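The round trip through a file can be sketched as below, assuming each document is stored as one JSON object per line (written with json.dumps rather than str, so that json.loads can parse it back). The helper names dump_docs and load_docs and the sample documents are hypothetical.

```python
import json
import os
import tempfile


def dump_docs(docs, path):
    """Append documents to path, one JSON object per line, without _version_."""
    with open(path, "a+", encoding="utf-8") as f:
        for doc in docs:
            doc = {k: v for k, v in doc.items() if k != "_version_"}
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def load_docs(path):
    """Read the file back and turn every non-empty line into a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demonstration in a temporary directory with made-up documents.
path = os.path.join(tempfile.mkdtemp(), "solr.json")
dump_docs([{"id": "1", "_version_": 123}, {"id": "2", "title": "b"}], path)
print(load_docs(path))
```

On the destination side, each dict returned by load_docs can then be passed to the upload function in the same way as a document fetched online.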

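2. To clear the data, the following method can be used. In my own testing it was one of the more convenient ones.

Open the collection you want to clear in the Solr web interface

Select Documents

Set Document Type to XML

Paste the following into the document box, then click the Submit Document button

<delete><query>*:*</query></delete>
<commit/>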
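The same delete-all request can also be sent from a script instead of the web interface. The sketch below only builds the request and sends nothing. It assumes the update handler's JSON delete-by-query form, with commit=true on the URL playing the role of the commit element; build_delete_all_request is a hypothetical helper name.

```python
import json


def build_delete_all_request(collection_url):
    """Return the URL, headers and JSON body that delete every document."""
    url = "%s/update?commit=true" % collection_url
    headers = {"Content-Type": "application/json"}
    # *:* matches every document in the collection
    body = json.dumps({"delete": {"query": "*:*"}})
    return url, headers, body


url, headers, body = build_delete_all_request("http://host:port/solr/collection_name")
print(url)
print(body)
```

Because this removes every document, it is worth double-checking the collection name in the URL before posting it.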
