Baifendian Cognitive Intelligence Lab: Information Extraction in Practice with Incompletely Annotated Datasets

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"信息抽取是從文本數據中抽取特定信息的一種技術,命名實體識別(Named Entity Recognition, NER)是信息抽取的基礎任務之一,其目標是抽取文本中具有基本語義的實體單元,在知識圖譜構建、信息抽取、信息檢索、機器翻譯、智能問答等系統中都有廣泛應用。基於監督學習的NER系統通常需要大規模的細粒度、高精度標註數據集,一旦數據標註質量下降,模型的表現也會急劇下降。利用不完全標註的數據進行NER系統的建立,越來越受到專家學者們的關注。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第九屆國際自然語言處理與中文計算會議(NLPCC 2020)針對此業界難題,推出技術評測任務:Auto Information Extraction(AutoIE),即信息抽取系統的自動構建,任務旨在通過利用只有少量不完全標註的數據集來完成NER抽取系統的構建。本文將主要介紹本次比賽過程中使用的主體技術方案以及對應的評測結果。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"背景介紹"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"得益於互聯網發展和數字化進程,信息的豐富程度呈指數級爆炸增長,但同時也讓我們陷入無法快速找到所需信息的困境中,信息抽取技術應運而生。信息抽取(Information Extraction,IE)就是指從自然語言文本中,抽取出特定信息,以及信息之間的相互關係,幫助我們將海量內容自動分類、提取和重構。這些特定信息通常包括實體(entity)、關係(relation)、事件(event)。例如從新聞中抽取時間、地點、關鍵人物,或者從技術文檔中抽取產品名稱、開發時間、性能指標等。能從自然語言中抽取用戶感興趣的事實信息,無論是在知識圖譜、信息檢索、問答系統還是在情感分析、文本挖掘中,信息抽取都有廣泛應用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前,信息抽取的主流方案是靠數據驅動的機器學習方法,即在有監督、有足夠多標註數據的場景下訓練出適用的機器模型完成信息抽取。而信息抽取系統一般都是針對某一特定領域量身定做,根據業務需求人工標註相關數據集以供訓練模型使用,例如從經濟新聞中抽取新發行股票的相關信息,包括 “股票名稱”、“股票價格”、“上市公司”、“募資金額”等等,就需要有大量已經標註好,包含上述信息的模板新聞進行訓練,而“標註”這個過程需要純人工來完成。也就是說,構建某一特定領域的信息抽取系統很大程度依賴於人工標註足夠多的數據,這無疑使得信息抽取技術的人工成本急劇擴大,實施週期也隨之拉長。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"怎麼減少模型對標註數據的依賴,如何自動化構建模型所需的數據集,以及對於不完全標註的數據集怎樣利用等問題成爲了攻克信息抽取難題的關鍵所在。本次比賽我們針對此類問題,構建了針對目標實體類型的信息抽取系統。本系統大大減少了模型對人工標註數據的依賴,符合業界實際需求。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"任務場景描述"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於基於有監督學習的命名實體識別(NamedEntity Recognition, NER)的信息抽取系統,解決命名實體識別的領域自適應問題十分關鍵,而能夠獲取到目標領域的人工標註數據是最爲理想的解決方法。爲此,常用的方法包括使用半監督的方法,如Bootstrapping 學習框架;選用更爲通用的、領域無關的特徵來訓練模型;模型融合等。這些方法最終的目的都是想要在模型訓練過程中,讓模型學習到更多的目標領域的特徵,從而提高模型在目標領域數據上的性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"學習目標領域特徵的方法有很多,其中,一種較爲直接的方法是使用目標領域的不完全標註數據。在解決領域自適應問題時,我們通常擁有大量的目標領域未標註的數據,同時,還有其中一些數據的不完整的標註信息,這些不完整的標註數據其實也包含了目標領域的重要信息,因而如何利用這些不完整的標註信息也是一個非常值得研究的工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本次NLPCC-2020 
In the NLPCC 2020 AutoIE task, the organizers released a dataset of Youku video titles covering three entity types: TV shows, people, and series. The training set is an incompletely labeled corpus in which entities were tagged by string matching against a given entity list. A labeled sample is shown below:

![Figure 1](https://static001.geekbang.org/infoq/d1/d1e4f2ee96adcd690a602e5ef80d1825.png)

Figure 1. Sample from the Youku video title dataset

A sample with missed entity labels looks like this:

![Figure 2](https://static001.geekbang.org/infoq/cd/cd8adf0509c8936a510071557e2699fc.jpeg)

Figure 2. Example of incompletely annotated data
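This string-matching labeling is essentially distant supervision, and it is also where the missed labels come from: any entity absent from the list is silently left untagged. The sketch below illustrates the idea with BIO tags; it is our own illustration, not the organizers' actual labeling tool, and the title and entity list are invented toy data.

```python
def distant_label(text, entity_list):
    """Tag `text` in BIO format by exact string matching against an entity list."""
    labels = ["O"] * len(text)
    for entity, etype in entity_list:
        start = text.find(entity)
        while start != -1:
            labels[start] = "B-" + etype
            for i in range(start + 1, start + len(entity)):
                labels[i] = "I-" + etype
            start = text.find(entity, start + len(entity))
    return labels

# Toy list: the person "王凱" is deliberately missing from the list,
# so his name stays "O" in the output -- a missed (unlabeled) entity.
entity_list = [("琅琊榜", "TV"), ("胡歌", "PER")]
title = "琅琊榜胡歌王凱同台"
for ch, tag in zip(title, distant_label(title, entity_list)):
    print(ch, tag)
```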
## Mainstream Solutions to the Incomplete Annotation Problem

Current solutions to the "unlabeled entity problem" fall roughly into the following categories:

① AutoNER + Fuzzy CRF: automatically mine phrases and project them back onto the training set [1];

② AutoNER + self-training: run multiple rounds of self-training over pseudo-labels to denoise automatically [2];

③ Positive-unlabeled (PU) learning: build a separate binary classifier for each label to reduce the impact of noisy data [3];

④ Partial CRF: extend the CRF so that training can bypass unlabeled entities [4].

Each of these has drawbacks. Approach ① depends on the quality of the distant supervision, so in essence the unlabeled entity problem remains; the multi-round iterative self-training in approach ② is computationally expensive; in approach ③, although the data is partitioned per label, unlabeled entities still contaminate the classifier for the corresponding entity type; and approach ④, while bypassing unlabeled entities, ignores the contribution of negative samples, so it only suits high-quality datasets with very few missed entities.

## Technical Approach

In this competition we used Classifier-stacking, Word-merging Representation, and Prediction Majority Voting (PMV), introduced one by one below.

In our approach, the Classifier-stacking algorithm serves as the base component: it cross-infers over the dataset to "repair" it. We also fuse several domain-specific pretrained word embeddings to make entity boundary recognition more precise, run comparative experiments across pretrained models to find the one best matched to the task, and finally push the models to their full potential with ensemble learning.

Compared with the four mainstream solutions above, our approach improves in several respects. First, Classifier-stacking moves the unlabeled entity problem from the data level to the algorithm level, reducing the model's dependence on high-quality datasets. Second, the targeted use of domain-specific pretrained word embeddings constrains entity boundaries and improves the completeness of extracted entities. Third, for this competition specifically, our experiments thoroughly compared how different pretrained models behave on the dataset, letting the algorithm perform at its best in this setting.

### 4.1 Constructing Incompletely Annotated Datasets

Incompletely annotated datasets can be constructed in roughly three ways:

① randomly remove some word-level annotations from a fully annotated corpus;

② randomly remove some span-level annotations from a fully annotated corpus;

③ randomly remove some span-level annotations and additionally drop all O labels.

Here word-level refers to an arbitrary multi-character fragment, while span-level refers to a complete entity span; see the examples in the figure below.

From a practical standpoint, method ③ best matches the real samples produced when annotators miss entities. First, most annotation omissions happen at the entity level rather than the character level, so method ① is inappropriate. Second, in real annotation every token not marked as an entity is uniformly treated as an O label, so O labels and missed entities cannot be told apart, which rules out method ② as well.

Data samples are shown below, where A.1, A.2, and A.3 correspond to the three construction methods above:

![Figure 3](https://static001.geekbang.org/infoq/ce/ce17c760115f63e99872497398d7e933.png)

Figure 3. Methods for constructing incompletely annotated data
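To make method ③ concrete, here is a minimal sketch, under the assumption that the fully annotated corpus uses BIO tags, of randomly dropping complete entity spans. Once dropped, a span reverts to O and becomes indistinguishable from the genuine O background, exactly as in real missed-label scenarios. This is an illustration of the construction idea, not the competition's actual preprocessing script.

```python
import random

def drop_spans(labels, keep_prob=0.7, seed=0):
    """Randomly remove complete entity spans (method 3): a dropped span
    reverts to 'O', merging missed entities into the O background."""
    rng = random.Random(seed)
    out = list(labels)
    i = 0
    while i < len(out):
        if out[i].startswith("B-"):
            j = i + 1
            while j < len(out) and out[j].startswith("I-"):
                j += 1
            if rng.random() > keep_prob:      # drop the whole span
                out[i:j] = ["O"] * (j - i)
            i = j
        else:
            i += 1
    return out

full = ["B-TV", "I-TV", "I-TV", "O", "B-PER", "I-PER"]
print(drop_spans(full, keep_prob=0.5))
```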
### 4.2 The Classifier-stacking Algorithm

The training set is split in K-fold cross-validation style: taggers trained on complementary folds cross-infer on the held-out data to "repair" the dataset, the repaired training set is then used to train a final model, and the whole process is iterated until performance on the development set meets the bar.

![Figure 4](https://static001.geekbang.org/infoq/22/22360ac72d7f0d788c5e3c89647b01db.png)

Figure 4. Classifier-stacking pipeline

For the loss we build on the CRF loss: for an incompletely annotated sequence, every compatible complete label sequence is assigned a trainable weight matrix q, as shown below:

![Figure 5](https://static001.geekbang.org/infoq/70/700d8592042f8e3639d0ab28358fe066.png)

Figure 5. Different loss constructions

Compared with the native CRF loss, and with a Uniform loss that spreads weight evenly, trainable weights let the model pay a different amount of "attention" to each candidate label of every O-tagged token at each training iteration, so the data repair completes faster and more precisely.

The label weights of these loss variants are visualized below; color depth indicates how the weight is distributed.

![Figure 6](https://static001.geekbang.org/infoq/69/69e206ad2a40a4ea5ca624672b9894d2.png)

Figure 6. Visualization of trainable weights in the loss function
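The following skeleton shows one way the cross-inference loop could be organized. It is a structural sketch under stated assumptions: `train_tagger` and `predict` are hypothetical callables standing in for the actual CRF-based tagger with the weighted loss above, and scikit-learn's `KFold` supplies the splits.

```python
from sklearn.model_selection import KFold

def repair_dataset(sentences, labels, train_tagger, predict, k=5, rounds=3):
    """Sketch of Classifier-stacking cross-inference.

    train_tagger(sents, labs) -> model and predict(model, sent) -> label list
    are hypothetical placeholders for the real tagger."""
    labels = [list(seq) for seq in labels]
    for _ in range(rounds):  # in practice: iterate until the dev score is good enough
        kf = KFold(n_splits=k, shuffle=True, random_state=0)
        for train_idx, held_idx in kf.split(sentences):
            model = train_tagger([sentences[i] for i in train_idx],
                                 [labels[i] for i in train_idx])
            for i in held_idx:
                pred = predict(model, sentences[i])
                # "Repair": only overwrite O tokens; existing entity tags are trusted.
                labels[i] = [p if g == "O" else g
                             for g, p in zip(labels[i], pred)]
    # The final model is trained on the repaired training set.
    return train_tagger(sentences, labels), labels
```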
### 4.3 Applying Word-merging Representation

Pretrained word embeddings [5,6] are a standard component of many neural language models, and injecting lexicon information is an important lever for improving Chinese NER. Lexicon information reinforces entity boundaries, which is especially effective for entities with long spans; it also acts as a form of data augmentation, bringing clear gains for Chinese NER in small-sample settings.

For this competition we obtained pretrained vectors with different properties from [7]; our experiments used word vectors trained with Skip-Gram with Negative Sampling (SGNS), as listed in the table below. Concretely, the Transformer model's output H is passed through a lexicon fusion layer for word-level enhancement: we use a Chinese word segmenter together with the word-vector representations to derive word-level features for each sample, align those word features with the original character features and fuse them in, then feed the result into a linear layer that maps to label paths. Finally, a CRF learns the constraints between label paths to further improve prediction.

Table 1. Word2vec / Skip-Gram with Negative Sampling (SGNS)

![Table 1](https://static001.geekbang.org/infoq/d1/d1a561de6a60ec8ee31724a5f1d1ddaf.png)

Note: the dimension of the Chinese Word Vectors is 300.
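As a rough sketch of the alignment step (one plausible reading, not the exact fusion layer in our system), the code below segments a sentence, looks up each word's pretrained SGNS vector, broadcasts that vector onto the word's characters, and concatenates the result with the character-level features H. Here `jieba` stands in for the segmenter and `word_vecs` for a loaded embedding table; both are assumptions for illustration.

```python
import numpy as np
import jieba  # stand-in segmenter; any Chinese tokenizer would do

EMB_DIM = 300   # the SGNS vectors from [7] are 300-dimensional
word_vecs = {}  # assumed: word -> np.ndarray(300,), loaded from a pretrained SGNS file

def word_merged_features(sentence, char_feats):
    """Align word-level vectors to character positions and concatenate
    them with character features char_feats of shape [len(sentence), hidden]."""
    word_feats = np.zeros((len(sentence), EMB_DIM), dtype=np.float32)
    pos = 0
    for word in jieba.cut(sentence):
        vec = word_vecs.get(word, np.zeros(EMB_DIM, dtype=np.float32))
        word_feats[pos:pos + len(word)] = vec  # every char shares its word's vector
        pos += len(word)
    return np.concatenate([char_feats, word_feats], axis=-1)
```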
### 4.4 Applying Prediction Majority Voting (PMV)

At prediction time we use Prediction Majority Voting (PMV) to pick the best entities. We tried two ways of combining the outputs of multiple models. The first is simple: each of the k models assigns candidate labels to every token in the sentence, and across all k predictions the entities that win the most majority votes are chosen as the final output. The second averages the k models' predictions for each token to produce a single label sequence. Our experiments showed that in this task the first strategy yields higher precision on entity boundaries.
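A minimal sketch of the first strategy follows: decode each model's BIO output into (start, end, type) spans, then keep every span predicted by a majority of the models. "More than half" is our reading of the majority rule here; the competition system's tie-breaking may differ.

```python
from collections import Counter

def decode_spans(labels):
    """Decode a BIO sequence into (start, end, type) entity spans."""
    spans, i = [], 0
    while i < len(labels):
        if labels[i].startswith("B-"):
            j = i + 1
            while j < len(labels) and labels[j] == "I-" + labels[i][2:]:
                j += 1
            spans.append((i, j, labels[i][2:]))
            i = j
        else:
            i += 1
    return spans

def pmv(predictions, k):
    """Entity-level majority voting over k models' BIO predictions."""
    votes = Counter(span for pred in predictions for span in decode_spans(pred))
    return sorted(span for span, n in votes.items() if n > k / 2)

preds = [["B-PER", "I-PER", "O"],
         ["B-PER", "I-PER", "O"],
         ["O", "O", "O"]]
print(pmv(preds, k=3))  # [(0, 2, 'PER')] -- 2 of 3 models agree
```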
### 4.5 Comparing Pretrained Models

The table below compares the performance of different pretrained models in our experiments, as a reference for choosing a suitable one.

The results show that BERT-wwm performs worst, significantly below BERT-wwm-ext, which was pretrained on more data. This indicates that the amount of pretraining data directly affects entity extraction quality. On precision, recall, and F1, RoBERTa-wwm-ext is significantly better than all the other models.

Given the differences in architecture and training data among these pretrained models, the results suggest the following. First, pretraining on more data likely helps performance, which would explain why BERT-wwm-ext (pretrained on 5.4B tokens) outperforms BERT-wwm (pretrained on 0.4B tokens). Second, dropping the next sentence prediction (NSP) task and training for more steps (1M) give RoBERTa-wwm-ext its clear advantage, since both RoBERTa-wwm-ext and BERT-wwm-ext were trained on roughly 5.4 billion tokens of Wikipedia text plus extended data.

Table 2. Effect of the pretrained model

![Table 2](https://static001.geekbang.org/infoq/e4/e42375a62749e000334121610040d0e6.png)

To compare the models' robustness to training-set size, we further measured development-set performance as the training set varied from 2,000 to 10,000 samples. The overall trends are shown below. Shrinking the training set affects RoBERTa-wwm-ext the least, so in small-sample scenarios we prefer RoBERTa-wwm-ext as our pretrained model.

![Figure 7](https://static001.geekbang.org/infoq/f0/f05ecbc9af1840ff95e3330f05c02123.png)

Figure 7. Robustness of pretrained models to training-set size

## Evaluation Results

Based on an analysis of the competition dataset, we ran fusion experiments with word vectors trained on Weibo and Sogou News corpora; the results are shown in the table below. On the development set, the model using the Sogou News vectors performed better.

Table 3. Word-vector fusion experiments

![Table 3](https://static001.geekbang.org/infoq/e8/e8a6eb37bc05371af3b8eebecddd2f7f.png)

On the final test set we used k-fold cross-validation (k=10) and combined the 10 base models with the PMV voting strategy; our final submission scored an F1 of 84.75 on the NLPCC 2020 AutoIE leaderboard.

Table 4. Model ensembling experiments

![Table 4](https://static001.geekbang.org/infoq/0d/0d2b4f2151ea9b0651bd63c75a68da96.png)

## Conclusion

This competition was an attempt at the hard problem of NER over incompletely annotated datasets. On top of the Classifier-stacking technique, we fused domain-specific word-embedding representations with Prediction Majority Voting (PMV), producing an effective and easy-to-deploy solution for information extraction under incomplete annotation. The approach alleviates, to a degree, supervised models' dependence on high-quality labeled data, making information extraction easier to put into industrial practice.

## References

[1] Shang J., Liu L., Gu X., et al. Learning Named Entity Tagger using Domain-Specific Dictionary. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2018.

[2] Jie Z., Xie P., Lu W., et al. Better Modeling of Incomplete Annotations for Named Entity Recognition. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2019.

[3] Peng M., Xing X., Zhang Q., et al. Distantly Supervised Named Entity Recognition using Positive-Unlabeled Learning. 2019.

[4] Nooralahzadeh F., Lønning J. T., Øvrelid L. Reinforcement-based Denoising of Distantly Supervised NER with Partial Annotation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019). 2019.

[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.

[6] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP.

[7] Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du. 2018. Analogical Reasoning on Chinese Morphological and Semantic Relations. In ACL.

**Authors: Ning Xingxing, Su Haibo**

Reposted from: Baifendian Cognitive Intelligence Lab (WeChat ID: baifendian_com)

Original link: [百分點認知智能實驗室:基於不完全標註樣本集的信息抽取實踐](https://mp.weixin.qq.com/s/qKI_H4Yv2_Brx59WiexZlg)