Percent (百分點) Cognitive Intelligence Lab: How to Build Industrial-Grade Machine Translation

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"機器翻譯是利用計算機將一種自然語言(源語言)轉換爲另一種自然語言(目標語言)的過程,不同於目前的主流機器翻譯,大多是基於神經機器翻譯,實現單純的機器翻譯,打造兼具穩定、易用、高效並符合用戶需求的工業級翻譯產品,要解決很多難題,比如:文檔內縮略語如何翻譯?小語種低資源翻譯問題如何解決?語料如何處理? "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本篇文章中,百分點認知智能實驗室基於多年的經驗積累,分享了百分點科技在工業級機器翻譯領域的技術研究和實踐成果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着經濟全球化及互聯網的飛速發展,機器翻譯技術在促進政治、經濟、文化交流等方面起到越來越重要的作用。但各大領域的翻譯需求越來越多,翻譯要求也越來越高。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. 翻譯文檔越來越多"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"據統計,美海軍“溫森斯”(CG—49)導彈巡洋艦維護手冊達23.5噸,僅空軍F-16戰鬥機技術資料約750000頁;F-18戰鬥機的技術資料有500000多頁,重達1428.84kg。每天,美軍官方和著名的諮詢公司每天新發布的裝備科技信息相關材料就超過100萬頁。而這些文檔涉及的語種,包括最常用的英文、俄文、日文以及德文、法文、意大利文、韓文等,文檔格式包括掃描版\/電子版PDF、Word、Excel、PPT等,以及各種格式的圖片(包括但不限於png, jpg,bmp, tiff等),甚至手寫材料。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. 材料內容越來越專"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"各大領域的翻譯任務包含大量的專有詞彙、縮略語,覆蓋航天、電子、船舶等各個業務,谷歌、百度等通用翻譯引擎無法滿足裝備科技信息領域內的個性化需求。同時,業務方對翻譯的效果質量要求越來越高,以更準確地瞭解最新的科技信息。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. 速度要求越來越高"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"海量資料的快速翻譯需求,對翻譯速度的要求越來越快,以更及時地獲取信息,支持科學決策。翻譯速度不僅和硬件、軟件相關,更和模型算法直接相關。在實際中,需通過模型、算法和工程層面的優化,實現翻譯速度能夠滿足技術參數要求。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. 數據安全和信息安全要求不斷提升"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不僅需要翻譯系統能夠在本地化部署、本地化運維,而且需要能在本地自動化加工語料,自動化模型訓練、迭代、升級。從而滿足整個系統的所有核心環節都能在本地完成,形成語料生產、語料加工、模型訓練、模型部署、模型運維的閉環,而不需要相關敏感的業務數據離開本地環境;同時,針對用戶自身的特定需求,可以更及時、自動地完成優化和升級,從而提高翻譯的效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百分點智能翻譯系統正是爲了應對以上“多、專、快、高”的緊迫需求而產生的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"機器翻譯發展及Transformer介紹"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
### 1. The development of machine translation

Over the past few decades, machine translation has gone through three main stages: rule-based, statistical, and neural. Rule-based machine translation requires hand-written translation rules, which is prohibitively expensive and prone to failure. Statistical machine translation is entirely data-driven, but its core idea of stitching translations together from phrases hurts quality on long sentences and carries strong prior assumptions. The mainstream approach today is neural machine translation: translation knowledge and parameters are learned automatically by a neural network, avoiding the biases introduced by hand-crafted components, and whole sentences are mapped to vector representations for translation, giving the model much stronger representational power.

![](https://static001.geekbang.org/infoq/95/9578fad1366cd3271fccc64657773b7b.png)

Figure 1. The development of machine translation

Neural machine translation began with the encoder-decoder framework proposed in 2013. For much of its development, models were built from RNNs, whose sequential nature suits language modeling but prevents efficient parallelization. The attention mechanism introduced in 2015 brought a large jump in translation quality, and the Transformer model that Google proposed on this basis in 2017 has become the foundation of today's neural machine translation.

![](https://static001.geekbang.org/infoq/3e/3e88b0706719a028929e4ab797b89b40.png)

Figure 2. The development of neural machine translation
### 2. The Transformer architecture

At its core, the Transformer is an encoder-decoder architecture with self-attention, shown in the figure below. Viewed as a whole, the left half is the encoder and the right half the decoder. The encoder reads the source sentence and encodes it into a sequence of vector representations; the decoder then decodes these vectors and generates the corresponding target-language translation.

![](https://static001.geekbang.org/infoq/40/403871f6daed71b646f33ecaa3db5636.png)

Figure 3. Overall Transformer architecture

The encoder and decoder are each a stack of six structurally identical layers (Encoder Layers and Decoder Layers). They are connected as follows: the input sequence, after passing through all Encoder Layers, yields the encoder's final output, which is fed into every Decoder Layer.

![](https://static001.geekbang.org/infoq/5b/5bb780f27595abbcd0193eff9111b9be.png)

Figure 4. Encoder and decoder stacks

Concretely, each Encoder Layer consists of two sub-layers: multi-head self-attention (the self-attention block in the encoder on the left of the figure) followed by a feed-forward network. Each Decoder Layer consists of three sub-layers: decoder multi-head self-attention (the self-attention block in the decoder on the right), encoder-decoder multi-head attention, and a feed-forward network.

![](https://static001.geekbang.org/infoq/b3/b3971c967e96e343d636a7d7e8d0a450.png)

Figure 5. A single Encoder Layer / Decoder Layer

The sections below describe each sub-structure in detail.
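As a structural orientation first, here is a minimal sketch of this wiring in Python. The sub-layer bodies are placeholders (their real contents are the subject of sections 2.1-2.4); only the stacking of six layers and the routing of the final encoder output into every decoder layer are meant to be faithful.

```python
NUM_LAYERS = 6  # both stacks use six structurally identical layers

def encoder_layer(x):
    # placeholder for: multi-head self-attention + feed-forward (sections 2.1-2.2)
    return x

def decoder_layer(y, enc_out):
    # placeholder for: masked self-attention + encoder-decoder attention + feed-forward
    return y

def transformer_forward(src_emb, tgt_emb):
    enc_out = src_emb
    for _ in range(NUM_LAYERS):        # inputs pass through every encoder layer
        enc_out = encoder_layer(enc_out)
    dec_out = tgt_emb
    for _ in range(NUM_LAYERS):        # the final encoder output feeds each decoder layer
        dec_out = decoder_layer(dec_out, enc_out)
    return dec_out
```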
#### 2.1 Multi-head self-attention

The heart of the Transformer is multi-head self-attention, which breaks down into two main steps: dot-product attention and the multi-head mechanism.

**(1) Dot-product attention**

The dot-product attention function takes three inputs: Q (query), K (key), and V (value). What Q, K, and V represent differs across the attention computations in the encoder and decoder:

- In encoder self-attention, Q = K = V: all are the representations of the encoder positions, taken from the output of the previous encoder layer, so every encoder position can attend to all positions in the layer below.
- In the first sub-layer of the decoder (decoder self-attention), Q = K = V: all are the representations of the decoder positions, and each decoder position may attend to all positions up to and including itself.
- In the second sub-layer of the decoder (encoder-decoder attention), Q comes from the previous decoder sub-layer and represents the decoder positions, while K = V come from the encoder's final output and represent the encoder positions, so every decoder position can attend to all positions of the input sequence.

![](https://static001.geekbang.org/infoq/4c/4c5e10ff053b6753deb599fd283ffd0a.png)

Figure 6. The three kinds of attention in the Transformer

Dot-product attention is computed as

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}+\mathit{Mask}\right)V$$

Step 1: multiply $Q$ by the transpose of $K$. This computes correlation coefficients via dot products, measuring the relatedness of any two positions in the sequence.

Step 2: scale by the factor $1/\sqrt{d_k}$, preventing large inputs from landing in the saturated region of the softmax, where gradients become vanishingly small.

Step 3: add the mask matrix $\mathit{Mask}$, which masks out padding positions in the sequence and, in decoder self-attention, additionally masks future positions of the target sequence.

Step 4: apply the softmax row-wise to normalize the correlation matrix; the result gives the attention weights over the positions of $V$.

Step 5: multiply by $V$, i.e., take the weighted sum of the values.
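The five steps translate almost line for line into NumPy. The following is an illustrative sketch rather than the production code; the single-head shapes and the additive-mask convention (0 for visible positions, -inf for masked ones) are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (m, d_k), K: (n, d_k), V: (n, d_v); mask: (m, n) additive, 0 or -inf."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # steps 1-2: dot products, scaled by 1/sqrt(d_k)
    if mask is not None:
        scores = scores + mask            # step 3: hide padding / future positions
    weights = softmax(scores, axis=-1)    # step 4: row-wise normalization
    return weights @ V                    # step 5: weighted sum over the values

# decoder self-attention uses a causal mask: position i may see positions <= i
n = 4
causal_mask = np.triu(np.full((n, n), -np.inf), k=1)
```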
,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三步,與掩碼矩陣𝑀𝑎𝑠𝑘相加,從而對句子序列中Padding的位置屏蔽,以及解碼器自注意力中需額外對目標語言序列未來位置的信息進行屏蔽。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四步,使用Softmax函數對相關性矩陣在行的維度上進行歸一化操作,結果對應𝑉中的不同位置上向量的注意力權重。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第五步,和𝑉進行矩陣乘法,即對Value加權求和。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"(2)多頭注意力"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將𝑄、𝐾、𝑉與參數矩陣"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a6\/a65c6d4966cbc153b43a2eb78dc72cdf.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對應做線性變換映射爲子集,分別進行點乘注意力操作得到,然後拼接這個頭並再次映射。多頭注意力機制的公式如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2d\/2d10788b6e9b99f056e2b4c545bf5f8f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/36\/36863f01a7aeb16036c04d15d9771485.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖7.點乘注意力"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2.2 前饋神經網絡"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該網絡獨立且相同的應用於每個編碼層及解碼層的最後一個子層,包含兩個線性變換,中間有一個ReLU激活函數。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2.3 殘差正則化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲防止梯度消失或者梯度爆炸並加快模型收斂,在每個子層均使用殘差鏈接和層歸一化操作:𝐿𝑎𝑦𝑒𝑟𝑁𝑜𝑟𝑚(𝑥 + 𝑆𝑢𝑏𝑙𝑎𝑦𝑒𝑟(𝑥))"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2.4 
#### 2.2 Feed-forward network

This network forms the last sub-layer of every encoder and decoder layer and is applied independently and identically at each position. It consists of two linear transformations with a ReLU activation between them.

#### 2.3 Residual connections and layer normalization

To prevent vanishing or exploding gradients and to speed up convergence, every sub-layer uses a residual connection followed by layer normalization: $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.

#### 2.4 Positional encoding

To capture word-order information in the sentence, the encoder's input embeddings and the decoder's output embeddings are each summed with the positional encoding at the corresponding position:

$$PE_{(pos,\,2i)}=\sin\!\big(pos/10000^{2i/d_{model}}\big),\qquad PE_{(pos,\,2i+1)}=\cos\!\big(pos/10000^{2i/d_{model}}\big)$$

where $pos$ is the position and $i$ the dimension.
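The sinusoidal encoding is easy to compute directly. A sketch (assuming an even d_model):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # the even dimensions 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions get cosine
    return pe

# added to the token embeddings before the first layer:
# x = embeddings + positional_encoding(seq_len, d_model)
```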
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/25\/2552c40cb94c5c47c890b1e41db4f9c5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖8.捕捉語法信息"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0a\/0a7c24aeacb396a66ad6567af8f44d0d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖9.捕捉語義信息"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一方面,Transformer可以增加到非常深的深度,使得表層的詞法信息隨着模型的逐步加深組合爲更加抽象的語義信息。Transformer充分發掘DNN模型的特性,爲模型準確率帶來提升,這也是其性能優越的原因之一。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"百分點科技智能翻譯實踐"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 產品邏輯架構"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5c\/5c3b97bfb995dbd6e53a54606de0afc4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖10.產品邏輯架構圖"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面詳細闡述各個邏輯層及其子層。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1.1 
#### 1.1 Corpus warehouse

This layer comprises four sub-layers: corpus collection, corpus cleaning, quality evaluation, and corpus storage.

**Corpus collection:** the quality of a machine translation model correlates positively with the amount of training data. To fully exploit the value of our own data and respond flexibly to future customization needs, we must keep collecting corpora of every kind. Through domestic and international projects in multilingual media monitoring, text analytics, and machine translation, Percent Technology has accumulated a large volume of multilingual corpora, laying a solid data foundation for translation quality.

**Corpus cleaning:** cleaning is a pivotal step; it determines how hard a good model is to train and is another major factor in how well a domain-specific model performs. The higher the corpus quality, the better the translations. Collected corpora go through steps such as length-imbalance handling, junk detection and removal, language identification, and punctuation alignment (a minimal sketch of such filters appears at the end of this subsection).

**Quality evaluation:** to make the model more professional and better suited to the target domain, quality evaluation selects high-quality corpora as training data. Cleaned corpora are evaluated so the cleaning steps can be tuned and adjusted; typical methods include lexical analysis, syntactic analysis, SMT-based checks, and human review.

**Corpus storage:** to support deeper work such as discovering domain-specific language patterns, formulating and mining rules, and extracting linguistic knowledge, corpora that pass quality evaluation are loaded into a database, enabling intelligent retrieval, version management, multi-dimensional classification, quality grading, and other operations.
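To make the cleaning steps concrete, here is a minimal sketch of the kinds of rule-based filters listed above (length limits, length-ratio imbalance, language identification). The thresholds, the English-Chinese direction, and the toy `detect_lang` stand-in are assumptions for illustration; a real pipeline would use a trained language-identification model and many more checks.

```python
def detect_lang(text):
    # toy stand-in for a real language identifier (e.g., a trained LID model)
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"

def keep_pair(src, tgt, max_len=200, min_ratio=0.5, max_ratio=2.0):
    """Return True if an (English, Chinese) sentence pair passes basic filters."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:                        # junk: empty side
        return False
    ls, lt = len(src.split()), len(tgt)           # rough token counts (chars for zh)
    if ls > max_len or lt > max_len:              # overly long sentences
        return False
    if not (min_ratio <= ls / lt <= max_ratio):   # length-imbalance filter
        return False
    # language-identification check: each side must be in the expected language
    return detect_lang(src) == "en" and detect_lang(tgt) == "zh"
```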
#### 1.2 Model engineering

Model engineering is the core processing function of the translation system, covering the construction and training of models for major languages and optimizations targeted at specific problems.

**Major-language translation:** to meet the demand across domains for high-quality translation of unstructured documents, we build our translation models on the Transformer, a state-of-the-art deep neural architecture, and improve quality through techniques such as back-translation. These model improvements are what keep the product professional-grade.

**Minor-language translation:** translation scenarios also involve low-resource languages; we address them with unsupervised learning, cross-lingual transfer learning, and related methods.

**Optimization for specific problems:** to fit particular domain scenarios, we optimize the model for recurring problem classes, for example entity correction, terminology intervention, numeral and measure-word correction, and completion of omitted translations.

**Speed optimization:** to track developments over a wider range of sources and respond to domain translation needs in time, we optimize inference speed, for example via reduced floating-point precision and model compression.

#### 1.3 Service architecture

For application deployment we use Nginx + Tornado + RabbitMQ, which makes models simple and fast to deploy. For external access we expose a REST API as an efficient, standard way to call the service; by protocol, the interfaces include but are not limited to HTTP.

#### 1.4 Functional applications

The functional layer is the client side, divided into a translation console and a management console. The translation console offers text and document translation to users (guests and registered users); the management console offers registered users term-base management, sentence-base management, task management, a toolbox, permission management, and related services.

### 2. Corpus collection and processing

#### 2.1 Corpus collection and sources

Training corpora are the foundation of the model, and translation quality further depends on their quality and distribution. During collection we therefore balance coverage across economics, politics, science and technology, daily life, culture, and other domains while maintaining scale, so that the training data covers real usage scenarios as fully as possible.

Collection channels include:

- bilingual data accumulated in our business;
- datasets published for research use;
- web crawling: news, subtitles, example sentences, and so on;
- purchases from corpus marketplaces;
- computer-assisted and manual alignment of bilingual books.

Besides harvesting openly available corpora from the Internet, the development team designed a model and tooling for building domain parallel corpora from monolingual text in electronic documents, which efficiently produces high-quality in-domain parallel data to support training. The Percent Cognitive Intelligence Lab team proposed learning semantic similarity by classifying translations: given a bilingual text pair as input, it trains an encoder that captures various natural-language relations, including similarity and relatedness. This greatly reduces training time while preserving accuracy on bilingual semantic-similarity classification, enabling fast automatic alignment of bilingual text and the construction of a parallel corpus at the scale of billions of sentence pairs.

#### 2.2 Corpus alignment and management

Corpus construction should make full use of NLP and related techniques to build automatic corpus-processing tools that raise alignment efficiency, improve parallel-corpus quality, and grow corpus size.

The Percent intelligent translation system manages corpora scientifically across the whole workflow, supporting on-premises, personalized model training and upgrades that improve translation quality promptly.

The corpus-processing toolchain spans OCR, conversion, cleaning, alignment, proofreading, tagging, management, retrieval, analysis, and training subsystems; one core step, embedding-based sentence alignment, is sketched below.
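The article does not spell out the alignment algorithm, but a common embedding-based scheme consistent with the description is: encode the sentences on both sides with the bilingual similarity model, then greedily pair the most similar ones. The `encode` function below is a hypothetical stand-in for that encoder (returning L2-normalized vectors), and the threshold is an assumed hyperparameter.

```python
import numpy as np

def align_sentences(src_sents, tgt_sents, encode, threshold=0.7):
    """Greedy one-to-one alignment by cosine similarity of sentence embeddings."""
    S = encode(src_sents)                    # (m, d), rows L2-normalized
    T = encode(tgt_sents)                    # (n, d)
    sim = S @ T.T                            # cosine similarity matrix
    pairs, used = [], set()
    for i in np.argsort(-sim.max(axis=1)):   # most confident source sentences first
        j = int(sim[i].argmax())
        if sim[i, j] >= threshold and j not in used:
            used.add(j)                      # each target sentence used at most once
            pairs.append((src_sents[int(i)], tgt_sents[j], float(sim[i, j])))
    return pairs
```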
#### 2.3 Corpus processing

Neural machine translation needs very large training corpora, which come from many sources in many formats, so the first step of data processing is to normalize data of different origins and formats and merge the multi-source data.

Like statistical machine translation, neural machine translation must tokenize the input and output sentences to obtain the basic units of translation. These "words" are not words in the linguistic sense; they are better thought of as the smallest translation segments suited to the machine translation task.

Natural language is extremely expressive, so many distinct words are needed to express different meanings. Neural machine translation systems, however, handle large vocabularies inefficiently: prediction over a huge output vocabulary slows down markedly or even becomes infeasible. NMT therefore uses a restricted vocabulary, for example 30,000-50,000 words. On the other hand, a restricted vocabulary produces many out-of-vocabulary (OOV) words when translating new sentences, which the system cannot translate. OOV words arise partly from the vocabulary limit and partly from too coarse a tokenization granularity. For the latter, one remedy is to split "words" further into smaller units, which greatly relieves the data sparsity caused by overly coarse tokens. This process is known as subword segmentation. Subword methods, with BPE as the representative, have become the standard in today's neural machine translation, clearly outperforming systems based on conventional tokenization (a compact sketch of BPE merge learning appears at the end of this subsection).

Machine translation also depends on high-quality training data, and in the neural era models are sensitive to it: because NMT models are complex, noise in the data strongly affects the system. In real applications especially, data sources are messy and quality varies widely, so the raw training set usually needs normalization and data cleaning to obtain high-quality bilingual data for model training.

Together, these steps constitute data processing. The figure below shows the data-processing pipeline of the Percent intelligent translation system; the main steps are tokenization, normalization, data filtering, and subword segmentation.

![](https://static001.geekbang.org/infoq/7c/7cd355efc4c2ec5965c5302ecb2a78a5.png)

Figure 11. Machine translation data-processing pipeline
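As an illustration of the subword idea, here is a compact version of BPE merge learning in the style of Sennrich et al. (2016): repeatedly merge the most frequent adjacent symbol pair. This is the textbook algorithm, not the production tokenizer; word frequencies come in as a plain dictionary.

```python
import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # represent each word as space-separated symbols with an end-of-word marker
    vocab = {" ".join(w) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        # merge the pair everywhere it occurs as two separate symbols
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges

# e.g., learn_bpe({"lower": 5, "lowest": 3, "newer": 6, "wider": 2}, 10)
```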
### 3. Model training

Training a Transformer proceeds as follows. The model is initialized, and the encoder is fed the source word sequence including the end-of-sentence symbol. Each position's prediction on the decoder side depends on the sequence generated so far: the decoder is fed the target sequence including the start symbol, the start symbol is used to predict the first target word, the true first target word is used to predict the second, and so on. The predictions are then compared with the true target sequence to compute the loss. The Transformer uses the cross-entropy loss; the smaller the loss, the closer the model's predictions are to the true output. Backpropagation then adjusts the model parameters. Because the Transformer shrinks the distance between inputs at any two time steps to 1 and abandons the RNN's sequential scheme, in which each step's computation depends on the previous step, different positions can be trained in parallel, greatly improving training efficiency.

Note that the Transformer involves many engineering tricks. First, regarding the optimizer: the Transformer uses Adam to optimize its parameters, and it applies a warm-up strategy to the learning rate.

In addition, to improve training efficiency and performance, the Transformer uses the following techniques.

Mini-batch training: each update uses a fixed number of samples, i.e., a small subset of the data. This converges faster and makes better use of the hardware. Sentences in a batch are not chosen at random; the model usually sorts by length so that each batch contains sentences of similar length, which reduces padding and improves training efficiency.

Dropout: the Transformer's network structure is complex enough to overfit the training data, degrading predictions on unseen data, a phenomenon known as overfitting. To avoid it, the Transformer applies dropout in four places: word embeddings and positional encodings, residual connections, attention, and the feed-forward networks.

Label smoothing: computing the loss means fitting the predicted distribution to the true one. Classification tasks usually represent the truth as a one-hot vector, with probability 1 at the gold label and 0 elsewhere, and fitting that distribution causes two problems:

- it does not guarantee generalization and easily causes overfitting;
- probabilities of exactly 0 and 1 push the gap between the gold class and all other classes as wide as possible, making the model overconfident in its predictions.

The Transformer therefore introduces label smoothing to relieve this: instead of all-or-nothing probabilities, a small amount of probability is allocated to the classes other than the correct answer. The model then learns a smoother distribution and generalizes better.
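Concretely, label smoothing only changes the target distribution fed into the cross-entropy loss. A NumPy sketch (eps = 0.1 is the value used in the original Transformer paper; spreading the remainder uniformly over the other classes is one common variant):

```python
import numpy as np

def smoothed_targets(target_ids, vocab_size, eps=0.1):
    """One-hot targets -> label-smoothed distributions:
    1 - eps on the gold token, eps shared evenly by all other classes."""
    n = len(target_ids)
    dist = np.full((n, vocab_size), eps / (vocab_size - 1))
    dist[np.arange(n), target_ids] = 1.0 - eps
    return dist

def cross_entropy(log_probs, target_dist):
    """log_probs: (n, vocab_size) log-softmax outputs of the model."""
    return -(target_dist * log_probs).sum(axis=-1).mean()
```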
### 4. Translation quality

#### 4.1 Optimizing low-resource translation

Machine translation relies on large, high-quality parallel corpora, but for minor languages data is scarce and parallel text hard to collect. To combat data sparsity, Percent Technology uses back-translation to enlarge the corpus and thereby improve quality. Taking the Japanese-Chinese model as an example, back-translation expanded the original 33.08 million parallel sentence pairs to roughly 67 million before retraining. Compared with pivoting through English as an intermediate language, this raised BLEU by 10.4 points in the Chinese-to-Japanese direction and 12.5 points in the Japanese-to-Chinese direction; the comparison is shown in the table below.

Table 2. BLEU in both directions compared with Company A

![](https://static001.geekbang.org/infoq/9e/9e5abde61ecec9ee015c929936dd3b80.png)

#### 4.2 Optimizing terminology translation

More and more professional translators now call on machine translation and post-edit its output. This workflow, MTPE (machine translation + post-editing), effectively boosts productivity, but translators are often held back by inaccurate terminology in the machine output. Whenever the output disagrees with given terms, conventional renderings, or proper nouns, they must spend considerable time on manual find-and-replace.

Terminology intervention improves the accuracy of machine-translated company names, brand names, industry abbreviations, and other terms, relieving translators of filling terms in by hand. The machine translation + terminology intervention workflow keeps terminology consistent throughout a text and greatly improves the productivity of translators and reviewers as well as translation quality. (A sketch of one common implementation appears at the end of this section.)

The Percent intelligent translation system dynamically extracts abbreviations from a document and renders them as abbreviation + full form, as shown below:

![](https://static001.geekbang.org/infoq/ab/abb1c0059696e458054091a4cf09c9f5.png)

Figure 12. Abbreviation translation in the Percent intelligent translation system

#### 4.3 Percent's translation quality

Table 3. BLEU scores of the Percent intelligent translation system

![](https://static001.geekbang.org/infoq/03/036906898399a21d84678c25c11ec594.png)
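The article does not disclose how terminology intervention is implemented; one common scheme consistent with its description is to protect glossary terms with placeholders before translation and restore the required target-language terms afterwards. A hedged sketch (`translate` stands for the underlying MT engine; the placeholder format is arbitrary):

```python
import re

def translate_with_terms(src, translate, term_dict):
    """Force glossary translations via placeholder substitution."""
    slots = {}
    for k, (term_src, term_tgt) in enumerate(term_dict.items()):
        placeholder = f"TERM{k}"
        if re.search(re.escape(term_src), src, flags=re.IGNORECASE):
            src = re.sub(re.escape(term_src), placeholder, src, flags=re.IGNORECASE)
            slots[placeholder] = term_tgt     # remember the required rendering
    out = translate(src)                      # the neural MT engine
    for placeholder, term_tgt in slots.items():
        out = out.replace(placeholder, term_tgt)
    return out
```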
### 5. Distinctive features

Through repeated iteration and polish, the Percent intelligent translation system has accumulated six distinctive features:

- Multilingual translation among Chinese, English, Russian, French, Spanish, Arabic, German, Japanese, Korean, and other languages.
- Four major functions: document translation, text translation, document conversion, and chart and table extraction.
- Mixed-language translation: automatic identification and translation of documents containing several languages, i.e., a mixed-language document can be uploaded and translated into the specified target language.
- Terminology-intervened translation: term bases, sentence bases, and abbreviation bases can all steer the neural machine translation output.
- Automatic abbreviation recognition: abbreviations in a document are automatically recognized, matched, and intelligently translated; if an abbreviation appears once together with its full form, occurrences of the abbreviation alone elsewhere are also rendered with the translation of the full form.
- Both on-premises and SaaS deployment.

## Conclusion

Machine translation algorithms are advancing rapidly. As global information exchange accelerates, translation is taking more diverse forms and expectations of quality keep rising. Percent Technology will keep pushing on translation quality, exploring multimodal translation that fuses speech and images, meta-learning, transfer learning, and other methods, tracking the research frontier and carrying out its mission of serving social development with cognitive intelligence technology.

**References**

[1] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 2014.

[2] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

[3] Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv, 2014.

[4] Wu Y, Schuster M, Chen Z, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[5] Gehring J, Auli M, Grangier D, Yarats D, Dauphin Y N. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.

[6] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems, pages 5998-6008, 2017.

[7] 肖桐, 朱靖波. 機器翻譯統計建模與深度學習方法 (Machine Translation: Statistical Modeling and Deep Learning Methods).

This article is reproduced from: 百分點認知智能實驗室 (WeChat ID: baifendian_com)

Original link: [百分點認知智能實驗室:如何打造工業級的機器翻譯](https://mp.weixin.qq.com/s/De1AySVQd8z5pcL07sJrEQ)