Processing Chinese with the CoreNLP Python Interface

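CoreNLP is an open-source NLP toolkit developed at Stanford. It provides tokenization, part-of-speech tagging, parsing, and other functions, much like SpaCy. SpaCy claims to be the fastest NLP system available and ships with a ready-made Python interface, but at the time of writing it does not support Chinese. CoreNLP includes Chinese models and can process Chinese directly, but it is written in Java, so calling it from Python takes a little more work.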

Installation

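Installation is simple: download the latest CoreNLP archive and the jar for the language you need, both from the CoreNLP download page. Unpack the archive, then put the language jar into the resulting directory.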

Starting the NLP Server

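Because CoreNLP is written in Java, there is no native Python package to use directly. However, CoreNLP can run as a server that accepts HTTP requests, so a thin Python wrapper can talk to the server and be used much like a native Python package.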

For Chinese, start the CoreNLP server by running the following command from the CoreNLP directory:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000

CoreNLP currently requires JDK 1.8 or later. -Xmx4g sets the server's maximum JVM heap size to 4 GB. -serverProperties specifies the properties file, which is inside the Chinese model jar.

The first run after the server starts is fairly slow, because the various resources it needs have to be loaded.
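Because startup takes a while, a client script may want to wait until the server's port accepts connections before sending anything. The following is a minimal Python 3 sketch, assuming the server listens on localhost:9000 as in the command above. wait_for_port is a hypothetical helper name, not part of CoreNLP, and an open port only shows that the process is listening, not that everything has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port('localhost', 9000, timeout=120) could be called right after launching the java command and before the first request.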

Basic HTTP Request

wget --post-data 'The quick brown fox jumped over the lazy dog.' 'localhost:9000/?properties={"annotators":"tokenize,ssplit,pos","outputFormat":"json"}'

This sends an HTTP POST request. The equivalent in Python:

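import json

import requests

url = 'http://192.168.200.169:9000'
properties = {'annotators': 'tokenize,ssplit,pos', 'outputFormat': 'json'}

# properties must be passed as a single string: requests does not
# serialize a nested dict into a query parameter
params = {'properties': json.dumps(properties)}

# Encode explicitly so that the Chinese text is sent as UTF-8 bytes
data = u'天氣非常好'.encode('utf-8')

resp = requests.post(url, data=data, params=params)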
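With outputFormat set to json, the response body can be parsed with resp.json() and walked to collect each token and its tag. The sketch below assumes the usual shape of the server's JSON output (a "sentences" list whose items hold a "tokens" list with "word" and "pos" fields). extract_word_pos is a hypothetical helper name, and the sample document is hand-written for illustration, not real server output.

```python
import json


def extract_word_pos(doc):
    """Collect (word, pos) pairs from a parsed CoreNLP JSON document."""
    pairs = []
    for sentence in doc.get('sentences', []):
        for token in sentence.get('tokens', []):
            pairs.append((token['word'], token['pos']))
    return pairs


# Hand-written sample shaped like the server's JSON output (illustrative only).
sample = json.loads('{"sentences": [{"index": 0, "tokens": ['
                    '{"index": 1, "word": "天氣", "pos": "NN"}, '
                    '{"index": 2, "word": "好", "pos": "VA"}]}]}')

print(extract_word_pos(sample))
```

Against a live server, the same function would be called as extract_word_pos(resp.json()).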

Official Python Interface

The CoreNLP team also provides a ready-made Python interface: https://github.com/stanfordnlp/python-stanford-corenlp

To install it, clone the repository and run the installer from inside the python-stanford-corenlp directory:

git clone https://github.com/stanfordnlp/python-stanford-corenlp
cd python-stanford-corenlp
sudo python setup.py install

Setting the JAVANLP_HOME Environment Variable

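This Python interface is not a complete CoreNLP implementation in Python. It only wraps the approach described above: start the server, then have a client send HTTP requests to it. Underneath, it still depends on the CoreNLP server running in a JVM. The server can be started locally when the code runs, so the program needs to know where the Java CoreNLP directory is. To avoid passing that path every time, the code reads it from an environment variable named JAVANLP_HOME.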

So add the JAVANLP_HOME environment variable to ~/.bashrc or ~/.bash_profile:

JAVANLP_HOME="/path/to/corenlp"
export JAVANLP_HOME
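Before starting the client, it can help to confirm that the variable is set and points at a directory that actually contains the CoreNLP jars. A minimal sketch, assuming the jars sit directly in that directory as described in the installation step. find_corenlp_jars is a hypothetical helper name, not part of the Python interface.

```python
import glob
import os


def find_corenlp_jars(home=None):
    """Return the sorted jar paths under the CoreNLP directory, or raise if unusable."""
    # Fall back to the environment variable when no path is given.
    home = home or os.getenv('JAVANLP_HOME')
    if not home:
        raise RuntimeError('JAVANLP_HOME is not set')
    if not os.path.isdir(home):
        raise RuntimeError('JAVANLP_HOME is not a directory: %s' % home)
    jars = sorted(glob.glob(os.path.join(home, '*.jar')))
    if not jars:
        raise RuntimeError('no jar files found in %s' % home)
    return jars
```

Calling find_corenlp_jars() with no argument checks the current value of JAVANLP_HOME, and listing the result is a quick way to see whether the Chinese model jar was copied into place.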

Modifying the Code to Handle Chinese

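Handling Chinese requires a few more changes, though. You can fork the project on GitHub and make the changes there, so that later you can simply clone your modified fork wherever you need it.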

The change goes in python-stanford-corenlp/corenlp/client.py, in the server start command start_cmd inside the __init__ method of CoreNLPClient. The original code is:

start_cmd = "{javanlp}/bin/javanlp.sh edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port {port} -timeout {timeout}".format(
                javanlp=os.getenv("JAVANLP_HOME"),
                port=port,
                timeout=timeout)

Change it to:

start_cmd = 'java -Xmx{memory}g -cp "{javanlp}/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port {port} -timeout {timeout}'.format(
                memory=allocate_mem,
                javanlp=os.getenv("JAVANLP_HOME"),
                port=port,
                timeout=timeout)

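The original start_cmd is rather rigid, and, possibly because I downloaded the wrong CoreNLP version, my directory has no bin directory and no javanlp.sh script. So I changed the command to call java directly with java -Xmx{memory}g -cp "{javanlp}/*", where the memory parameter sets how much memory the server gets.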
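As an aside, the same command can be assembled as an argument list instead of a formatted string, which avoids shell quoting issues when the path contains spaces. This is only a standalone sketch: build_start_cmd and its parameter names are hypothetical, and it is not a drop-in replacement for start_cmd, since the client code shown in this post builds a command string.

```python
import os


def build_start_cmd(javanlp, port=9000, timeout=15000, memory_gb=4,
                    properties='StanfordCoreNLP-chinese.properties'):
    """Build the server start command as an argument list (no shell involved)."""
    return [
        'java',
        '-Xmx{}g'.format(memory_gb),
        # The JVM itself expands a trailing '*' in a classpath entry
        # to all jar files in that directory, so no shell glob is needed.
        '-cp', os.path.join(javanlp, '*'),
        'edu.stanford.nlp.pipeline.StanfordCoreNLPServer',
        '-serverProperties', properties,
        '-port', str(port),
        '-timeout', str(timeout),
    ]
```

Such a list could be passed to subprocess.Popen, for example subprocess.Popen(build_start_cmd(os.environ['JAVANLP_HOME'])), to launch the server from a script.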

The -serverProperties argument is added so that the server can handle Chinese. The modified __init__ method is:

    DEFAULT_ANNOTATORS = "tokenize ssplit pos ner depparse".split()
    DEFAULT_PROPERTIES = {}

    def __init__(self, start_server=True, endpoint="http://localhost:9000", 
        timeout=5000, annotators=DEFAULT_ANNOTATORS, properties=DEFAULT_PROPERTIES, allocate_mem=4):
        if start_server:
            host, port = urlparse(endpoint).netloc.split(":")
            assert host == "localhost", "If starting a server, endpoint must be localhost"

            assert os.getenv("JAVANLP_HOME") is not None, "Please define $JAVANLP_HOME where your CoreNLP Java checkout is"
            start_cmd = 'java -Xmx{memory}g -cp "{javanlp}/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port {port} -timeout {timeout}'.format(
                memory=allocate_mem,
                javanlp=os.getenv("JAVANLP_HOME"),
                port=port,
                timeout=timeout)
            stop_cmd = None
        else:
            start_cmd = stop_cmd = None

        super(CoreNLPClient, self).__init__(start_cmd, stop_cmd, endpoint)
        self.default_annotators = annotators
        self.default_properties = properties

I also removed lemma from DEFAULT_ANNOTATORS, since lemmatization is of no use when processing Chinese.
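One detail of the __init__ method worth noting: urlparse(endpoint).netloc.split(":") unpacks into exactly two values, so an endpoint without an explicit port, such as "http://localhost", raises a ValueError. A more tolerant variant could use the hostname and port attributes of the parsed URL. The sketch below is written for Python 3. split_endpoint and the default port of 9000 are assumptions for illustration, not part of the library.

```python
from urllib.parse import urlparse


def split_endpoint(endpoint, default_port=9000):
    """Return (host, port) from an endpoint URL, tolerating a missing port."""
    parsed = urlparse(endpoint)
    host = parsed.hostname
    # parsed.port is None when the URL carries no explicit port.
    port = parsed.port if parsed.port is not None else default_port
    return host, port
```

Note that the port comes back as an integer here, whereas the split in the original code yields a string. Both work with the str.format call that builds the start command.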

Example

After modifying the code, run sudo python setup.py install again.

Sample code:

#-*- coding:utf-8 -*-

import corenlp


text = u'今天是一個大晴天'

with corenlp.CoreNLPClient(annotators='tokenize ssplit pos'.split()) as client:
    ann = client.annotate(text)
    sentence = ann.sentence[0]

    for token in sentence.token:
        print(token.word, token.pos)

Output:

今天 NT
是 VC
一 CD
個 M
大 JJ
晴天 NN
