word2vec error when looking up word vectors: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

Loading the word2vec model raises an error:

    import word2vec

    model_path = "model/Hanlp_cut_news.bin"
    w2v_dict = word2vec.load(model_path)
    print(w2v_dict["奧運"])
Traceback (most recent call last):
  File "/home/iiip/PycharmProjects/smp_yinglish/demo1/data_preprocess.py", line 10, in <module>
    w2v_dict = word2vec.load(model_path)
  File "/home/iiip/.local/lib/python3.5/site-packages/word2vec/io.py", line 18, in load
    return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
  File "/home/iiip/.local/lib/python3.5/site-packages/word2vec/wordvectors.py", line 202, in from_binary
    vocab[i] = word.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

I checked the encoding of my segmented text file. It is UTF-8, so no problem there:

$ file Hanlp_cut_news.txt
Hanlp_cut_news.txt: UTF-8 Unicode text, with very long lines, with no line terminators

Then I checked the trained .bin file. `data` means a binary file, so that is fine too:

$ file Hanlp_cut_news.bin 
Hanlp_cut_news.bin: data

Going back to the error message, I opened line 202 of wordvectors.py (the file named in the traceback) and noticed:

                if include:
                    vocab[i] = word.decode(encoding)

I temporarily changed the source to:

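                if include:
                    try:
                        print(word)  # raw bytes read from the .bin file
                        print(word.decode(encoding))  # raises on the malformed word
                        vocab[i] = word.decode(encoding)
                    except UnicodeDecodeError:
                        vocab[i] = word  # keep the raw bytes so loading continues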

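In the resulting output, the program stopped at an unusually long byte string, which suggests that one of the segmented tokens either has a corrupted encoding or is too long.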
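The same check can be done without editing the installed package. The sketch below is only an illustration and assumes the usual word2vec binary layout: a text header line with the vocabulary size and vector size, followed by each word terminated by a space and then its vector as float32 values. The function name `find_undecodable_words` is hypothetical and not part of the word2vec package.

```python
def find_undecodable_words(path, encoding="utf-8"):
    """Return (index, raw_bytes) for each vocabulary entry that fails to decode.

    Assumes the common word2vec binary layout; adjust if your file differs.
    """
    bad = []
    with open(path, "rb") as f:
        vocab_size, vector_size = map(int, f.readline().split())
        vector_bytes = 4 * vector_size  # each component is assumed to be a float32
        for i in range(vocab_size):
            word = bytearray()
            while True:
                ch = f.read(1)
                if ch in (b" ", b""):  # a space ends the word; b"" means end of file
                    break
                if ch != b"\n":  # skip the newline left after the previous vector
                    word.extend(ch)
            f.seek(vector_bytes, 1)  # jump over the vector without parsing it
            try:
                bytes(word).decode(encoding)
            except UnicodeDecodeError:
                bad.append((i, bytes(word)))
    return bad
```

Calling `find_undecodable_words("model/Hanlp_cut_news.bin")` should list the offending entries together with their positions in the vocabulary.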

I copied that long byte string out to test it:

line = '\xe9\x98\xbf\xe5\xb0\x94\xe6\xaf\x94\xe5\xb7\xb4\xe9\x87\x8c\xe5\xb8\x83\xe9\x9b\xb7\xe8\xa5\xbf\xe6\xa0\xbc\xe7\xbd\x97\xe7\x91\x9f\xe6\x9b\xbc\xe6\x89\x98\xe7\x93\xa6\xe6\xa2\x85\xe8\xa5\xbf\xe7\xba\xb3\xe6\x91\xa9\xe5\xbe\xb7\xe7\xba\xb3\xe7\x89\xb9\xe9\x87\x8c\xe5\x9f\x83\xe8\xb5\xab\xe5\xba\x93\xe6\x96\xaf\xe9\x98\xbf\xe6\x8b\x89\xe7\xbb\xb4\xe6\xb2\x99\xe6\x8b\x89\xe6\x9b\xbc\xe5\x8d'
print (line)

The output is a pile of garbled characters...
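Part of the garbling is probably due to the test itself: in Python 3 the literal above is a `str`, so each `\xNN` escape becomes its own character rather than a raw byte. Counting the escapes gives 98 bytes: 32 complete three-byte characters plus two dangling bytes (`\xe5\x8d`) at positions 96-97, exactly where the error message points. That suggests the word was cut off in the middle of a character, possibly by a length limit in the training tool (an assumption; nothing above confirms it). A minimal sketch using only the last 11 bytes as a bytes literal (the variable names are just for illustration):

```python
# Last 11 bytes of the string above, written as a bytes literal (note the b prefix).
tail = b'\xe6\xb2\x99\xe6\x8b\x89\xe6\x9b\xbc\xe5\x8d'

try:
    tail.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = exc  # same "unexpected end of data" reason as in the traceback

# Dropping the dangling bytes recovers the readable part of the word.
readable = tail.decode('utf-8', errors='ignore')

print(error)
print(readable)
```

Decoding with `errors='ignore'` keeps the intact characters and silently discards the incomplete final one, which is one alternative to replacing the whole word.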

The fix for the original problem is to change the source to:

                if include:
                    # vocab[i] = word.decode(encoding)
                    try:
                        # print (word)
                        # print (word.decode(encoding))
                        vocab[i] = word.decode(encoding)
                    except:
                        # vocab[i] = word
                        vocab[i] = 'UNK'
                        print (word, 'UNK')

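This simply skips the offending word by storing it as 'UNK'. Admittedly, the better fix would be to clean the data at segmentation time (but my segmented file is large, and rerunning the segmenter is not worth the cost).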
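For reference, the cleaning step could look roughly like the sketch below, applied to the segmented text before training. It assumes tokens are separated by whitespace; the function name `drop_overlong_tokens` and the `max_bytes=90` threshold are hypothetical choices for illustration, picked only because the failing word above was 98 bytes long.

```python
def drop_overlong_tokens(text, max_bytes=90, encoding="utf-8"):
    """Remove whitespace-separated tokens whose encoded length exceeds max_bytes.

    max_bytes is an illustrative threshold, not a documented word2vec limit.
    """
    kept = [
        token
        for token in text.split()
        if len(token.encode(encoding)) <= max_bytes
    ]
    return " ".join(kept)


# A 40-character token (120 bytes in UTF-8) is removed; short tokens are kept.
sample = "奧運 " + "阿" * 40 + " 北京"
cleaned = drop_overlong_tokens(sample)
print(cleaned)
```

Since the segmented file here has very long lines and no line terminators, a real run would likely need to read it in chunks and take care not to split a token at a chunk boundary.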
