word2vec error when looking up word vectors: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

Original post

2020-06-07 14:13

Loading the word2vec model raised an error:

    import word2vec

    model_path = "model/Hanlp_cut_news.bin"
    w2v_dict = word2vec.load(model_path)  # this call raises the UnicodeDecodeError
    print(w2v_dict["奧運"])

Traceback (most recent call last):
  File "/home/iiip/PycharmProjects/smp_yinglish/demo1/data_preprocess.py", line 10, in <module>
    w2v_dict = word2vec.load(model_path)
  File "/home/iiip/.local/lib/python3.5/site-packages/word2vec/io.py", line 18, in load
    return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
  File "/home/iiip/.local/lib/python3.5/site-packages/word2vec/wordvectors.py", line 202, in from_binary
    vocab[i] = word.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

I checked the encoding of my segmented text file. It is UTF-8, so no problem there:

$ file Hanlp_cut_news.txt
Hanlp_cut_news.txt: UTF-8 Unicode text, with very long lines, with no line terminators

Then I checked the trained .bin file. `data` means it is a binary file, so that is fine too:

$ file Hanlp_cut_news.bin 
Hanlp_cut_news.bin: data

Going back to the error message, I opened line 202 of wordvectors.py (the file named in the traceback) and noticed:

                if include:
                    vocab[i] = word.decode(encoding)

I temporarily changed the source to print each word before decoding it:

                if include:
                    try:
                        print(word)                   # raw bytes read from the .bin file
                        print(word.decode(encoding))  # decoded text, if decoding succeeds
                        vocab[i] = word.decode(encoding)
                    except UnicodeDecodeError:
                        vocab[i] = word

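In the resulting output, the program stopped at one especially long byte string. My guess is that some token produced by the segmenter is either mis-encoded or too long.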
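
The same check can be sketched without editing the installed package. The snippet below is a minimal sketch, not part of the word2vec library: `find_undecodable_words` is a hypothetical helper, and it assumes the .bin file follows the layout written by the original word2vec tool (a text header with vocabulary size and vector size, then for each word its bytes, one space, and that many float32 values, optionally followed by a newline). The demo data is built in memory purely for illustration.

```python
import io
import struct


def find_undecodable_words(stream, encoding="utf-8"):
    """Return (index, raw_bytes) for every word in the stream that fails to decode.

    Assumed layout: "<vocab_size> <vector_size>\n", then per word:
    word bytes, one space, vector_size float32 values, optional newline.
    """
    header = stream.readline().decode("ascii").split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vector_bytes = 4 * vector_size  # float32 is 4 bytes
    bad = []
    for i in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # newline left over from the previous record
                word += ch
        stream.read(vector_bytes)  # skip the vector, only the word matters here
        try:
            bytes(word).decode(encoding)
        except UnicodeDecodeError:
            bad.append((i, bytes(word)))
    return bad


def _record(word_bytes, vector):
    """Build one word record in the assumed layout (demo helper only)."""
    return word_bytes + b" " + struct.pack("<%df" % len(vector), *vector) + b"\n"


# Tiny in-memory "model": one good word, one word cut off mid-character.
demo = io.BytesIO(
    b"2 3\n"
    + _record("奧運".encode("utf-8"), [0.1, 0.2, 0.3])
    + _record("奧運".encode("utf-8")[:-1], [0.4, 0.5, 0.6])
)
print(find_undecodable_words(demo))
```

For a real model, the stream would be the file opened in binary mode, for example `open("model/Hanlp_cut_news.bin", "rb")`.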

I copied that long byte string out to test it:

line = '\xe9\x98\xbf\xe5\xb0\x94\xe6\xaf\x94\xe5\xb7\xb4\xe9\x87\x8c\xe5\xb8\x83\xe9\x9b\xb7\xe8\xa5\xbf\xe6\xa0\xbc\xe7\xbd\x97\xe7\x91\x9f\xe6\x9b\xbc\xe6\x89\x98\xe7\x93\xa6\xe6\xa2\x85\xe8\xa5\xbf\xe7\xba\xb3\xe6\x91\xa9\xe5\xbe\xb7\xe7\xba\xb3\xe7\x89\xb9\xe9\x87\x8c\xe5\x9f\x83\xe8\xb5\xab\xe5\xba\x93\xe6\x96\xaf\xe9\x98\xbf\xe6\x8b\x89\xe7\xbb\xb4\xe6\xb2\x99\xe6\x8b\x89\xe6\x9b\xbc\xe5\x8d'
print (line)

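The output is a pile of garbled characters. That is partly because the bytes were pasted into a str literal rather than a bytes literal, so Python 3 treats every \x escape as a separate character.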
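
To see what the bytes really contain, they can be handled as a bytes literal and split at the point where decoding fails. This is a minimal sketch assuming Python 3; `inspect_utf8` is a hypothetical helper name, and the data is the byte string copied from above. It suggests that the first 96 bytes are valid UTF-8 and that the last two bytes, `\xe5\x8d`, are the start of a three-byte character that was cut off, which matches "position 96-97: unexpected end of data" in the traceback.

```python
# Same bytes as above, written as a bytes literal (note the b prefix).
RAW = (
    b"\xe9\x98\xbf\xe5\xb0\x94\xe6\xaf\x94\xe5\xb7\xb4\xe9\x87\x8c\xe5\xb8\x83\xe9\x9b\xb7\xe8\xa5\xbf"
    b"\xe6\xa0\xbc\xe7\xbd\x97\xe7\x91\x9f\xe6\x9b\xbc\xe6\x89\x98\xe7\x93\xa6\xe6\xa2\x85\xe8\xa5\xbf"
    b"\xe7\xba\xb3\xe6\x91\xa9\xe5\xbe\xb7\xe7\xba\xb3\xe7\x89\xb9\xe9\x87\x8c\xe5\x9f\x83\xe8\xb5\xab"
    b"\xe5\xba\x93\xe6\x96\xaf\xe9\x98\xbf\xe6\x8b\x89\xe7\xbb\xb4\xe6\xb2\x99\xe6\x8b\x89\xe6\x9b\xbc"
    b"\xe5\x8d"  # dangling tail: only 2 bytes of a 3-byte character
)


def inspect_utf8(raw):
    """Return (decodable_text, leftover_bytes) for a possibly truncated UTF-8 string."""
    try:
        return raw.decode("utf-8"), b""
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded.
        return raw[:err.start].decode("utf-8"), raw[err.start:]


text, leftover = inspect_utf8(RAW)
print(len(RAW), "bytes,", len(text), "characters decoded")
print(text)
print("undecodable tail:", leftover)
```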

The fix for the original problem is to change the source to:

                if include:
                    # Original line: vocab[i] = word.decode(encoding)
                    try:
                        vocab[i] = word.decode(encoding)
                    except UnicodeDecodeError:
                        # Replace the undecodable word with a placeholder and log it
                        vocab[i] = 'UNK'
                        print(word, 'UNK')

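This replaces the offending word with 'UNK', which effectively skips it. Of course, the better approach would be to clean the data at segmentation time (but my segmented file is very large, so rerunning the segmenter is not worth it).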
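
For reference, the cleaning step mentioned above could look roughly like the sketch below. It rests on assumptions not confirmed in this post: that the segmented file is whitespace-separated, and that the bad vocabulary entry came from an over-long token being cut at a fixed byte length during training. `clean_tokens` and `MAX_TOKEN_BYTES` are hypothetical names, and the limit of 90 bytes is chosen only to stay below the 98 bytes seen in the broken entry.

```python
MAX_TOKEN_BYTES = 90  # hypothetical limit, below the 98 bytes observed above


def clean_tokens(tokens, max_bytes=MAX_TOKEN_BYTES, placeholder="UNK"):
    """Yield tokens, replacing any whose UTF-8 encoding is longer than max_bytes."""
    for token in tokens:
        if len(token.encode("utf-8")) > max_bytes:
            yield placeholder
        else:
            yield token


# Illustrative input: the middle token is 40 characters, i.e. 120 bytes in UTF-8.
segmented = "奧運 " + "阿" * 40 + " 新聞"
cleaned = " ".join(clean_tokens(segmented.split()))
print(cleaned)
```

Replacing the token keeps the sentence length unchanged. Dropping it instead would also work if token positions do not matter for training.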
