Python Development: An Example of Computing Chinese Text Similarity with TF Feature Vectors and Simhash Fingerprints

Original

2020-06-22 16:59

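1. Introduction

I have recently been studying text similarity algorithms in NLP. This article uses TF-IDF feature vectors and Simhash fingerprints to compute the similarity of Chinese texts.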
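
For readers new to Simhash, the fingerprint idea can be sketched as follows. This is a minimal illustration, not the project's implementation: the MD5-based token hash, the 64-bit width, the character-bigram tokenizer, and the names `simhash`, `hamming_distance`, and `bigrams` are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def simhash(weights, bits=64):
    """Return a `bits`-wide Simhash fingerprint for a {token: weight} mapping.

    Assumes `bits` is a multiple of 8 and at most 128 (the MD5 digest size).
    """
    totals = [0.0] * bits
    for token, weight in weights.items():
        # Hash each token to a stable integer (MD5 is an arbitrary choice here).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[i] += weight if (h >> i) & 1 else -weight
    # The fingerprint has a 1 bit wherever the weighted vote is positive.
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def bigrams(text):
    """Hypothetical stand-in tokenizer: overlapping character bigrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Term counts serve as the token weights in this sketch.
fp_a = simhash(Counter(bigrams("如何修改登錄密碼")))
fp_b = simhash(Counter(bigrams("怎麼修改登錄密碼")))
fp_c = simhash(Counter(bigrams("今天天氣怎麼樣")))

print("a vs b:", hamming_distance(fp_a, fp_b))
print("a vs c:", hamming_distance(fp_a, fp_c))
```

Texts that share most of their tokens tend to produce fingerprints with a small Hamming distance, which is why the fingerprint can serve as a cheap similarity signal.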

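2. Computation process

Prepare the test data
Preprocess the loaded data
Load the data into a Map
Read the user's question
Use the TF feature vectors and Simhash fingerprints to score the question against the preprocessed configuration file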
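
The first steps above (preparing, preprocessing, and loading the data, then vectorizing the user's question) might look roughly like this. It is a sketch under stated assumptions: `SAMPLE_DATA`, `preprocess`, `tokenize`, `tf_vector`, and `build_node_map` are hypothetical names, the sample questions are invented for illustration, and character bigrams stand in for a proper Chinese word segmenter.

```python
import re
from collections import Counter

# Hypothetical test data: node id -> candidate questions for that node.
SAMPLE_DATA = {
    "node_1": ["如何修改登錄密碼", "忘記密碼怎麼辦"],
    "node_2": ["如何查詢訂單狀態"],
}


def preprocess(text):
    """Keep only CJK characters, ASCII letters, and digits."""
    return "".join(re.findall(r"[\u4e00-\u9fffA-Za-z0-9]+", text))


def tokenize(text):
    """Stand-in tokenizer: character bigrams instead of real word segmentation."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def tf_vector(tokens):
    """Term-frequency vector as {token: count / total number of tokens}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}


def build_node_map(data):
    """Precompute one feature vector per candidate question, grouped by node id."""
    return {
        node_id: [
            {"labelData": q, "featVec": tf_vector(tokenize(preprocess(q)))}
            for q in questions
        ]
        for node_id, questions in data.items()
    }


node_map = build_node_map(SAMPLE_DATA)
question_vec = tf_vector(tokenize(preprocess("請問，如何修改登錄密碼？")))
print(len(node_map["node_1"]), "candidates loaded for node_1")
```

Precomputing the candidate vectors once keeps the per-question work down to vectorizing the input and comparing it against the stored entries.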

3. Result screenshot

4. Core code

    try:
        # Find the matching data in the question with regular expressions
        text = re_test.run(question)
        # Preprocess: tokenize the text
        doc_token = jt.tokens(text)
        doc_feat = fb.compute(doc_token)
        # doc_fl exposes two attributes:
        #   fingerprint: the fingerprint score
        #   feat_vec: a list of tuples
        doc_fl = DocFeatLoader(smb, doc_feat)

        # The preprocessed configuration file
        contentFlListMap = nodeMap
        p_score_list = []
        if nodeId in contentFlListMap:
            nodeFlList = contentFlListMap[nodeId]
            print("nodeFlList", nodeFlList)
            # Score every candidate of this node against the user's question
            for node in nodeFlList:
                dist = cosine_distance_nonzero(node["lableDataFeatureVector"].feat_vec, doc_fl.feat_vec, norm=False)
                p_score_list.append({
                    "score": dist,
                    "labelData": node["labelData"],
                    "targetNodeId": node["targetNodeId"],
                    "conditionId": node["conditionId"],
                })
            # Highest score first
            p_score_list = sorted(p_score_list, key=lambda item: item["score"], reverse=True)

            print("Sorted:", p_score_list)

            # Keep at most the top Complete_MayBeL4Max candidates
            Complete_MayBeL4Max = 3
            top_items = p_score_list[:Complete_MayBeL4Max]
            Complete_MayBeL4 = [item["labelData"] for item in top_items]
            Complete_MayBeL4Score = [item["score"] for item in top_items]
            Complete_MayBeL4ID = [item["conditionId"] for item in top_items]

            print("************************************")
            print("User question:", question)
            print("Similar questions (Max=%s): %s" % (Complete_MayBeL4Max, Complete_MayBeL4))
            print("Feature scores (Max=%s): %s" % (Complete_MayBeL4Max, Complete_MayBeL4Score))
            print("ID:", Complete_MayBeL4ID)
            return "", "", "", "", "", ""
    except Exception as e:
        print("************************************")
        print("Error textSimilarity:", str(e))
        print("************************************")
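
The excerpt relies on `cosine_distance_nonzero` and on `feat_vec` being a list of tuples, neither of which is shown in this article. As a rough idea of the scoring and ranking step, here is a self-contained sketch. It assumes each tuple is an `(index, weight)` pair, and `cosine_similarity_sparse` and `top_matches` are hypothetical helpers, not the project's functions.

```python
import heapq
import math


def cosine_similarity_sparse(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (index, weight) tuples."""
    a, b = dict(vec_a), dict(vec_b)
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, candidates, k=3):
    """Return the k candidates with the highest similarity to the query."""
    scored = [
        {
            "score": cosine_similarity_sparse(c["feat_vec"], query_vec),
            "labelData": c["labelData"],
        }
        for c in candidates
    ]
    return heapq.nlargest(k, scored, key=lambda item: item["score"])


# Invented candidates for illustration only.
candidates = [
    {"labelData": "A", "feat_vec": [(0, 1.0), (1, 1.0)]},
    {"labelData": "B", "feat_vec": [(2, 1.0)]},
    {"labelData": "C", "feat_vec": [(0, 1.0)]},
]
query = [(0, 1.0), (1, 1.0)]
best = top_matches(query, candidates, k=2)
print([m["labelData"] for m in best])
```

`heapq.nlargest` avoids sorting the full list when only a few top candidates are needed; for small candidate lists, sorting and slicing works just as well.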

5. GitHub source code for this project

https://github.com/ShaShiDiZhuanLan/Demo_TFIDF_Simhash_Python
