Automatically Building a Knowledge Graph with ChatGPT

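1. Overview

This article explores how to build a knowledge graph from raw text with OpenAI's gpt-3.5-turbo, in the context of using LLM and RAG techniques for text generation, question answering, and efficient extraction of domain-specific knowledge to obtain valuable insights. Before starting, we need to clarify a few key concepts.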

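2. Content

2.1 What Is a Knowledge Graph?

A knowledge graph is a semantic network that represents and connects real-world entities such as people, organizations, objects, events, and concepts. It is made up of triples of the form "head entity → relation → tail entity" or, in Semantic Web terminology, "subject → predicate → object", which are used to extract and analyze complex relationships between entities. A knowledge graph usually includes an ontology that defines concepts, relations, and their properties; the ontology serves as a formal specification of the concepts and relations in the target domain and gives the network its semantics. Automated agents such as search engines use ontologies to understand web page content so that they can index and display it correctly.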
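
As a minimal sketch of the triple structure described above, each fact can be held in a small Python record. The `Triple` class and the sample values are illustrative assumptions and are not part of the pipeline built later in this article.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One knowledge-graph fact: head -> relation -> tail (hypothetical helper)."""
    head: str
    relation: str
    tail: str


# Illustrative facts; the values are made up for this sketch.
triples = [
    Triple("Wall Sticker", "hasColor", "White"),
    Triple("Wall Sticker", "madeOfMaterial", "PE foam"),
    Triple("Desk Lamp", "hasColor", "White"),
]

# Group head entities by the (relation, tail) pair they point to.
heads_by_tail = {}
for t in triples:
    heads_by_tail.setdefault((t.relation, t.tail), set()).add(t.head)

# Keep only the pairs that connect more than one head entity.
shared = {key: heads for key, heads in heads_by_tail.items() if len(heads) > 1}
```

Two heads that share a tail, such as the two white products here, become connected through that tail node once the triples are loaded into a graph.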

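2.2 Example

2.2.1 Preparing Dependencies

We use OpenAI's gpt-3.5-turbo to create a knowledge graph from the product descriptions in a product dataset. The Python dependencies are as follows:

pip install pandas openai sentence-transformers networkx matplotlib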

2.2.2 Reading the Data

Load the dataset with the following code:

import json
import logging
import matplotlib.pyplot as plt
import networkx as nx
from networkx import connected_components
from openai import OpenAI
import pandas as pd
from sentence_transformers import SentenceTransformer, util
data = pd.read_csv("products.csv")

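The dataset contains the columns "PRODUCT_ID", "TITLE", "BULLET_POINTS", "DESCRIPTION", "PRODUCT_TYPE_ID", and "PRODUCT_LENGTH". We merge the "TITLE", "BULLET_POINTS", and "DESCRIPTION" columns into a single "text" column, which holds the product specification that we prompt ChatGPT to extract entities and relations from.

The code is as follows:

# fillna('') keeps one missing column from turning the combined text into NaN; the space keeps words apart
data['text'] = data[['TITLE', 'BULLET_POINTS', 'DESCRIPTION']].fillna('').astype(str).agg(' '.join, axis=1)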

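2.2.3 Information Extraction

We instruct ChatGPT to extract entities and relations from the provided product specification and to return the result as an array of JSON objects. Each JSON object must contain the following keys: 'head', 'head_type', 'relation', 'tail', and 'tail_type'.

The 'head' key must contain the text of the extracted entity, whose type is one of the types in the list provided in the user prompt. The 'head_type' key must contain the type of the extracted head entity, taken from the list provided by the user. The 'relation' key must contain the type of relation between 'head' and 'tail'. The 'tail' key must hold the text of the extracted entity that is the object of the triple, and the 'tail_type' key must contain the type of the tail entity.

We prompt ChatGPT for entity-relation extraction using the entity types and relation types listed below. We map these entities and relations to the corresponding entities and relations in the Schema.org ontology. The keys of each mapping are the entity and relation types given to ChatGPT, and the values are the URLs of the corresponding Schema.org objects and properties.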

# ENTITY TYPES:
entity_types = {
  "product": "https://schema.org/Product", 
  "rating": "https://schema.org/AggregateRating",
  "price": "https://schema.org/Offer", 
  "characteristic": "https://schema.org/PropertyValue", 
  "material": "https://schema.org/Text",
  "manufacturer": "https://schema.org/Organization", 
  "brand": "https://schema.org/Brand", 
  "measurement": "https://schema.org/QuantitativeValue", 
  "organization": "https://schema.org/Organization",  
  "color": "https://schema.org/Text",
}

# RELATION TYPES:
relation_types = {
  "hasCharacteristic": "https://schema.org/additionalProperty",
  "hasColor": "https://schema.org/color",
  "hasBrand": "https://schema.org/brand",
  "isProducedBy": "https://schema.org/manufacturer",
  "hasMeasurement": "https://schema.org/hasMeasurement", 
  "isSimilarTo": "https://schema.org/isSimilarTo", 
  "madeOfMaterial": "https://schema.org/material", 
  "hasPrice": "https://schema.org/offers", 
  "hasRating": "https://schema.org/aggregateRating", 
  "relatedTo": "https://schema.org/isRelatedTo"
 }
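
The mappings above can be applied to an extracted triple to replace the short type names with their Schema.org URLs. The following is a minimal sketch: `to_schema_org` is a hypothetical helper, and the two `sample_*` dictionaries are small excerpts of the mappings above so that the snippet runs on its own.

```python
# Excerpts of the two mappings, copied here so the sketch is self-contained.
sample_entity_types = {
    "product": "https://schema.org/Product",
    "color": "https://schema.org/Text",
}
sample_relation_types = {
    "hasColor": "https://schema.org/color",
}


def to_schema_org(triple, entity_map, relation_map):
    """Return a copy of the triple with type and relation names swapped for URLs.

    Hypothetical helper: names missing from the mappings become None
    instead of raising an error.
    """
    return {
        "head": triple["head"],
        "head_type": entity_map.get(triple["head_type"]),
        "relation": relation_map.get(triple["relation"]),
        "tail": triple["tail"],
        "tail_type": entity_map.get(triple["tail_type"]),
    }


example = {
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasColor",
    "tail": "White",
    "tail_type": "color",
}
mapped = to_schema_org(example, sample_entity_types, sample_relation_types)
```

Returning None for unknown names is one possible design choice; it makes types that the model invented easy to spot and filter later.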

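To use ChatGPT for information extraction, we create an OpenAI client and use the Chat Completions API to generate an output array of JSON objects, one for each identified relation. We choose gpt-3.5-turbo as the default model because its performance is sufficient for this simple demonstration.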

client = OpenAI(api_key="<YOUR_API_KEY>")

Define the extraction function:

def extract_information(text, model="gpt-3.5-turbo"):
    # Send the system prompt and the filled-in user prompt to the chat completions API.
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_prompt.format(
                    entity_types=entity_types,
                    relation_types=relation_types,
                    specification=text
                )
            }
        ]
    )

    # The reply text is expected to be a JSON array of triples.
    return completion.choices[0].message.content

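2.2.4 Writing the Prompt

The system_prompt variable contains the instructions that direct ChatGPT to extract entities and relations from raw text and to return the result as an array of JSON objects, each containing the following keys: 'head', 'head_type', 'relation', 'tail', and 'tail_type'.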

system_prompt = """You are an expert agent specialized in analyzing product specifications in an online retail store.
Your task is to identify the entities and relations requested with the user prompt, from a given product specification.
You must generate the output in a JSON containing a list with JSON objects having the following keys: "head", "head_type", "relation", "tail", and "tail_type".
The "head" key must contain the text of the extracted entity with one of the types from the provided list in the user prompt, the "head_type"
key must contain the type of the extracted head entity which must be one of the types from the provided user list,
the "relation" key must contain the type of relation between the "head" and the "tail", the "tail" key must represent the text of an
extracted entity which is the tail of the relation, and the "tail_type" key must contain the type of the tail entity. Attempt to extract as
many entities and relations as you can.
"""

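The user_prompt variable contains one example of the desired output for a single specification from the dataset and prompts ChatGPT to extract entities and relations from the provided specification in the same way. This is an example of one-shot learning with ChatGPT.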

user_prompt = """Based on the following example, extract entities and relations from the provided text.
Use the following entity types:

# ENTITY TYPES:
{entity_types}

Use the following relation types:
{relation_types}

--> Beginning of example

# Specification
"YUVORA 3D Brick Wall Stickers | PE Foam Fancy Wallpaper for Walls,
 Waterproof & Self Adhesive, White Color 3D Latest Unique Design Wallpaper for Home (70*70 CMT) -40 Tiles
 [Made of soft PE foam,Anti Children's Collision,take care of your family.Waterproof, moist-proof and sound insulated. Easy clean and maintenance with wet cloth,economic wall covering material.,Self adhesive peel and stick wallpaper,Easy paste And removement .Easy To cut DIY the shape according to your room area,The embossed 3d wall sticker offers stunning visual impact. the tiles are light, water proof, anti-collision, they can be installed in minutes over a clean and sleek surface without any mess or specialized tools, and never crack with time.,Peel and stick 3d wallpaper is also an economic wall covering material, they will remain on your walls for as long as you wish them to be. The tiles can also be easily installed directly over existing panels or smooth surface.,Usage range: Featured walls,Kitchen,bedroom,living room, dinning room,TV walls,sofa background,office wall decoration,etc. Don't use in shower and rugged wall surface]
Provide high quality foam 3D wall panels self adhesive peel and stick wallpaper, made of soft PE foam,children's collision, waterproof, moist-proof and sound insulated,easy cleaning and maintenance with wet cloth,economic wall covering material, the material of 3D foam wallpaper is SAFE, easy to paste and remove . Easy to cut DIY the shape according to your decor area. Offers best quality products. This wallpaper we are is a real wallpaper with factory done self adhesive backing. You would be glad that you it. Product features High-density foaming technology Total Three production processes Can be use of up to 10 years Surface Treatment: 3D Deep Embossing Damask Pattern."

################

# Output
[
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "isProducedBy",
    "tail": "YUVORA",
    "tail_type": "manufacturer"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasCharacteristic",
    "tail": "Waterproof",
    "tail_type": "characteristic"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasCharacteristic",
    "tail": "Self Adhesive",
    "tail_type": "characteristic"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasColor",
    "tail": "White",
    "tail_type": "color"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasMeasurement",
    "tail": "70*70 CMT",
    "tail_type": "measurement"
  }},
  {{
    "head": "YUVORA 3D Brick Wall Stickers",
    "head_type": "product",
    "relation": "hasMeasurement",
    "tail": "40 tiles",
    "tail_type": "measurement"
  }}
]

--> End of example

For the following specification, extract entities and relations as in the provided example.

# Specification
{specification}
################

# Output

"""
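
The example output in user_prompt uses doubled braces because the template is filled in with str.format, which treats single braces as placeholders. The sketch below shows the effect on a hypothetical miniature template; `template` and its contents are assumptions made for illustration and are much shorter than the real prompt.

```python
import json

# Hypothetical miniature of the prompt template, using the same brace escaping.
template = """# Output
[
  {{
    "head": "example product",
    "relation": "hasColor",
    "tail": "White"
  }}
]

# Specification
{specification}
"""

prompt = template.format(specification="Blue cotton shirt, size M")

# After formatting, the doubled braces have become single braces,
# so the example section of the prompt is valid JSON again.
example_json = prompt.split("# Specification")[0].replace("# Output", "").strip()
parsed = json.loads(example_json)
```

Values passed to str.format are inserted as they are and are not scanned for placeholders, so braces inside a product specification do not need to be escaped.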

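Now we call the extract_information function for every specification in the dataset and build a list of all extracted triples, which represents our knowledge graph. For this demonstration, we generate the knowledge graph from a subset of only 100 product specifications.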

kg = []
for content in data['text'].values[:100]:
  try:
    extracted_relations = extract_information(content)
    extracted_relations = json.loads(extracted_relations)
    kg.extend(extracted_relations)
  except Exception as e:
    logging.error(e)

kg_relations = pd.DataFrame(kg)
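
The loop above accepts whatever JSON the model returns. The following sketch shows one way to keep only well-formed triples before they enter the graph, assuming the five-key format described earlier. `parse_triples` and `REQUIRED_KEYS` are hypothetical names, and the sample output is made up.

```python
import json

REQUIRED_KEYS = {"head", "head_type", "relation", "tail", "tail_type"}


def parse_triples(raw, allowed_entity_types, allowed_relation_types):
    """Parse model output and keep only well-formed triples (hypothetical helper)."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    valid = []
    for item in items:
        # Every triple must be an object that has all five keys.
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue
        # Every value must be a string.
        if not all(isinstance(item[key], str) for key in REQUIRED_KEYS):
            continue
        # Types and relations must come from the lists given in the prompt.
        if item["head_type"] not in allowed_entity_types:
            continue
        if item["tail_type"] not in allowed_entity_types:
            continue
        if item["relation"] not in allowed_relation_types:
            continue
        valid.append(item)
    return valid


# Made-up model output: one valid triple, one with unknown types, one incomplete.
raw_output = json.dumps([
    {"head": "Lamp", "head_type": "product", "relation": "hasColor",
     "tail": "White", "tail_type": "color"},
    {"head": "Lamp", "head_type": "product", "relation": "inventedBy",
     "tail": "Nobody", "tail_type": "person"},
    {"head": "Lamp"},
])
clean = parse_triples(raw_output, {"product", "color"}, {"hasColor"})
```

In the pipeline, the keys of entity_types and relation_types could be passed as the two allowed collections.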

The results of the information extraction are shown in the figure below.

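2.2.5 Entity Resolution

Entity resolution (ER) is the process of disambiguating entities that correspond to the same real-world concept. Here we attempt basic entity resolution on the head and tail entities in the dataset, so that the entities present in the text have a more concise representation.

We perform entity resolution with NLP techniques. More specifically, we use the sentence-transformers library to create an embedding for each head entity and compute the cosine similarity between head entities.

We create the embeddings with the 'all-MiniLM-L6-v2' sentence transformer, because it is a fast and relatively accurate model that suits this use case. For each pair of head entities, we check whether the similarity is greater than 0.95; if it is, we treat the two as the same entity and normalize their text values to be equal. The same applies to tail entities.

This process gives results such as the following: if we have two entities, one with the value 'Microsoft' and the other 'Microsoft Inc.', the two are merged into one.

We load the embedding model and use it to compute the similarity between the first and second head entities as follows.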

heads = kg_relations['head'].values
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(heads)
similarity = util.cos_sim(embeddings[0], embeddings[1])
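
The snippet above compares only the first two head entities. The sketch below shows how the pairwise check with the 0.95 threshold could be applied to all entities, under some assumptions: `resolve_entities` is a hypothetical helper, cosine similarity is computed in plain Python instead of util.cos_sim so that the example runs without downloading a model, the vectors are toy values, and names are merged transitively with the first name seen in each group kept as the canonical one.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def resolve_entities(names, vectors, threshold=0.95):
    """Map every name to the canonical name of its similarity group."""
    vec = dict(zip(names, vectors))   # one vector per distinct name
    unique = list(vec)                # distinct names in first-seen order
    order = {name: i for i, name in enumerate(unique)}
    parent = {name: name for name in unique}

    def find(name):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if cosine(vec[a], vec[b]) > threshold:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    # Attach the later root to the earlier one so the first name wins.
                    first, later = sorted((root_a, root_b), key=order.get)
                    parent[later] = first
    return {name: find(name) for name in unique}


# Toy two-dimensional vectors standing in for real sentence embeddings.
names = ["Microsoft", "Microsoft Inc.", "YUVORA"]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
mapping = resolve_entities(names, vectors)
```

With real data, the vectors would come from embedding_model.encode on the distinct entity names, and the resulting mapping could be applied to the 'head' and 'tail' columns of kg_relations. The pairwise loop is quadratic in the number of distinct names, which is acceptable for a 100-product sample but would need a faster search for large graphs.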

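To visualize the knowledge graph extracted after entity resolution, we use Python's networkx library. First we create an empty graph, and then we add each extracted relation to it.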

G = nx.Graph()
for _, row in kg_relations.iterrows():
  G.add_edge(row['head'], row['tail'], label=row['relation'])

To draw the graph, we can use the following code:

pos = nx.spring_layout(G, seed=47, k=0.9)
labels = nx.get_edge_attributes(G, 'label')
plt.figure(figsize=(15, 15))
nx.draw(G, pos, with_labels=True, font_size=10, node_size=700, node_color='lightblue', edge_color='gray', alpha=0.6)
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=8, label_pos=0.3, verticalalignment='baseline')
plt.title('Product Knowledge Graph')
plt.show()
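
A graph built from 100 specifications can be too dense to read in one plot. One possible way to obtain a smaller view, sketched below, is to draw a single connected component. `G_demo` is a toy graph with made-up nodes that stands in for G, and picking the largest component is an assumption rather than the selection used for the figure in this article.

```python
import networkx as nx

# Toy graph standing in for G; the node and edge values are made up.
G_demo = nx.Graph()
G_demo.add_edge("Wall Sticker", "White", label="hasColor")
G_demo.add_edge("Desk Lamp", "White", label="hasColor")
G_demo.add_edge("Desk Lamp", "Metal", label="madeOfMaterial")
G_demo.add_edge("Mug", "Ceramic", label="madeOfMaterial")

# connected_components yields one set of nodes per component; take the largest.
largest = max(nx.connected_components(G_demo), key=len)

# Copy the induced subgraph so it can be drawn or modified independently.
subgraph = G_demo.subgraph(largest).copy()
```

The resulting subgraph can be passed to the drawing code above in place of G.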

A subgraph of the generated knowledge graph is shown in the figure below:

 

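We can see that, in this way, multiple different products can be connected through shared characteristics. This is useful for learning the attributes that products have in common, standardizing product specifications, describing web resources with a common schema such as Schema.org, and even making product recommendations based on product specifications.

3. Summary

Most companies have large amounts of unused unstructured data stored in data lakes. Building a knowledge graph to extract insights from this unused data helps obtain information from unprocessed, unstructured text corpora and use it to make better-informed decisions.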
