對Twitter進行數據收集:
首先創建相應的文件,用於儲存讀取的數據
例:
os.path.join(os.path.expanduser("~"), "Data", "twitter")或者寫入數據
with open(file_name, 'w') as outf:
    json.dump(data, outf)  # 寫入json格式的文件(json.dump直接寫入文件對象;json.dumps返回字符串;寫入需用'w'模式打開)
保存訓練模型或使用訓練模型:
from sklearn.externals import joblib
joblib.dump(model_name, output_filename)  # 將模型model_name保存到output_filename文件中
model = joblib.load(output_filename)  # 從保存的文件載入相應模型
model.predict(train_x)
通過從tweet內容中選出與目標內容相關的
例:
[tweets[i] for i in range(len(tweets)) if y_predict[i] == 1]
#tweets中被預測分類爲1的全爲目標內容
獲取好友信息
直接構建獲取好友的函數
def get_friends(t, user_id):
    """Collect the friend ids of `user_id` through the Twitter API.

    Twitter pages cursor-managed results: cursor == -1 requests the first
    page and cursor == 0 means there are no more pages.

    Parameters
    ----------
    t : authenticated `twitter` API client (must expose `t.friends.ids`)
    user_id : id of the user whose friends are collected

    Returns
    -------
    list of friend ids (capped at roughly 10000, one page over at most)
    """
    friends = []
    cursor = -1  # -1 asks for the first page of results
    results = None  # BUG FIX: ensure the except-handler can test it even
                    # when the very first API call raises before assignment
    while cursor != 0:  # cursor 0 marks the last page
        try:
            results = t.friends.ids(user_id=user_id, cursor=cursor, count=5000)
            friends.extend(results['ids'])
            cursor = results['next_cursor']  # next page; becomes 0 when exhausted
            if len(friends) >= 10000:
                break  # cap how many friends we collect per user
            if cursor != 0:
                print("Collected {} friends so far, but there are more".format(len(friends)))
                # BUG FIX: original wrote `sys.stdout.flush` without parentheses,
                # which never actually flushed anything
                sys.stdout.flush()
        except TypeError:
            # The wrapper returns None when the rate limit is hit, and
            # subscripting None raises TypeError: wait 5 minutes and retry.
            if results is None:
                print("You probably reached your API limit, waiting for 5 minutes")
                # Flush as we go instead of letting output buffer to the end
                sys.stdout.flush()
                time.sleep(5 * 60)  # 5 minute wait
            else:
                raise  # a genuine TypeError elsewhere — do not swallow it
        except twitter.TwitterHTTPError:
            break  # e.g. protected/suspended account: keep what we have
        finally:
            time.sleep(60)  # Wait 1 minute between calls to respect rate limits
    return friends
sys.stdout.flush()#循環過程中習慣性緩衝下
構建網絡進行說明
從最初得到的相關人士進行遍歷,新建friends={},鍵爲user_id,值爲好友id
# Keep only the users whose friend list came back non-empty.
friends = {user_id: friends[user_id] for user_id in friends
           if len(friends[user_id]) > 0}
由於相關用戶太少,所以從現有用戶好友中選取關係網最大,最密集的人
所以先統計好友數量
def count_friends(friends):
    """Tally, for every friend id, how many tracked users follow it.

    `friends` maps user_id -> iterable of friend ids; the returned
    defaultdict(int) maps friend id -> number of occurrences.
    """
    tally = defaultdict(int)
    # Flatten all friend lists and count each id as it streams past.
    for friend_id in (fid for friend_list in friends.values() for fid in friend_list):
        tally[friend_id] += 1
    return tally
通過計算關係最大的好友並對其進行排序sorted()
#遍歷直到朋友大於150人
# Keep crawling until we hold friend lists for at least 150 users.
while len(friends) < 150:
    # Pick the best-connected user we have not crawled yet.
    # NOTE(review): id 467407284 is explicitly skipped — presumably a known
    # unwanted account; confirm why before removing the check.
    for user_id, count in best_friends:
        if user_id not in friends and str(user_id) != '467407284':
            break
    print("Getting friends of user {}".format(user_id))
    sys.stdout.flush()
    friends[user_id] = get_friends(t, user_id)
    print("Received {} friends".format(len(friends[user_id])))
    print("We now have the friends of {} users".format(len(friends)))
    sys.stdout.flush()  # flush so progress shows immediately, not at the end
    # Update friend_count with the newly collected friends
    for friend in friends[user_id]:
        friend_count[friend] += 1
    # Update the best friends list, most-followed first
    best_friends = sorted(friend_count.items(), key=itemgetter(1), reverse=True)
ps:python中的字典與json格式可以輕鬆轉換創建關係網絡圖
pip install networkx
import networkx as nx

G = nx.DiGraph()  # directed graph: "A follows B" is one-way
main_users = friends.keys()
G.add_nodes_from(main_users)  # one vertex per crawled user
for user_id in friends:
    for friend in friends[user_id]:
        # only add edges between users we actually crawled
        if friend in main_users:
            G.add_edge(user_id, friend)  # edge: user_id follows friend
G  # notebook-style echo of the graph object; no effect in a plain script
nx.draw(G)  # quick matplotlib rendering of the network
可以將圖設置爲長方形的圖
nx.draw(G, alpha=0.1, edge_color='b', node_color='g', node_size=2000)
創建用戶關係圖得用到傑克卡得相似係數
def compute_similarity(friends1, friends2):
    """Jaccard similarity coefficient of two friend sets.

    Both arguments must be sets (convert the stored friend lists with set()
    first). Returns |intersection| / |union|, a value in [0, 1].

    Robustness fix: when both sets are empty the original raised
    ZeroDivisionError; we now define that case as 0.0.
    """
    union = friends1 | friends2  # | is set union, & below is intersection
    if not union:
        return 0.0  # two empty sets: no evidence of similarity
    return len(friends1 & friends2) / len(union)
#畫用戶相似圖
def create_graph(followers, threshold=0):
    """Build an undirected user-similarity graph.

    Parameters
    ----------
    followers : dict mapping user_id -> set of friend ids
    threshold : minimum Jaccard similarity required to create an edge

    Returns
    -------
    networkx.Graph whose edges carry a 'weight' attribute (the similarity).
    """
    G = nx.Graph()
    # BUG FIX: the original body iterated the *global* `friends` instead of
    # the `followers` parameter, silently ignoring the argument.
    for user1 in followers:
        for user2 in followers:
            if user1 == user2:
                continue  # no self-loops
            weight = compute_similarity(followers[user1], followers[user2])
            if weight >= threshold:  # keep only sufficiently similar pairs
                G.add_node(user1)
                G.add_node(user2)
                G.add_edge(user1, user2, weight=weight)
    return G
利用spring_layout將關係圖展示的好看些;
具體用法如下:
pos = nx.spring_layout(G)  # force-directed layout: compute a position per vertex
nx.draw_networkx_nodes(G, pos)  # place the vertices at those positions
edgewidth = [ d['weight'] for (u,v,d) in G.edges(data=True)]  # one width per edge, taken from its weight
nx.draw_networkx_edges(G, pos, width=edgewidth)  # draw edges; thicker = more similar
尋找子圖:
類似於聚類-
sub_graphs=nx.connected_component_subgraphs(G)#尋找圖中的連通分支,sub_graphs爲生成器
nx.draw(list(sub_graphs)[index])#畫相應index中的連通分支圖
fig=plt.figure(figsize=(寬, 高))#figsize參數順序爲(寬度, 高度),單位是英寸
fig.add_subplot()#對畫的圖確定好位置
#silhouette_score爲計算總輪廓係數,它的參數爲關係圖中各頂點間的相似值Weight以及連通子圖的標籤
from sklearn.metrics import silhouette_score
def compute_silhouette(threshold, friends):
    """Score the clustering that `create_graph` produces at a threshold.

    Builds the similarity graph, treats each connected component as one
    cluster, and returns sklearn's silhouette score over the graph's sparse
    weight matrix. Returns the sentinel -99 whenever the score is undefined
    (fewer than 2 nodes, or a degenerate number of components).
    """
    G = create_graph(friends, threshold=threshold)
    if len(G.nodes()) < 2:
        return -99  # silhouette needs at least two samples
    # silhouette also requires 2 <= n_clusters <= n_samples - 1
    if not (2 <= nx.number_connected_components(G) < len(G.nodes()) - 1):
        return -99
    # connected_components yields one set of node ids per component;
    # label every node with the index of the component it belongs to.
    # (Debug prints of each component removed from the original.)
    label_dict = {}
    for i, component in enumerate(nx.connected_components(G)):
        for node in component:
            label_dict[node] = i
    labels = np.array([label_dict[node] for node in G.nodes()])
    X = nx.to_scipy_sparse_matrix(G)
    # NOTE(review): X holds *similarity* weights, but metric='precomputed'
    # expects a *distance* matrix; the commented-out `X = 1 - X` in the
    # original suggests this was known — confirm which is intended.
    return silhouette_score(X, labels, metric='precomputed')