nvidia-rapids︱cuGraph: a NetworkX-like graph analytics library

The RAPIDS cuGraph library is a collection of graph analytics that process data found in GPU DataFrames (see cuDF). cuGraph aims to provide a NetworkX-like API that is familiar to data scientists, so they can now build GPU-accelerated workflows more easily.

Official documentation:
rapidsai/cugraph
cuGraph API Reference

Supported algorithms:

[Figure: table of graph algorithms supported by cuGraph]

Related articles:

nvidia-rapids︱cuDF: a pandas-like DataFrame library
NVIDIA's Python-GPU algorithm ecosystem︱RAPIDS 0.10
nvidia-rapids︱cuML: a machine learning acceleration library
nvidia-rapids︱cuGraph: a NetworkX-like graph analytics library



1 Installation and Background

1.1 Installation

Conda installation, https://github.com/rapidsai/cugraph:

# CUDA 10.0
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.0

# CUDA 10.1
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.1

# CUDA 10.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.2

For the Docker version, see https://rapids.ai/start.html#prerequisites:

docker pull rapidsai/rapidsai:cuda10.1-runtime-ubuntu16.04-py3.7
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai:cuda10.1-runtime-ubuntu16.04-py3.7
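
To verify the installation (inside the container or in the activated conda environment), a minimal check is to import the libraries and print their versions; this is only a quick sanity check, not part of the official setup:

import cudf
import cugraph

# If both imports succeed, the GPU DataFrame and graph libraries are available
print("cudf version:    " + cudf.__version__)
print("cugraph version: " + cugraph.__version__)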

1.2 Background

cuGraph has taken a new step towards integrating leading graph frameworks behind a single, easy-to-use interface. A few months ago, RAPIDS received a copy of Hornet from Georgia Tech, refactored it, and renamed it cuHornet. The name change indicates that the source code has diverged from the Georgia Tech baseline and that the code API and data structures now match RAPIDS cuGraph. The addition of cuHornet brings a frontier-based programming model, a dynamic data structure, and a list of existing analytics. Beyond the core number function, the first two cuHornet algorithms to be exposed are Katz centrality and K-Cores; a minimal usage sketch follows below.
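
The following is only a sketch, assuming the installed cuGraph release exposes cugraph.katz_centrality and cugraph.core_number through the standard Python API (the column names of the returned DataFrames may differ between releases):

import cudf
import cugraph

# Build a graph from an edge list, same pattern as the PageRank demo below
gdf = cudf.read_csv("graph_data.csv", names=["src", "dst"], dtype=["int32", "int32"])
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Katz centrality and core numbers, the first analytics contributed via cuHornet
katz_df = cugraph.katz_centrality(G)   # columns: 'vertex', 'katz_centrality'
core_df = cugraph.core_number(G)       # columns: 'vertex', 'core_number'
print(katz_df.head())
print(core_df.head())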

cuGraph is the graph analytics library of RAPIDS. For cuGraph, a multi-GPU PageRank algorithm has been released, backed by two new primitives: a multi-GPU COO-to-CSR data converter and a function for computing vertex degrees. These primitives are used to convert the source and destination edge columns from a Dask DataFrame into graph format, and they allow PageRank to scale across multiple GPUs.
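
The exact multi-GPU API has changed across RAPIDS releases; the sketch below follows the cugraph.dask / from_dask_cudf_edgelist interface of later releases and is meant only to illustrate the workflow, not the precise calls of any particular 0.x version:

import dask_cudf
import cugraph
import cugraph.dask as dask_cugraph
import cugraph.comms as Comms          # module path varies across cuGraph releases
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One Dask worker per GPU; cuGraph communications are initialized on top of the cluster
cluster = LocalCUDACluster()
client = Client(cluster)
Comms.initialize(p2p=True)

# The edge list is read as a distributed (Dask) cuDF DataFrame
e_list = dask_cudf.read_csv("edges.csv", names=["src", "dst"], dtype=["int32", "int32"])

# The COO edge list is converted and partitioned across GPUs behind this call
G = cugraph.Graph()
G.from_dask_cudf_edgelist(e_list, source="src", destination="dst")

# Multi-GPU PageRank; the result comes back as a distributed DataFrame
pr = dask_cugraph.pagerank(G)
print(pr.compute().head())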

The figure below shows the performance of the new multi-GPU PageRank algorithm. Unlike previous PageRank benchmark runtimes, which measured only the PageRank solver itself, these runtimes include the Dask DataFrame-to-CSR conversion, the PageRank execution, and the conversion of the results from CSR back into a DataFrame. On average, the new multi-GPU PageRank analysis is more than 10x faster than a 100-node Spark cluster.

Figure 1: Time taken by cuGraph PageRank to compute, for different numbers of edges, on NVIDIA Tesla V100

The figure below looks only at the Bigdata dataset (50 million vertices and 1.98 billion edges) and runs the HiBench end-to-end test. The HiBench benchmark runtime includes reading the data, running PageRank, and then retrieving the scores of all vertices. HiBench was previously tested on Google GCP with 10, 20, 50, and 100 nodes.

Figure 2: End-to-end PageRank runtime on the 50M-vertex HiBench dataset, cuGraph PageRank vs Spark Graph (lower is better)


2 A simple demo

Reference: https://github.com/rapidsai/cugraph

import cudf
import cugraph

# Read the edge list into a cuDF DataFrame (see cuDF's read_csv)
gdf = cudf.read_csv("graph_data.csv", names=["src", "dst"], dtype=["int32", "int32"])

# Create a Graph using the source (src) and destination (dst) vertex pairs in the DataFrame
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

# Call cugraph.pagerank to get the PageRank scores
gdf_page = cugraph.pagerank(G)

# Print the score of every vertex
for i in range(len(gdf_page)):
    print("vertex " + str(gdf_page['vertex'][i]) +
          " PageRank is " + str(gdf_page['pagerank'][i]))

3 PageRank

cugraph.pagerank(G,alpha=0.85, max_iter=100, tol=1.0e-5)

  • G: cugraph.Graph object
  • alpha: float, The damping factor, i.e. the probability of following an outgoing edge. Default is 0.85.
  • max_iter: int, The maximum number of iterations before an answer is returned. This can be used to limit the execution time and do an early exit before the solver reaches the convergence tolerance. If this value is lower than or equal to 0, cuGraph will use the default value, which is 100.
  • tol: float, The tolerance of the approximation; this parameter should be a value of small magnitude. The lower the tolerance, the better the approximation. If this value is 0.0f, cuGraph will use the default value, which is 0.00001. Setting the tolerance too small can lead to non-convergence due to numerical round-off. Values between 0.01 and 0.00001 are usually acceptable.

Returns:

  • df: a cudf.DataFrame object with two columns:
    • df['vertex']: The vertex identifier for the vertex
    • df['pagerank']: The PageRank score for the vertex
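
For example, assuming a cugraph.Graph G has already been built as in the demo above, the solver can be called with explicit parameters and the returned cudf.DataFrame inspected directly:

# Run PageRank with explicit parameters on an existing cugraph.Graph G
gdf_page = cugraph.pagerank(G, alpha=0.85, max_iter=100, tol=1.0e-5)

# One row per vertex; sort to see the highest-ranked vertices first
print(gdf_page.sort_values('pagerank', ascending=False).head())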

Installation:

# The notebook compares cuGraph to NetworkX,
# therefore some additional non-RAPIDS Python libraries need to be installed.
# Please run this cell if you need the additional libraries
!pip install networkx
!pip install scipy

Code modules:

# Import needed libraries
import cugraph
import cudf
from collections import OrderedDict


# NetworkX libraries
import networkx as nx
from scipy.io import mmread

# Parameters

# define the parameters 
max_iter = 100  # The maximum number of iterations
tol = 0.00001   # tolerance
alpha = 0.85    # alpha
# Define the path to the test data  
datafile='../data/karate-data.csv'

# NetworkX
# Read the data; this also creates a NetworkX Graph
file = open(datafile, 'rb')
Gnx = nx.read_edgelist(file)

pr_nx = nx.pagerank(Gnx, alpha=alpha, max_iter=max_iter, tol=tol)


The cuGraph version:

# cuGraph

# Read the data  
gdf = cudf.read_csv(datafile, names=["src", "dst"], delimiter='\t', dtype=["int32", "int32"] )

# create a Graph using the source (src) and destination (dst) vertex pairs from the Dataframe 
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')


# Call cugraph.pagerank to get the pagerank scores
gdf_page = cugraph.pagerank(G)


# Find the most important vertex using the scores
# This method should only be used for small graphs
bestScore = gdf_page['pagerank'][0]
bestVert = gdf_page['vertex'][0]

for i in range(len(gdf_page)):
    if gdf_page['pagerank'][i] > bestScore:
        bestScore = gdf_page['pagerank'][i]
        bestVert = gdf_page['vertex'][i]
        
print("Best vertex is " + str(bestVert) + " with score of " + str(bestScore))

# A better way to do this is to find the max and then use that value in a query
pr_max = gdf_page['pagerank'].max()


def print_pagerank_threshold(_df, t=0) :
    filtered = _df.query('pagerank >= @t')
    
    for i in range(len(filtered)):
        print("Best vertex is " + str(filtered['vertex'][i]) + 
            " with score of " + str(filtered['pagerank'][i]))              


print_pagerank_threshold(gdf_page, pr_max)
sort_pr = gdf_page.sort_values('pagerank', ascending=False)
d = G.degrees()
d.sort_values('out_degree', ascending=False).head(4)
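
Since the notebook compares cuGraph to NetworkX, a quick sanity check is to look at the top-ranked vertices from both sides; this short sketch only uses the objects already defined above (sort_pr and pr_nx):

# Top 5 vertices according to cuGraph
print(sort_pr.head(5))

# Top 5 vertices according to NetworkX (pr_nx is a dict of vertex -> score;
# note that read_edgelist gives string vertex labels, while cuGraph uses int32 IDs)
top_nx = sorted(pr_nx.items(), key=lambda kv: kv[1], reverse=True)[:5]
for v, score in top_nx:
    print("vertex " + str(v) + " PageRank is " + str(score))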

Results:
[Output: vertices sorted by PageRank score, and the vertices with the highest out-degree]
