ABSTRACT

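Complex queries over large RDF graphs require scalable query processing. Joins across partitions are expensive, so this paper proposes a new data partitioning method that exploits the rich structural information in RDF datasets to reduce inter-partition joins, and reports good results.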

INTRODUCTION

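RDF data keeps growing and now exceeds the processing capacity of a single machine.
RDF in table form can be viewed as a graph; see the example in Figure 1(a).
A SPARQL query is likewise modeled as a graph; see the example in Figure 1(b).
In a scale-out RDF data processing system, the RDF data is partitioned across different compute nodes. A good partitioning strategy is needed to reduce inter-partition joins.
The common partitioning method is hash partitioning. It allows queries that are split into star-shaped subqueries to be evaluated quickly in parallel, but joining the intermediate results is expensive.
"Scalable SPARQL querying of large RDF graphs" proposes partitioning the graph and replicating the n-hop neighborhood, so that a query can be decomposed into subqueries of length 2*n.
If the replicated part contains high-degree vertices, this causes heavy data skew. "Scaling queries over big rdf graphs with semantic hash partitioning" replaces the graph partitioning step with hashing to address the skew, but because of vertex expansion the data replication problem remains.
This paper considers the structure of both the RDF graph and the query graph. It introduces the notion of an end-to-end path, so that complex queries with long paths need not be decomposed into many subqueries. It also proposes a vertex merging technique that reduces redundancy and the number of inner-partition subqueries.
Contributions of this paper: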

We propose a new RDF data partitioning framework, which adopts the end-to-end path as the basic partition element and employs vertex merging to combine paths into partitions. We formally formulate the balanced path partitioning problem under this new framework and prove that the problem is NP-hard and APX-hard.

In view of the hardness of the problem, we introduce a new version of the problem with a relaxed balance constraint. Then we propose an approximate algorithm that provides a performance guarantee.

To enhance the efficiency, we also present two bottom-up path merging algorithms to partition the paths. The resulting data partitioning can localize many queries with complex structures, such as star, chain, cycle and tree, while maintaining low data duplication and data skewness.

We propose a partition-aware query decomposition method to decompose a complex SPARQL query to minimize the possible cross-node communication. Our data partitioning method allows a complex query to be decomposed into fewer subqueries and hence be evaluated more efficiently.

To perform a fair comparison with the state-of-the-art approaches [15], [19], we implement an experimental system by adopting an architecture similar to the one proposed in [15], [19], where each single-node RDF store is powered by RDF-3X [22] and cross-node communication is implemented on a Hadoop platform. We conduct an extensive experimental study on the LUBM [11] and SP2Bench [25] benchmarks as well as a large real-world RDF dataset, UniProt [6]. The results show that our method outperforms the previous approaches by up to two orders of magnitude, especially for complex queries.

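Hadoop-based RDF data systems store RDF as HDFS files and rely on vanilla Hadoop's file splitting and placement policies to distribute these files. Data partitioning algorithms and data localization strategies should be designed to lower I/O cost and communication overhead.
Hash partitioning is the most popular partitioning algorithm. It suits star queries but is inefficient for chain queries and other shapes. Graph-partitioning-based methods suffer from skewed data distribution and heavy data replication.
Another direction is dynamic run-time data partitioning. The algorithm in this paper can serve as the initial partitioning for such methods and improve their performance.
It can also be used in Trinity.RDF and TriAD to lower communication cost and improve performance.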

PRELIMINARIES

A. RDF Graph and SPARQL Query

Definition 1 (RDF Graph) G=(V,E,LE). A vertex with no incoming edges is a source vertex; a vertex with no outgoing edges is a sink vertex.

B. End-To-End Path

Definition 2 (End-to-End Path)
Let G be an RDF graph. A path v0 e1 v1 e2 v2 ... em vm is called an end-to-end path if it satisfies all the following conditions: (i) v0 is a source vertex, or one of the vertices in a directed cycle that does not contain any vertex with incoming edges from vertices outside of the cycle; (ii) vm is a sink vertex, or there exists a vertex vi in this path such that vm=vi. We call v0 the start vertex and vm the end vertex.

From here on, an end-to-end path is simply called a path.

Theorem 1
Given an RDF graph G, if it is decomposed into a set of paths according to Definition 2, then every vertex v and every edge (u,w) in G exist in at least one path.
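
A minimal Python sketch of how paths could be enumerated under Definition 2. Edge labels are dropped, and the helper names `start_vertices` and `end_to_end_paths` are illustrative, not taken from the paper. As a simplifying assumption, "a cycle with no incoming edges from outside" is approximated by a strongly connected component with no incoming edges, and its smallest vertex ID is used as the start vertex.

```python
from collections import defaultdict


def _reachable(adj, s):
    """Vertices reachable from s (including s) by following directed edges."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def start_vertices(vertices, edges):
    """Sources, plus the smallest vertex of every component without incoming edges."""
    adj = defaultdict(list)
    for u, w in edges:
        adj[u].append(w)
    reach = {v: _reachable(adj, v) for v in vertices}
    starts, done = set(), set()
    for v in sorted(vertices):
        if v in done:
            continue
        # Strongly connected component of v via mutual reachability (fine for a sketch).
        scc = {u for u in reach[v] if v in reach[u]}
        done |= scc
        has_outside_in_edge = any(w in scc and u not in scc for u, w in edges)
        if not has_outside_in_edge:
            starts.add(min(scc))
    return starts


def end_to_end_paths(vertices, edges):
    """Depth-first enumeration; a path ends at a sink or when it revisits a vertex."""
    adj = defaultdict(list)
    for u, w in edges:
        adj[u].append(w)
    paths = []

    def extend(path):
        for w in adj[path[-1]]:
            if w in path or not adj[w]:
                paths.append(path + [w])
            else:
                extend(path + [w])

    for s in sorted(start_vertices(vertices, edges)):
        if adj[s]:
            extend([s])
        else:
            paths.append([s])  # isolated vertex
    return paths


vertices = {"a", "b", "c", "d", "e"}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("c", "e")]
paths = end_to_end_paths(vertices, edges)
covered_edges = {e for p in paths for e in zip(p, p[1:])}
```

On this toy graph the two enumerated paths together cover every vertex and every edge, which is what Theorem 1 states.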

PROBLEM FORMULATION

A. Path Partitioning Model

Definition 3 (k-way Path Partitioning Plan)
Given an RDF graph G=(V,E), a k-way path partitioning plan P over G divides all the end-to-end paths of G into k nonempty and disjoint partitions {P1,...,Pk}, where Pi contains an exclusive subset of end-to-end paths.

Theorem 2
Given a path partitioning plan P, all queries that only contain S-O joins (i.e. chain and directed cycle queries) are inner-partition queries.

Definition 4 (Merged Vertex)
A vertex v is called merged if all paths that contain v are in the same partition.

B. Query Decomposition Model

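The query is decomposed into chain subqueries, i.e. subqueries that contain only S-O joins.
If the join between two subqueries goes through shared vertices, the two are combined into a single subquery where possible.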

Theorem 3
Given two inner-partition subqueries SQi and SQj that share a set of vertices Vi,j , the join between SQi and SQj can be evaluated locally, if there exists one vertex v∈Vi,j such that all matching vertices of v in the RDF graph are merged.
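In other words: two subqueries share a set of vertices, and if one of these shared vertices has all of its matches in the RDF graph merged, the two subqueries can be combined.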

C. Metrics for Path Partitioning

Balance

|L(Pi)| is the number of paths in partition Pi.

Data Duplication

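Some paths share common edges and vertices. If such paths are assigned to different partitions, the shared triples are duplicated. The duplication ratio is defined as:

$Dup(P) = \frac{1}{|E|} \sum_{e \in E} (|P(e)| - 1)$ (1)

where |P(e)| is the number of copies of e under the path partitioning plan P.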

Query-Efficiency

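V+ denotes the set of merged vertices. The more vertices are merged, the more subqueries can be combined, so query efficiency is measured by the number of merged vertices.

$QE(P) = |V^+|$ (2)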
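
The three metrics can be computed directly from a partitioning plan. The sketch below assumes a plan is a list of partitions, each a list of paths given as vertex lists, and that every edge of the graph appears in at least one path (Theorem 1), so |E| can be read off the plan. The function names are illustrative only.

```python
from collections import Counter


def path_edges(path):
    """Edges (u, w) traversed by a path given as a vertex list."""
    return set(zip(path, path[1:]))


def duplication(plan):
    """Dup(P): average number of extra copies per edge, Equation (1)."""
    copies = Counter()
    for partition in plan:
        partition_edges = set()
        for path in partition:
            partition_edges |= path_edges(path)
        copies.update(partition_edges)  # one copy of each edge per partition
    return sum(c - 1 for c in copies.values()) / len(copies)


def merged_vertices(plan):
    """V+: vertices whose containing paths all sit in one partition."""
    home, split = {}, set()
    for idx, partition in enumerate(plan):
        for path in partition:
            for v in path:
                if home.setdefault(v, idx) != idx:
                    split.add(v)
    return set(home) - split


plan = [
    [["a", "b", "c", "d", "b"]],  # partition P1
    [["a", "b", "c", "e"]],       # partition P2
]
balance = [len(partition) for partition in plan]  # |L(Pi)| for each partition
dup = duplication(plan)                           # edges a->b and b->c are stored twice
qe = len(merged_vertices(plan))                   # only d and e are merged
```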

D. Problem Statement

Theorem 4
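Dup(P) satisfies the following property:

$Dup(P) \leq \frac{(|V| - |V^+|)^2}{|E|} (k - 1)$

Theorem 4 shows that the more vertices are merged, the lower the duplication ratio. The task is therefore stated as the (k, 1)-balanced path partitioning problem, denoted the (k, 1)-BPP problem.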

Definition 5 ((k, 1)-BPP Problem)
Given an RDF graph G, find a k-way path partitioning plan P with the following objective functions:

$\text{Maximize } |V^+| \quad \text{s.t. } |L(P_i)| \leq \lceil \frac{n}{k} \rceil, \ 1 \leq i \leq k$

where n is the total number of end-to-end paths.

Theorem 5
The (k,1)-BPP problem is NP-hard and APX-hard.

APPROXIMATE ALGORITHM

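The (k,1)-BPP problem is hard to approximate. It is therefore relaxed to (k, 2)-BPP: find a k-way path partitioning plan that maximizes |V+| under the condition that each partition contains at most $\lceil \frac{2n}{k} \rceil$ paths.
Assuming every path has a fixed length l, an approximation algorithm is given for (k, 2)-BPP. It produces a k-way path partitioning plan in which no partition holds more than $\lceil \frac{2n}{k} \rceil$ paths and the number of merged vertices is at least $(1 - e^{-\frac{1}{kl}}) |V^{*+}|$.
The algorithm has two phases. The first phase creates as many merged vertices as possible while keeping the number of paths in every path group at or below $\lceil \frac{n}{k} \rceil$. The second phase distributes these path groups evenly over k partitions, so that no partition contains more than $\lceil \frac{2n}{k} \rceil$ paths.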

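First, the notation. $l_{ep}$ is the length of path ep. E(v) is the set of all paths passing through vertex v. Merging a vertex has a profit of 1, and the weight of merging vertex v is $w(v) = \sum_{ep \in E(v)} \frac{1}{l_{ep}}$. A larger weight means that merging the vertex produces a larger path group. The profit-to-weight ratio of merging vertex v is therefore $\frac{1}{w(v)}$.
The algorithm starts by generating all paths in the RDF graph. The key step is building the set of start vertices: first find all source vertices, then, for each cycle, add the vertex with the smallest ID to the start set. With the start set in hand, a depth-first search yields all paths.
Initially no vertex is merged, so each path group contains a single path (line 3 of Algorithm 1). The vertices in V are then examined, and vertices are merged by a greedy heuristic as long as the largest path group after the merge stays within the size limit.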
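
The two phases can be sketched with a union-find over path indices. This is an illustrative reading of the description above, not the paper's Algorithm 1: vertices are tried in increasing weight w(v), which is decreasing profit-to-weight ratio, and the second phase uses a simple "largest group to least-loaded partition" rule, which is an assumption. Path length is taken as the number of edges, and every path is assumed to have at least one edge.

```python
import math
from collections import defaultdict


def greedy_merge(paths, k):
    """Return (partitions of path indices, vertices merged in phase 1)."""
    n = len(paths)
    cap = math.ceil(n / k)          # phase-1 limit on the size of a path group
    parent = list(range(n))         # union-find over path indices
    size = [1] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    through = defaultdict(set)      # E(v): indices of the paths containing v
    for i, path in enumerate(paths):
        for v in path:
            through[v].add(i)

    def weight(v):                  # w(v) = sum over E(v) of 1 / path length
        return sum(1.0 / max(1, len(paths[i]) - 1) for i in through[v])

    merged = set()
    for v in sorted(through, key=lambda v: (weight(v), v)):
        roots = {find(i) for i in through[v]}
        if sum(size[r] for r in roots) <= cap:   # merging v keeps the group small enough
            first, *rest = sorted(roots)
            for r in rest:
                parent[r] = first
                size[first] += size[r]
            merged.add(v)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)

    # Phase 2 (assumed rule): place each group in the currently smallest partition.
    partitions = [[] for _ in range(k)]
    for group in sorted(groups.values(), key=len, reverse=True):
        min(partitions, key=len).extend(group)
    return partitions, merged


paths = [["a", "b", "c"], ["a", "b", "d"], ["e", "f", "c"], ["e", "f", "g"]]
partitions, merged = greedy_merge(paths, k=2)
```

In this example vertex c cannot be merged, because doing so would put all four paths into one group and exceed the limit of two paths per group.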

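Theorem 6:
Algorithm 1 achieves an approximation factor of $(1 - e^{-\frac{1}{kl}})$, i.e. $|V'^{+}| \geq (1 - e^{-\frac{1}{kl}}) |V^{*+}|$, where $V'^{+}$ is the set of merged vertices produced by Algorithm 1 and $V^{*+}$ is the set of merged vertices in the optimal solution of (k, 1)-BPP.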

BOTTOM-UP PATH MERGING ALGORITHM

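A. Merging Start Vertices

S-S joins and S-O joins are the most common, i.e. star, chain, cycle and tree queries.
The bottom-up path merging algorithm has two main steps: (1) merge all start vertices, so that every start vertex is in V+; (2) rank the remaining vertices by a newly designed measure and merge them in that order.
The first step has two benefits:
1. All star, chain, cycle and tree queries become inner-partition queries.
2. The space complexity is reduced.

If all start vertices are merged, then every star, chain, cycle and tree query can be evaluated as an inner-partition query (benefit 1).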

B. Vertex Weighting

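After all start vertices have been merged, the remaining vertices are ranked by their profit-to-weight ratio. The profit is still 1, but the weight differs from the previous algorithm: merging a vertex shared by many paths tends to produce a large path group, so w′(v)=Np(v), where Np(v) is the number of paths that contain v. Its value can be estimated as follows without generating all paths.

Theorem 8:
Given an RDF graph G=(V,E), Np(v)=Ip(v)×Op(v)
where Ip(v) is the number of paths from the start vertices to v (excluding cycles) and Op(v) is the number of paths from v to the end vertices.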

From the theorem above we obtain:

$I_p(v) = \sum_{u \in I(v)} I_p(u) \quad \text{and} \quad O_p(v) = \sum_{u \in O(v)} O_p(u)$ (4)

where I(v) denotes all in-neighbors of v and O(v) denotes all out-neighbors of v.
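
On an acyclic graph the recurrences in (4) can be evaluated exactly with memoization. The sketch below assumes the base cases Ip(v)=1 for vertices without in-neighbors and Op(v)=1 for vertices without out-neighbors, which the notes do not state explicitly. The function name `path_counts` is illustrative.

```python
from collections import defaultdict
from functools import lru_cache


def path_counts(edges):
    """Np(v) = Ip(v) * Op(v) for every vertex of a DAG given as (u, w) pairs."""
    preds, succs = defaultdict(list), defaultdict(list)
    vertices = set()
    for u, w in edges:
        succs[u].append(w)
        preds[w].append(u)
        vertices.update((u, w))

    @lru_cache(maxsize=None)
    def ip(v):  # number of paths from a start vertex to v
        return 1 if not preds[v] else sum(ip(u) for u in preds[v])

    @lru_cache(maxsize=None)
    def op(v):  # number of paths from v to an end vertex
        return 1 if not succs[v] else sum(op(u) for u in succs[v])

    return {v: ip(v) * op(v) for v in vertices}


# Paths: a-c-d, a-c-e, b-c-d, b-c-e; vertex c lies on all four of them.
counts = path_counts([("a", "c"), ("b", "c"), ("c", "d"), ("c", "e")])
```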
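Next we discuss how Ip(v) is computed; Op(v) is handled analogously.

$I_p(v)^{k} = (1 - \alpha) + \alpha \cdot \frac{\sum_{u \in I(v)} I_p(u)^{k-1}}{\sqrt{\sum_{u \in I(v)} (I_p(u)^{k-1})^2}}$ (5)

where $I_p(v)^{0} = 1$, the superscript k is the iteration index, α is the damping coefficient, and $\sqrt{\sum_{u \in I(v)} (I_p(u)^{k-1})^2}$ is used for normalization.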
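Let A denote the adjacency matrix of the RDF graph G and let $I_p$ denote the vector of estimated values for all vertices. Then

$I_p = (1 - \alpha) \cdot \mathbf{1} + \alpha \cdot \frac{A \cdot I_p}{\|A \cdot I_p\|_2}$ (6)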

Theorem 9: The iteration in Equation (6) converges.
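
A small fixed-point iteration in the vector form of Equation (6) might look as follows. It is a sketch under assumptions: A is oriented so that row v sums over the in-neighbors of v, and the value `alpha=0.85` and the iteration count are placeholders that do not come from the notes.

```python
import math


def estimate_ip(vertices, edges, alpha=0.85, iters=50):
    """Iterate Ip <- (1 - alpha) * 1 + alpha * (A @ Ip) / ||A @ Ip||_2."""
    preds = {v: [] for v in vertices}
    for u, w in edges:
        preds[w].append(u)              # u is an in-neighbor of w
    ip = {v: 1.0 for v in vertices}     # starting value 1 for every vertex
    for _ in range(iters):
        raw = {v: sum(ip[u] for u in preds[v]) for v in vertices}   # A @ Ip
        norm = math.sqrt(sum(x * x for x in raw.values())) or 1.0   # guard: no edges
        ip = {v: (1 - alpha) + alpha * raw[v] / norm for v in vertices}
    return ip


vertices = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("c", "d")]
scores = estimate_ip(vertices, edges)
```

The result is a ranking score rather than an exact path count: vertices without in-neighbors stay at 1 - alpha, and vertices fed by more paths score higher.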

C. Class-based Vertex Weighting

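This section targets RDF graphs that carry rdf:type labels and computes a weight for each vertex.
The class information provided by rdf:type allows a more effective query decomposition.
Combining two subqueries requires that one of their shared vertices v is a merged vertex. If v is a constant, it corresponds to exactly one vertex in the unpartitioned RDF graph, so it is easy to check whether it is merged. If v is a variable, we cannot know, without running the query, how many vertices match it or whether they are merged. But if we know that all vertices of the class v belongs to are merged, the two subqueries can be combined.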

Definition 6 (Class-based Vertex Weighting): Given an RDF graph G=(V,E), let C={T(v)|v∈V} be the set of classes of the vertices, where T(v) represents the class of v. We use the average weight score of all the vertices in class Ci (Ci∈C) as the weight of Ci, denoted as wclass(Ci).

All vertices v of the same class are then assigned the same weight, i.e.

$w_{class}(v) = w_{class}(T(v)), \ T(v) \in C$ (7)
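
Definition 6 and Equation (7) amount to a group-by average. A short sketch, with hypothetical vertex names, weights and classes:

```python
from collections import defaultdict


def class_based_weights(weight, vertex_class):
    """Give every vertex the average weight of the vertices in its class."""
    total, count = defaultdict(float), defaultdict(int)
    for v, w in weight.items():
        total[vertex_class[v]] += w
        count[vertex_class[v]] += 1
    class_weight = {c: total[c] / count[c] for c in total}   # wclass(Ci)
    return {v: class_weight[vertex_class[v]] for v in weight}  # Equation (7)


weight = {"prof1": 4.0, "prof2": 2.0, "course1": 6.0}
vertex_class = {"prof1": "Professor", "prof2": "Professor", "course1": "Course"}
w_class = class_based_weights(weight, vertex_class)
```

Because all vertices of a class share one weight, they sit next to each other in the merge order, which makes it more likely that a whole class ends up merged.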

D. The Complete Algorithm

Details of the bottom-up path merging algorithm.
The algorithm has two phases. In the first phase, Algorithm 2 finds, for each vertex v, its start vertices S(v).
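
One way to obtain S(v) is a traversal from every start vertex. The sketch below assumes that S(v) means the set of start vertices from which v is reachable, that is, the start vertices of the paths containing v. It is not the paper's Algorithm 2, and the function name is illustrative.

```python
from collections import defaultdict


def start_sets(starts, edges):
    """Map each vertex v to S(v), the start vertices that can reach v."""
    adj = defaultdict(list)
    for u, w in edges:
        adj[u].append(w)
    s_of = defaultdict(set)
    for s in starts:
        seen, stack = {s}, [s]
        while stack:                     # iterative depth-first traversal from s
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        for v in seen:
            s_of[v].add(s)
    return dict(s_of)


edges = [("a", "c"), ("b", "c"), ("c", "d"), ("b", "e")]
s_of = start_sets({"a", "b"}, edges)
```

Under this reading, once all paths sharing a start vertex are kept together, merging a vertex v only requires bringing together the groups of the start vertices in S(v), so the algorithm can work with start vertices instead of materialized paths.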

QUERY DECOMPOSITION

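A given query is decomposed into SQ={SQ1,...,SQm}, where each SQi contains one start vertex and the edges that belong to it. To further reduce the number of subqueries, subqueries are then combined. Two subqueries can be combined if their shared vertices include a constant that turns out to be a merged vertex, or if a shared vertex is a variable and all vertices of its class are merged.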
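
The combination rule can be sketched as follows. Subqueries are modeled as sets of vertex names, variables start with `?`, and the lookup structures (`merged`, `var_class`, `merged_classes`) are hypothetical inputs assumed to be available from the partitioning step. Repeatedly combining already combined subqueries is also an assumption of this sketch.

```python
def can_join_locally(sq_a, sq_b, merged, var_class, merged_classes):
    """True if some shared vertex is a merged constant or a variable of a fully merged class."""
    for v in sq_a & sq_b:
        if v.startswith("?"):
            if var_class.get(v) in merged_classes:
                return True
        elif v in merged:
            return True
    return False


def combine_subqueries(subqueries, merged, var_class, merged_classes):
    """Greedily combine subqueries until no pair can be joined locally."""
    groups = [set(sq) for sq in subqueries]
    changed = True
    while changed:
        changed = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if can_join_locally(groups[i], groups[j], merged, var_class, merged_classes):
                    groups[i] |= groups.pop(j)
                    changed = True
                    break
            if changed:
                break
    return groups


subqueries = [{"?x", "?y"}, {"?y", "?z"}, {"?z", "u1"}]
var_class = {"?y": "Professor", "?z": "Course"}
result = combine_subqueries(subqueries, merged=set(), var_class=var_class,
                            merged_classes={"Professor"})
```

Here the first two subqueries are combined through `?y`, while the third stays separate because the vertices of class `Course` are not all merged.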

EXPERIMENT

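20 machines, each with a dual-core 2.4 GHz CPU, 6 GB of RAM and a 500 GB hard disk.
Methods compared in the experiments:
Path-AX: the approximation algorithm of this paper
Path-BM: the bottom-up method of this paper with vertex weights
Path-BMC: the bottom-up method of this paper with class-based weights
[J. Huang, D. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123–1134.]: undirected 1-hop and undirected 2-hop.
[K. Lee and L. Liu. Scaling queries over big rdf graphs with semantic hash partitioning. PVLDB, 6(14):1894–1905, 2013.]: forward 2-hop.
Nine datasets are used, as listed in Table 1:

The queries are listed in Table 2:

Table 3 shows the differences between the path partitioning algorithms (the experiments below consider only Path-BMC):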

A. Partitioning Time, Data Balance and Duplication

B. Query Performance

C. Scalability

D. Parameter Analysis

CONCLUSION

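The paper analyzes the limitations of existing methods and proposes path partitioning, which turns more SPARQL queries into inner-partition queries. It also proposes several methods to keep the data distribution balanced and to minimize data duplication.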

"Scalable SPARQL Querying using Path Partitioning"