Top-k graph search via hierarchical representative matrices

"""
time: 2020.5.26
reference:
1. https://blog.csdn.net/qq_37475168/article/details/103616443
2. <<top k graph similarity>>
"""
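Auxiliary lemma:
Suppose we have the following matrix,
$\begin{bmatrix} 13 & 7 & 4 & 16 \\ 12 & 14 & 2 & 12 \end{bmatrix}$. The claim is that the sum of the column-wise minima is less than or equal to the sum, over the blocks of a partition, of each block's minimum row sum. For example, the minima of columns 1, 2, 3, 4 are 12, 7, 2, 12, and 12 + 7 + 2 + 12 = 33. Now split the matrix into two blocks,
$\begin{bmatrix} 13 & 7 \\ 12 & 14 \end{bmatrix} \begin{bmatrix} 4 & 16 \\ 2 & 12 \end{bmatrix}$. In the first block the sums of rows 1 and 2 are 20 and 26, so its minimum row sum is 20. In the second block they are 20 and 14, so its minimum row sum is 14. The total is 20 + 14 = 34, and 33 ≤ 34. Taking the maximum row sum of each block instead (26 + 20 = 46) also gives a valid but looser upper bound, and this maximum is what the compressed matrix $W^{1}$ in the example below contains.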
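A quick numeric check of the lemma on the matrix above. This is only a sketch: the function names and the `pick` parameter are made up for illustration. It computes the exact sum of column minima, the bound from the smaller row sum of each block, and the looser bound from the larger row sum.

```python
def sum_of_column_minima(row_a, row_b):
    # Exact value: add up the smaller entry of every column.
    return sum(min(a, b) for a, b in zip(row_a, row_b))


def blocked_bound(row_a, row_b, width, pick=min):
    # Split the columns into blocks of `width` columns, sum each row inside
    # the block, and keep one of the two row sums per block (min or max).
    total = 0
    for start in range(0, len(row_a), width):
        block_sums = (sum(row_a[start:start + width]),
                      sum(row_b[start:start + width]))
        total += pick(block_sums)
    return total


row_a = [13, 7, 4, 16]
row_b = [12, 14, 2, 12]

exact = sum_of_column_minima(row_a, row_b)        # 12 + 7 + 2 + 12
tight = blocked_bound(row_a, row_b, 2, pick=min)  # 20 + 14
loose = blocked_bound(row_a, row_b, 2, pick=max)  # 26 + 20
print(exact, tight, loose)  # 33 34 46
```

Both bounds dominate the exact value; the maximum-based one is the looser of the two.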

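The conclusion above can be applied when the hierarchical matrix is shrunk level by level.
Using the auxiliary lemma:
Suppose a graph set g1, g2, g3, g4 has the following feature matrix $W^{0}$,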
$W^{0}= \begin{bmatrix} 1 & 2 & 9 & 4 & 7 & 1 \\ 2 & 3 & 2 & 2 & 7 & 6 \\ 3 & 3 & 2 & 1 & 1 & 9 \\ 4 & 6 & 5 & 1 & 2 & 1 \end{bmatrix}$, $v^{0} = \begin{bmatrix} 1 & 1 & 3 & 8 & 1 & 1 \end{bmatrix}$, where $v^{0}$ is the query vector. Following the paper's block partition (here 2 x 2 blocks), $W^{0}$ splits into the following 6 blocks:
$\begin{bmatrix} 1 & 2 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 9 & 4 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} 7 & 1 \\ 7 & 6 \end{bmatrix}$
$\begin{bmatrix} 3 & 3 \\ 4 & 6 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 5 & 1 \end{bmatrix} \begin{bmatrix} 1 & 9 \\ 2 & 1 \end{bmatrix}$
Following the "matrix compression" idea, each block is replaced by its largest row sum and each pair of query entries by its sum, which gives $W^{1}=\begin{bmatrix} 5 & 13 & 13 \\ 10 & 6 & 10 \end{bmatrix}$, $V^{1}=\begin{bmatrix} 2 & 11 & 2 \end{bmatrix}$.
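The compression step can be sketched as follows. The rule used here (largest row sum per block for the matrix, plain sum per column block for the query vector) is inferred from the numbers in the example, and the function names are hypothetical.

```python
def compress_matrix(matrix, R, C):
    # Replace every R x C block by the largest row sum found inside it.
    compressed = []
    for r0 in range(0, len(matrix), R):
        out_row = []
        for c0 in range(0, len(matrix[0]), C):
            row_sums = [sum(row[c0:c0 + C]) for row in matrix[r0:r0 + R]]
            out_row.append(max(row_sums))
        compressed.append(out_row)
    return compressed


def compress_query(vector, C):
    # Replace every group of C consecutive entries by their sum.
    return [sum(vector[c0:c0 + C]) for c0 in range(0, len(vector), C)]


W0 = [
    [1, 2, 9, 4, 7, 1],
    [2, 3, 2, 2, 7, 6],
    [3, 3, 2, 1, 1, 9],
    [4, 6, 5, 1, 2, 1],
]
v0 = [1, 1, 3, 8, 1, 1]

W1 = compress_matrix(W0, R=2, C=2)
V1 = compress_query(v0, C=2)
print(W1)  # [[5, 13, 13], [10, 6, 10]]
print(V1)  # [2, 11, 2]
```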
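The quantity studied here is the summed intersection of two vectors with the same superscript, i.e. the sum of their element-wise minima, written $\bigoplus$ (the notes call it "XOR", but it is not a bitwise XOR). At level 1, $W^{1}_{0} \bigoplus V^{1}$ is 15. At level 0, $W^{0}_{0} \bigoplus v^{0}$ is 11 and $W^{0}_{1} \bigoplus v^{0}$ is 8. Since 15 > max(11, 8), the value of the first row at level 1 is an upper bound for the first two rows at level 0. Likewise, the value of the second row at level 1 is an upper bound for rows 3 and 4 at level 0.
It follows that the $\bigoplus$ value of any row of $W^{0}$ never exceeds the $\bigoplus$ value of the row that covers it in a compressed level.
So, as needed, an upper bound on the $\bigoplus$ value of a group of graphs in $W^{i-1}$ can be obtained from the level above. Let $W^{i-1}_{r}$ denote row $r$ of the level-$(i-1)$ matrix. A row $W^{i}_{r}$ of the level-$i$ matrix then bounds the level-0 rows in the range $[r \cdot R^{i}, (r+1) R^{i}-1]$, where $R$ is the row-block size and rows are counted from 0; in the adjacent level $W^{i-1}$ it covers rows $[r \cdot R, (r+1) R-1]$.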
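The upper-bound property can be checked numerically on the example. This sketch hard-codes the level-0 and level-1 matrices from above; `oplus` and `covered_rows` are hypothetical helper names, and rows are counted from 0.

```python
def oplus(row, query):
    # Summed intersection: sum of element-wise minima.
    return sum(min(a, b) for a, b in zip(row, query))


def covered_rows(r, level, R):
    # Level-0 rows covered by row r of the given level (0-based, inclusive).
    return list(range(r * R ** level, (r + 1) * R ** level))


W0 = [
    [1, 2, 9, 4, 7, 1],
    [2, 3, 2, 2, 7, 6],
    [3, 3, 2, 1, 1, 9],
    [4, 6, 5, 1, 2, 1],
]
v0 = [1, 1, 3, 8, 1, 1]
W1 = [[5, 13, 13], [10, 6, 10]]
V1 = [2, 11, 2]

level0 = [oplus(row, v0) for row in W0]
level1 = [oplus(row, V1) for row in W1]
print(level0)  # [11, 8, 7, 8]
print(level1)  # [15, 10]

# Every level-1 value bounds the level-0 rows it covers.
for r, bound in enumerate(level1):
    assert all(level0[g] <= bound for g in covered_rows(r, 1, R=2))
```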

Question:
How is clustering applied to the RMS? The paper gives an answer, but I did not fully understand it.

The original text reads:
Firstly, we cluster graph feature vectors in W by the cardinality of their feature multisets, and sort each cluster by their cardinality. So graph feature vectors in the same cluster have the same cardinality and graph feature vectors in the adjacent clusters have similar cardinality.
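In plain terms:
First, the graph feature vectors in W are clustered by the cardinality of their feature multisets, and the clusters are sorted by that cardinality. Vectors in the same cluster therefore have the same cardinality, and vectors in adjacent clusters have similar cardinality. That is, first collect the graphs with 1 feature into cluster A, then the graphs with 2 features into cluster B, and so on. After that, put the clusters A, B, C, ... in order of their feature count.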

Secondly, we put similar graphs together within each cluster. We recognize high-frequency features from W firstly, and then sort feature vectors in each cluster by the sum of their high-frequency features. The reason of sorting feature vectors is based upon the following two observations: i) the distribution of features in W is not uniform; only a few of features occur in most of the graphs, which are denoted as high frequency features; ii) the average values of high-frequency features are greater than that of other features significantly, thus the similarity of two graphs is mainly determined by high-frequency features.
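Second, similar graphs are placed together within each cluster. The high-frequency features are first identified from W, and then the feature vectors in each cluster are sorted by the sum of their high-frequency features.

The sorting is motivated by two observations:
1) The distribution of features in W is not uniform: only a few features occur in most of the graphs, and these are the high-frequency features;
2) The average values of the high-frequency features are significantly larger than those of the other features, so the similarity of two graphs is mainly determined by the high-frequency features.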
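The two-stage ordering can be sketched as a single sort with a two-part key. Several things here are assumptions for illustration, not taken from the paper: the cardinality is taken to be the row sum, a feature counts as "high-frequency" when it occurs in at least a `freq_threshold` fraction of the graphs, and both sort directions are ascending.

```python
def order_feature_vectors(W, freq_threshold=0.7):
    n_graphs = len(W)
    n_features = len(W[0])

    # Assumed rule: a feature is high-frequency if it appears (count > 0)
    # in at least `freq_threshold` of all graphs.
    high_freq = [
        j for j in range(n_features)
        if sum(1 for row in W if row[j] > 0) / n_graphs >= freq_threshold
    ]

    def sort_key(i):
        cardinality = sum(W[i])                     # cluster key
        high_sum = sum(W[i][j] for j in high_freq)  # order inside a cluster
        return (cardinality, high_sum)

    order = sorted(range(n_graphs), key=sort_key)
    return order, high_freq


# Toy feature matrix: 5 graphs, 3 features (made-up numbers).
W = [
    [2, 0, 1],
    [1, 1, 0],
    [3, 0, 0],
    [1, 0, 1],
    [0, 2, 1],
]
order, high_freq = order_feature_vectors(W)
print(high_freq)  # [0]
print(order)      # [1, 3, 4, 0, 2]
```

Graphs 1 and 3 (cardinality 2) come first, followed by the cardinality-3 graphs ordered by their high-frequency sum.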

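Algorithm 2:
Input: an already constructed RMS (the hierarchical representative matrices), q (the query graph), k (the number of results)
Output: the k graphs most similar to q
An example. Suppose we already have hierarchical representative matrices with three levels, W0, W1, W2, and that the similarity upper bound between each level and the corresponding RVS q0, q1, q2 has been computed in advance and stored as triples (l, r, val), i.e. level, row, value. [handwritten figure omitted]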

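Initialize the heap P1 and put all triples of the last (top) level of the RMS into it (the top level of an RMS does not necessarily have only one row). In the example, P1 initially holds a single triple.
Initialize P2, which stores the similarities of the candidate graphs (triples whose level is 0) found during the search. It starts with k zeros.

Pop the triple with the largest val from P1 and check its l, i.e. whether the triple refers to a graph in the bottom level W0. If so, add the corresponding graph to the result set. If not, compute val for each row of the level below, W(l-1), that this triple covers. (Note: if l-1 = 0, val is a W0 value, i.e. the actual similarity between a graph and the query q; if l-1 != 0, it is an upper bound on the similarity between a set of graphs and q.) If a row's val is larger than the element at the top of heap P2 (presumably its smallest value, the current k-th best similarity), put that row's triple into P1. If the level below is W0, P2 must also be updated; otherwise it is not updated.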
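The search described above can be sketched with two heaps. This is one reading of the notes, not the paper's exact pseudocode: P2 is assumed to be a min-heap whose top is the k-th best similarity so far, bounds are computed on demand rather than precomputed, each row at level l is assumed to cover R consecutive rows at level l-1, and all names (`top_k_search`, `levels`, `queries`) are hypothetical.

```python
import heapq


def oplus(row, query):
    # Summed intersection: sum of element-wise minima.
    return sum(min(a, b) for a, b in zip(row, query))


def top_k_search(levels, queries, k, R):
    # levels[l] is W^l and queries[l] is the query vector compressed to level l.
    top = len(levels) - 1
    # P1: max-heap of (val, level, row); val is negated because heapq is a min-heap.
    p1 = [(-oplus(row, queries[top]), top, r) for r, row in enumerate(levels[top])]
    heapq.heapify(p1)
    # P2: min-heap with the k best exact similarities seen so far.
    p2 = [0] * k
    result = []
    while p1 and len(result) < k:
        neg_val, level, r = heapq.heappop(p1)
        if level == 0:
            # A real graph whose similarity beats every remaining bound.
            result.append((r, -neg_val))
            continue
        below = levels[level - 1]
        for child in range(r * R, min((r + 1) * R, len(below))):
            val = oplus(below[child], queries[level - 1])
            if val > p2[0]:
                heapq.heappush(p1, (-val, level - 1, child))
                if level - 1 == 0:
                    # Exact similarity: it replaces the current k-th best.
                    heapq.heapreplace(p2, val)
    return result


W0 = [
    [1, 2, 9, 4, 7, 1],
    [2, 3, 2, 2, 7, 6],
    [3, 3, 2, 1, 1, 9],
    [4, 6, 5, 1, 2, 1],
]
W1 = [[5, 13, 13], [10, 6, 10]]
v0 = [1, 1, 3, 8, 1, 1]
V1 = [2, 11, 2]

top2 = top_k_search([W0, W1], [v0, V1], k=2, R=2)
print(top2)  # [(0, 11), (1, 8)]
```

In this run the second level-1 row (bound 10) is expanded, but both of its graphs are pruned because their similarities (7 and 8) do not exceed the current k-th best value of 8.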

Question:
I understand the search procedure, but its intuitive meaning is still unclear to me.

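Experiments:
1) Building the feature tree
[figure omitted]
The paper builds features with WL (Weisfeiler-Lehman) subtree iteration. It defines the t-hop feature tree, which is another name for the k-adj tree (k-adj is not the same as this paper, but the idea used when building the index is the same). The experiments compare how t affects classification accuracy, with t set to 1, 2, 3, 4, 5, 6, 7, 8. At t = 3 the accuracy has already levelled off, so the following experiments use t = 3. Here I record how the patterns are represented when t = 2.
To create the t = 0 patterns of the water molecule (H2O), first turn the distinct vertex labels into numbers (e.g. H and O become different IDs such as 1 and 2; in the AIDS dataset these are numbers that uniquely identify the atom type). This gives the patterns pattern_0 = {1,2}, pattern_1 = {2,11}, pattern_2 = {1,2}; these three patterns are the 0-hop encoding. Then choose a hash function that maps patterns to IDs injectively, so that the two distinct patterns above get distinct IDs, e.g. hash(pattern_0) = 3, hash(pattern_1) = 4. The water molecule's labels then become 3 4 4, with neighbor sets {3,4,4}, {4,3}, {4,3} (as written in the notes; with hash(pattern_0) = 3 for H and hash(pattern_1) = 4 for O, the atom order H, O, H would give 3 4 3).
This is also the 1-hop pattern.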
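One relabeling round on the water molecule can be sketched as below. Instead of a real hash function, the sketch uses a dictionary that hands out fresh IDs in order of first appearance, which is injective by construction; the IDs therefore depend on the atom order (here H, O, H) and need not match the numbers in the notes. The function name and the adjacency-list format are assumptions.

```python
def wl_iteration(labels, adjacency):
    # Pattern of a vertex: its own label plus the sorted labels of its neighbors.
    patterns = [
        (labels[v], tuple(sorted(labels[u] for u in adjacency[v])))
        for v in range(len(labels))
    ]
    # Injective "hash": give each distinct pattern the next unused ID.
    table = {}
    next_id = max(labels) + 1
    new_labels = []
    for pattern in patterns:
        if pattern not in table:
            table[pattern] = next_id
            next_id += 1
        new_labels.append(table[pattern])
    return new_labels, table


# Water: atoms in the order H, O, H with H -> 1 and O -> 2.
labels = [1, 2, 1]
adjacency = [[1], [0, 2], [1]]  # each H is bonded to O; O is bonded to both H

new_labels, table = wl_iteration(labels, adjacency)
print(table)       # {(1, (2,)): 3, (2, (1, 1)): 4}
print(new_labels)  # [3, 4, 3]
```

Applying `wl_iteration` again to `new_labels` would give the next hop.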

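The referenced blog explains this process:
[figure omitted]
2) Effect of the compression block sizes R (row-block size) and C (column-block size) on query time and memory overhead.
Average query time: given 1000 query graphs and a database of 3000 graphs, measure the average time to return the results.
Memory overhead: this is not the memory occupied at run time but a ratio, namely the total size of the compressed (non-original) matrices divided by the total size of the original matrix. The experiments show that with the row-block size R fixed, the larger the column-block size C, the longer the query time.
The larger the column-block size C, the smaller the memory ratio, which is what one would expect.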
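The memory ratio can be sketched as a simple count of matrix entries. This assumes that size is measured in number of entries, that every level shrinks the previous one by a factor of R in rows and C in columns (rounded up), and that the number of compressed levels is given; `memory_ratio` is a hypothetical name.

```python
import math


def memory_ratio(n_rows, n_cols, R, C, n_levels):
    # Entries of all compressed levels divided by entries of the original matrix.
    original = n_rows * n_cols
    compressed = 0
    rows, cols = n_rows, n_cols
    for _ in range(n_levels):
        rows = math.ceil(rows / R)
        cols = math.ceil(cols / C)
        compressed += rows * cols
    return compressed / original


# The 4 x 6 example matrix with one compressed level.
ratio_c2 = memory_ratio(4, 6, R=2, C=2, n_levels=1)
ratio_c3 = memory_ratio(4, 6, R=2, C=3, n_levels=1)
print(ratio_c2)  # 0.25
print(ratio_c3 < ratio_c2)  # True
```

With R fixed, a larger C leaves fewer columns per compressed level, so the ratio drops, in line with the observation above.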
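Question: in experiment 2), the larger R and C are, the smaller the RMS becomes, so queries should get faster; why do the experiments show the opposite for query time?
[figure omitted]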

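3) Effect of clustering the graph feature vectors on query time. Metric: time

When the RMS is built, the graph vectors are not simply arranged by graph id. The authors introduce an improvement here:
First, graph vectors with the same number of features (i.e. the same row sum) are clustered together (as explained above), and the clusters are then sorted by feature count. Within each of these sorted clusters, the vectors are further ordered by the sum of each graph's high-frequency features.
Statistics on the high-frequency features:
[figure omitted]
The figure is not very intuitive. Its point is that high-frequency features are few in number, yet they determine the similarity between two graphs.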

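Question: what is the purpose of the vertical axis "average values" in panel (b)?
[figure omitted]
On the AIDS dataset, 1000 graphs are selected as query graphs and the remaining 40000 serve as the database; the top 1, top 5, top 10 and top 50 graphs most similar to each query are searched. Finding the 50 most similar graphs takes no more than 25 ms. (This seems unreliable to me.)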

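On the NCI dataset (about 4000 graphs), 1/5 of the data (about 4000 * 0.2 = 800 graphs) is selected as query graphs and the remaining graphs (about 3200) serve as the database; the top 1, 5, 10 and 50 most similar graphs are searched.

On the NCI109 dataset (about 4000 graphs), the same procedure is applied.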

