Suffix Trees and Suffix Arrays

Suffix trees:
A suffix tree is a data structure that supports efficient string matching and queries.

The suffix tree T of a string S with m words is a rooted directed tree with exactly m leaves, numbered 1 to m. Every internal node other than the root has at least two children, and every edge is labeled with a nonempty substring of S. No two edges leaving the same node have labels that begin with the same word. The key property of a suffix tree is that for every leaf i, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix of S starting at position i, i.e. S[i..m]. The label of a node is defined as the concatenation of the labels of all edges on the path from the root to that node.

The figure illustrates the suffix tree of the string "I know you know I know ". Internal nodes are drawn as circles and leaves as rectangles; in this example there are six leaves, numbered 1 to 6. The terminal character is omitted from the figure.

Similarly, a suffix tree built over several strings is called a generalized (extended) suffix tree. Given n strings S1, ..., Sn, where string Sk has length mk, their generalized suffix tree T is a rooted directed tree with m1 + m2 + ... + mn leaves, each labeled with a two-number tuple (k, l), where k ranges from 1 to n and l ranges from 1 to mk. Every internal node other than the root has at least two children, and every edge is labeled with a nonempty substring of words from the strings; no two edges leaving the same node have labels whose first word is the same. For any leaf (i, j), the concatenation of the edge labels on the path from the root to that leaf spells out exactly the suffix of Si starting at position j, that is, Si[j..mi].

In string processing, suffix trees and suffix arrays are both very powerful tools. Suffix trees are relatively well known, while suffix arrays appear far less often in the Chinese-language literature. In fact, the suffix array is a remarkably elegant substitute for the suffix tree: it is easier to implement, supports many of the suffix tree's operations with time complexity that is not much worse, and uses far less space. For programming-contest purposes, the suffix array is arguably the more practical of the two. This article therefore introduces the basic concepts of suffix arrays, how to construct them, and how to construct the companion longest-common-prefix array, and finally discusses applications of suffix arrays through some examples.


Suffix arrays:

First, some necessary definitions:

Alphabet. An alphabet Σ is a totally ordered set: any two distinct elements α and β of Σ can be compared, with either α < β or β < α (i.e. α > β). The elements of Σ are called characters.
String. A string S is an array of n characters arranged in order; n is called the length of S, written len(S). The i-th character of S is written S[i].
Substring. The substring S[i..j] of S, where i ≤ j, is the stretch of S from position i to position j, i.e. the string formed by S[i], S[i+1], ..., S[j] in order.
Suffix. A suffix is the special substring running from some position i to the end of the string. The suffix of S starting at i is written Suffix(S, i), so Suffix(S, i) = S[i..len(S)].

String comparison here means the usual "dictionary order" (lexicographic) comparison: for two strings u and v, compare u[i] and v[i] with i starting at 1; if they are equal, increment i; otherwise u[i] < v[i] means u < v, and u[i] > v[i] means u > v (i.e. v < u), and the comparison ends. If i exceeds len(u) or len(v) without a decision, then len(u) < len(v) means u < v, len(u) = len(v) means u = v, and len(u) > len(v) means u > v.
By this definition, two suffixes u and v of S starting at different positions can never compare equal, because the necessary condition len(u) = len(v) for u = v cannot hold here.

From here on, fix an alphabet Σ and a string S with len(S) = n and S[n] = '$'; that is, S ends with a special character '$' that is smaller than every character of Σ. Apart from S[n], every character of S belongs to Σ. For this fixed S, the suffix starting at position i is written simply Suffix(i), dropping the parameter S.

Suffix array. The suffix array SA is a one-dimensional array holding a permutation SA[1], SA[2], ..., SA[n] of 1..n such that Suffix(SA[i]) < Suffix(SA[i+1]) for 1 ≤ i < n. In other words, sort the n suffixes of S in increasing order and place the starting positions of the sorted suffixes into SA in that order.
Rank array. The rank array Rank is the inverse of SA, i.e. Rank = SA⁻¹: if SA[i] = j then Rank[j] = i. Clearly Rank[i] records the "rank" of Suffix(i) among all suffixes sorted in increasing order.
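As a concrete illustration of these two definitions, the following sketch builds SA and Rank naively by sorting the suffixes as plain strings (0-based indices, unlike the 1-based convention in the text; note that ASCII '$' is smaller than every letter, matching the convention above):

```python
def naive_suffix_array(s):
    n = len(s)
    # sort suffix start positions by comparing the suffixes as plain strings
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r  # Rank is the inverse permutation of SA
    return sa, rank

# sorted suffixes of "banana$": $, a$, ana$, anana$, banana$, na$, nana$
sa, rank = naive_suffix_array("banana$")
print(sa)    # [6, 5, 3, 1, 0, 4, 2]
print(rank)  # [4, 3, 6, 2, 5, 1, 0]
```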


Construction
How do we construct the suffix array? The most direct and simple approach is to treat the suffixes of S as ordinary strings and sort them from smallest to largest with a general string sort.
Clearly this is clumsy: it makes no use of the intimate relationships between the suffixes, so it cannot be very efficient. Even with a comparatively fast string sort such as multi-key quicksort, the worst-case time complexity is still O(n²), which does not meet our needs.
Below we introduce the doubling algorithm, which exploits exactly those relationships between the suffixes and brings the worst-case construction time down to O(n log n).

For a string u, define the k-prefix of u, written u_k, as u itself if len(u) ≤ k, and as u[1..k] otherwise.

Define the k-prefix comparison relations <_k, =_k and ≤_k:
for two strings u and v,
u <_k v  if and only if  u_k < v_k
u =_k v  if and only if  u_k = v_k
u ≤_k v  if and only if  u_k ≤ v_k

Intuitively, these subscripted comparison operators perform a dictionary-order comparison of the first k characters of the two strings. Note in particular that for the strict comparisons it does not matter if one of the strings is shorter than k, as long as one string is decided greater or smaller before k characters have been compared.
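In code, the k-prefix relations reduce to comparing truncated strings (a minimal sketch; the helper names are mine):

```python
def k_prefix(u, k):
    # Python slicing already returns all of u when len(u) <= k
    return u[:k]

def less_k(u, v, k):   # u <_k v
    return k_prefix(u, k) < k_prefix(v, k)

def eq_k(u, v, k):     # u =_k v
    return k_prefix(u, k) == k_prefix(v, k)

print(less_k("abcx", "abd", 2))  # False: "ab" vs "ab"
print(eq_k("abcx", "abd", 2))    # True
print(less_k("abcx", "abd", 3))  # True: "abc" < "abd"
```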
From the properties of the prefix comparison operators we obtain the following very important facts:
Property 1.1. For k ≥ n, Suffix(i) <_k Suffix(j) is equivalent to Suffix(i) < Suffix(j).
Property 1.2. Suffix(i) =_2k Suffix(j) is equivalent to
Suffix(i) =_k Suffix(j) and Suffix(i+k) =_k Suffix(j+k).
Property 1.3. Suffix(i) <_2k Suffix(j) is equivalent to
Suffix(i) <_k Suffix(j) or (Suffix(i) =_k Suffix(j) and Suffix(i+k) <_k Suffix(j+k)).
One might worry that Suffix(i+k) or Suffix(j+k) is an undefined expression when i+k > n or j+k > n. In fact this case never needs to be considered: it means the length of Suffix(i) or Suffix(j) is at most k, so its k-prefix ends with '$', and a k-prefix comparison then cannot come out equal. In other words, the first k characters already decide the comparison, and the second clause can be ignored. This is precisely the special purpose of requiring S to end with '$'.

k-suffix array. SA_k holds a permutation SA_k[1], SA_k[2], ..., SA_k[n] of 1..n such that Suffix(SA_k[i]) ≤_k Suffix(SA_k[i+1]) for 1 ≤ i < n. That is, sort all suffixes under the k-prefix comparison relation and place the starting positions of the sorted suffixes into SA_k in order.
k-rank array. Rank_k[i] is the "rank" of Suffix(i) under the k-prefix relation, i.e. 1 plus the number of positions j with Suffix(j) <_k Suffix(i). Given SA_k, Rank_k is easily computed in O(n) time.
Suppose SA_k and Rank_k are already known. Then SA_2k and Rank_2k are easy to obtain, because by Properties 1.2 and 1.3 the 2k-prefix comparison can be expressed equivalently as a constant number of k-prefix comparisons, and the Rank_k array gives a constant-time way to evaluate <_k and =_k:
Suffix(i) <_k Suffix(j) if and only if Rank_k[i] < Rank_k[j]
Suffix(i) =_k Suffix(j) if and only if Rank_k[i] = Rank_k[j]
Hence comparing Suffix(i) and Suffix(j) under the 2k-prefix relation takes constant time, and sorting all suffixes under ≤_2k becomes no different from ordinary sorting: each Suffix(i) effectively has a primary key Rank_k[i] and a secondary key Rank_k[i+k]. Using an O(n log n) sort such as quicksort, constructing SA_2k from SA_k and Rank_k costs O(n log n); a smarter choice is radix sort, which costs O(n).
Once SA_2k is known, Rank_2k can be built from it in O(n) time. Thus the step from SA_k and Rank_k to SA_2k and Rank_2k can be done in O(n) time.
Only one question remains: how to construct SA_1 and Rank_1. This is very simple: the operators <_1, =_1 and ≤_1 just compare the first characters of the strings, so sorting the suffixes by their first character yields SA_1; quicksort will do, at a cost of O(n log n).
Thus SA_1 and Rank_1 can be computed in O(n log n) time.
From SA_1 and Rank_1 we obtain SA_2 and Rank_2 in O(n) time, then SA_4 and Rank_4 in another O(n) time, and so on, successively computing
SA_2 and Rank_2, SA_4 and Rank_4, SA_8 and Rank_8, ..., up to SA_m and Rank_m, where m = 2^k and m ≥ n. By Property 1.1, SA_m is equivalent to SA. This takes log n rounds of O(n) work each, so
the suffix array SA and the rank array Rank can be computed in O(n log n) time.
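The doubling rounds above can be sketched as follows. For simplicity this version sorts each round with Python's built-in sort (the comparison-sort variant described in the text), giving O(n log² n) overall; replacing the per-round sort with radix sort would yield the O(n log n) bound. Indices are 0-based, and a sentinel -1 plays the role of '$' past the end of the string.

```python
def suffix_array_doubling(s):
    n = len(s)
    if n == 0:
        return []
    rank = [ord(c) for c in s]  # round k=1: rank each suffix by its first character
    sa = list(range(n))
    k = 1
    while True:
        # key of suffix i: (Rank_k[i], Rank_k[i+k]); -1 stands in for '$'
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for t in range(1, n):
            # rank increases exactly when the (primary, secondary) key changes
            new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: SA_2k is already SA
            return sa
        k *= 2

print(suffix_array_doubling("banana"))  # [5, 3, 1, 0, 4, 2]
```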

Longest common prefix
The suffix array SA of a string S can now be computed in O(n log n) time. With SA alone we can already do a lot, for example pattern matching in O(m log n) time, where m and n are the lengths of the pattern and the text respectively. But to unleash the suffix array's full power we need one more auxiliary tool: the longest common prefix (LCP).
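The O(m log n) pattern matching just mentioned works by binary-searching for the pattern among the sorted suffixes, comparing at most m characters per probe. A minimal sketch (0-based; `sa` is assumed to be a correct suffix array of `s`):

```python
def occurs(s, sa, p):
    lo, hi = 0, len(sa)
    while lo < hi:  # find the leftmost suffix whose m-prefix is >= p
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    # p occurs in s iff that suffix starts with p
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(p)] == p

s = "banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # any correct SA works here
print(occurs(s, sa, "ana"))  # True
print(occurs(s, sa, "nab"))  # False
```

Finding all z occurrences is the same idea with a second binary search for the right boundary, at O(m log n + z) total cost.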
For two strings u and v define lcp(u, v) = max{i | u =_i v}: compare u and v character by character from the beginning; the largest position up to which the corresponding characters keep agreeing is the length of the longest common prefix of the two strings.
For integers i, j in 1..n define LCP(i, j) = lcp(Suffix(SA[i]), Suffix(SA[j])), the length of the longest common prefix of the i-th and j-th suffixes in the suffix array.
Two obvious properties of LCP:
Property 2.1. LCP(i, j) = LCP(j, i).
Property 2.2. LCP(i, i) = len(Suffix(SA[i])) = n - SA[i] + 1.
Thanks to these, we only need to handle the case i < j when computing LCP(i, j): if i > j we can swap i and j, and if i = j we can output n - SA[i] + 1 directly.

Computing LCP(i, j) straight from the definition by comparing corresponding characters is clearly inefficient, taking O(n) time per query, so we must do some suitable preprocessing to lower the per-query cost.
A careful analysis reveals that the LCP function has a very nice property:
For i < j, LCP(i, j) = min{LCP(k-1, k) | i+1 ≤ k ≤ j}. (LCP Theorem)

To prove the LCP Theorem, we first prove the LCP Lemma:
For any 1 ≤ i < j < k ≤ n, LCP(i, k) = min{LCP(i, j), LCP(j, k)}.
Proof. Let p = min{LCP(i, j), LCP(j, k)}, so LCP(i, j) ≥ p and LCP(j, k) ≥ p.
Write u = Suffix(SA[i]), v = Suffix(SA[j]), w = Suffix(SA[k]).
From u =_{LCP(i,j)} v we get u =_p v; similarly v =_p w.
Hence Suffix(SA[i]) =_p Suffix(SA[k]), i.e. LCP(i, k) ≥ p. (1)

Now suppose LCP(i, k) = q > p. Then
u[1] = w[1], u[2] = w[2], ..., u[q] = w[q].
But min{LCP(i, j), LCP(j, k)} = p means u[p+1] ≠ v[p+1] or v[p+1] ≠ w[p+1].
Let x = u[p+1], y = v[p+1], z = w[p+1]; since u ≤ v ≤ w, clearly x ≤ y ≤ z. From p < q we get p+1 ≤ q, so x = z, and therefore x = y = z, contradicting u[p+1] ≠ v[p+1] or v[p+1] ≠ w[p+1].
So q > p is impossible, i.e. LCP(i, k) ≤ p. (2)
Combining (1) and (2), LCP(i, k) = p = min{LCP(i, j), LCP(j, k)}, proving the LCP Lemma.

The LCP Theorem can then be proved as follows:
For j - i = 1 and j - i = 2 it clearly holds.
Assume the LCP Theorem holds for j - i = m. For j - i = m + 1,
the LCP Lemma gives LCP(i, j) = min{LCP(i, i+1), LCP(i+1, j)};
since j - (i+1) ≤ m, LCP(i+1, j) = min{LCP(k-1, k) | i+2 ≤ k ≤ j}, so for j - i = m + 1 we still have
LCP(i, j) = min{LCP(i, i+1), min{LCP(k-1, k) | i+2 ≤ k ≤ j}} = min{LCP(k-1, k) | i+1 ≤ k ≤ j}.
By induction, the LCP Theorem holds.

An immediate corollary of the LCP Theorem:
LCP Corollary. For i ≤ j < k, LCP(j, k) ≥ LCP(i, k).

Define a one-dimensional array height with height[i] = LCP(i-1, i) for 1 < i ≤ n, and set height[1] = 0.
By the LCP Theorem, LCP(i, j) = min{height[k] | i+1 ≤ k ≤ j}; that is, computing LCP(i, j) amounts to querying the minimum of the elements of height whose indices lie in the range i+1 to j. Since the height array is fixed, this is the classic RMQ (Range Minimum Query) problem.
RMQ can be preprocessed with a segment tree (or a similar static structure) in O(n log n) time, after which each query costs O(log n); better still, the standard RMQ algorithm preprocesses in O(n) time and answers each query in constant time.
For a fixed string S the height array is obviously fixed, so as long as we can compute height efficiently, RMQ preprocessing makes every subsequent LCP(i, j) query constant-time. Only one problem remains: how to compute the height array as efficiently as possible.
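A sketch of the sparse-table variant of RMQ over `height`: O(n log n) preprocessing and O(1) per query, so each LCP(i, j) costs constant time once height is known (0-based here, with height[r] = LCP of the suffixes at ranks r-1 and r, and height[0] = 0):

```python
def build_sparse_table(height):
    n = len(height)
    table = [height[:]]  # table[j][i] = min of height[i .. i + 2^j - 1]
    j = 1
    while (1 << j) <= n:
        prev = table[-1]
        table.append([min(prev[i], prev[i + (1 << (j - 1))])
                      for i in range(n - (1 << j) + 1)])
        j += 1
    return table

def range_min(table, l, r):
    # min of height[l..r] inclusive, l <= r: two overlapping power-of-two blocks
    j = (r - l + 1).bit_length() - 1
    return min(table[j][l], table[j][r - (1 << j) + 1])

# height array of "banana": sorted suffixes a, ana, anana, banana, na, nana
height = [0, 1, 3, 0, 0, 2]
table = build_sparse_table(height)
# LCP of ranks 1 and 2 ("ana" vs "anana") = min(height[2..2]):
print(range_min(table, 2, 2))  # 3
# LCP of ranks 1 and 5 ("ana" vs "nana") = min(height[2..5]):
print(range_min(table, 2, 5))  # 0
```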
As with computing the suffix array itself, we should not treat the n suffixes as unrelated ordinary strings, but exploit the connections between them as much as possible. We now prove a very useful property.
For convenience, define h[i] = height[Rank[i]], i.e. height[i] = h[SA[i]]. The h array satisfies:
Property 3. For i > 1 with Rank[i] > 1, h[i] ≥ h[i-1] - 1.
To prove Property 3, we need to establish two facts.

Let i < n, j < n, and suppose Suffix(i) and Suffix(j) satisfy lcp(Suffix(i), Suffix(j)) > 1. Then:
Fact 1. Suffix(i) < Suffix(j) is equivalent to Suffix(i+1) < Suffix(j+1).
Fact 2. lcp(Suffix(i+1), Suffix(j+1)) = lcp(Suffix(i), Suffix(j)) - 1.
This may look magical but is in fact natural: lcp(Suffix(i), Suffix(j)) > 1 means Suffix(i) and Suffix(j) begin with the same character, say α; then Suffix(i) is α followed by Suffix(i+1), and Suffix(j) is α followed by Suffix(j+1). When comparing Suffix(i) and Suffix(j), the first character α always ties, so the rest of the comparison is equivalent to comparing Suffix(i+1) and Suffix(j+1); this gives Fact 1. Fact 2 is proved similarly.

Property 3 can now be proved:
If h[i-1] ≤ 1, the claim is trivial, since h[i] ≥ 0 ≥ h[i-1] - 1.
If h[i-1] > 1, i.e. height[Rank[i-1]] > 1, then Rank[i-1] > 1, because height[1] = 0.
Let j = i - 1 and k = SA[Rank[j] - 1]. Clearly Suffix(k) < Suffix(j).
From h[i-1] = lcp(Suffix(k), Suffix(j)) > 1 and Suffix(k) < Suffix(j):
by Fact 2, lcp(Suffix(k+1), Suffix(i)) = h[i-1] - 1;
by Fact 1, Rank[k+1] < Rank[i], i.e. Rank[k+1] ≤ Rank[i] - 1.
Then by the LCP Corollary,
LCP(Rank[i]-1, Rank[i]) ≥ LCP(Rank[k+1], Rank[i])
= lcp(Suffix(k+1), Suffix(i))
= h[i-1] - 1.
Since h[i] = height[Rank[i]] = LCP(Rank[i]-1, Rank[i]), we finally obtain h[i] ≥ h[i-1] - 1.

By Property 3, we can let i run from 1 to n and compute each h[i] as follows:
If Rank[i] = 1, then h[i] = 0. Character comparisons: 0.
If i = 1 or h[i-1] ≤ 1, compare Suffix(i) and Suffix(SA[Rank[i]-1]) character by character from the first character until a mismatch, yielding h[i]. Character comparisons: h[i] + 1, which is at most h[i] - h[i-1] + 2.
Otherwise i > 1, Rank[i] > 1 and h[i-1] > 1, and by Property 3, Suffix(i) and Suffix(SA[Rank[i]-1]) agree on at least their first h[i-1] - 1 characters, so the character comparison can start from position h[i-1] and continue until a mismatch, yielding h[i]. Character comparisons: h[i] - h[i-1] + 2.

Let SA[1] = p, so h[p] = 0. It is then not hard to see that the total number of character comparisons is bounded by the telescoping sum Σ_{i=1..n} (h[i] - h[i-1] + 2) = h[n] - h[0] + 2n ≤ 3n (taking h[0] = 0 and using h[n] ≤ n).
In other words, the whole algorithm runs in O(n) time.
Once the h array is known, the relation height[i] = h[SA[i]] gives the height array in O(n) additional time; therefore
the height array can be computed in O(n) time.
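The O(n) procedure just described (essentially the scheme published later as Kasai's algorithm) can be sketched as follows, 0-based, computing height[r] = lcp of the suffixes at ranks r-1 and r by walking i upward and reusing the h[i-1] - 1 matched characters that Property 3 guarantees:

```python
def build_height(s, sa):
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r                      # rank is the inverse of sa
    height = [0] * n
    h = 0                                # carries h[i-1] - 1 into the next round
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]          # suffix immediately before Suffix(i) in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1                   # extend the match from the inherited offset
            height[rank[i]] = h
            if h > 0:
                h -= 1                   # Property 3: the next h drops by at most 1
        else:
            h = 0                        # smallest suffix: no predecessor, restart
    return height

print(build_height("banana", [5, 3, 1, 0, 4, 2]))  # [0, 1, 3, 0, 0, 2]
```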

Combined with the RMQ method, after O(n) time and space preprocessing we can compute LCP(i, j) for any pair (i, j) in constant time.
Since lcp(Suffix(i), Suffix(j)) = LCP(Rank[i], Rank[j]), we can therefore obtain the longest common prefix of any two suffixes of S in constant time. This is one of the key reasons the suffix array is such a powerful tool for so many string problems.

Wikipedia excerpt:
Suffix tree
[Figure] Suffix tree for the string BANANA padded with $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the boxes give the start position of the corresponding suffix. Suffix links are drawn dashed.

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a certain data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.

The suffix tree for a string S is a tree whose edges are labeled with strings, and such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree for the suffixes of S.

Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.

History

The concept was first introduced as a position tree by Weiner in 1973[1] in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976[2], and also by Ukkonen in 1995[3][4]. Ukkonen provided the first linear-time online construction of suffix trees, now known as Ukkonen's algorithm.

Definition

The suffix tree for the string S of length n is defined as a tree such that ([5] page 90):

  • the paths from the root to the leaves have a one-to-one relationship with the suffixes of S,
  • edges spell non-empty strings,
  • and all internal nodes (except perhaps the root) have at least two children.

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n - 1 such nodes, and n + (n - 1) + 1 = 2n nodes in total.
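To make the definition and the role of the $ pad concrete, here is a naive O(n²) suffix *trie* (edges labeled with single characters rather than the compressed substrings of a true suffix tree; real suffix trees compress unary paths and are built in linear time by e.g. Ukkonen's algorithm, which this toy sketch does not attempt):

```python
def build_suffix_trie(s):
    s += "$"                             # terminal pad: no suffix is a prefix of another
    root = {}
    for i in range(len(s)):              # insert suffix s[i:]
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
        node["#leaf"] = i                # record the suffix's start position at its leaf
    return root

def is_substring(trie, p):
    # every substring of s is a prefix of some suffix, i.e. a path from the root
    node = trie
    for c in p:
        if c not in node:
            return False
        node = node[c]
    return True

trie = build_suffix_trie("BANANA")
print(is_substring(trie, "NAN"))  # True
print(is_substring(trie, "NAB"))  # False
```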

Suffix links are a key feature for linear-time construction of the tree. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

Functionality

A suffix tree for a string S of length n can be built in Θ(n) time, if the alphabet is constant or integer [6]. Otherwise, the construction time depends on the implementation. The costs below are given under the assumption that the alphabet is constant. If it is not, the cost depends on the implementation (see below).

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S1,S2,...,SK} of total length n = |S1| + |S2| + ... + |SK|. You can:

  • Search for strings:
    • Check if a string P of length m is a substring in O(m) time ([5] page 92).
    • Find the first occurrence of the patterns P1,...,Pq of total length m as substrings in O(m) time, when the suffix tree is built using Ukkonen's algorithm.
    • Find all z occurrences of the patterns P1,...,Pq of total length m as substrings in O(m + z) time ([5] page 123).
    • Search for a regular expression P in time expected sublinear on n ([7]).
    • Find for each suffix of a pattern P, the length of the longest match between a prefix of P[i...m] and a substring in D in Θ(m) time ([5] page 132). This is termed the matching statistics for P.
  • Find properties of the strings:
    • Find the longest common substrings of the string Si and Sj in Θ(ni + nj) time ([5] page 125).
    • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time ([5] page 144).
    • Find the Lempel-Ziv decomposition in Θ(n) time ([5] page 166).
    • Find the longest repeated substrings in Θ(n) time.
    • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
    • Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
    • Find the shortest substrings occurring only once in Θ(n) time.
    • Find, for each i, the shortest substrings of Si not occurring elsewhere in D in Θ(n) time.

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time ([5] chapter 8). You can then also:

  • Find the longest common prefix between the suffixes Si[p..ni] and Sj[q..nj] in Θ(1) ([5] page 196).
  • Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits ([5] page 200).
  • Find all z maximal palindromes in Θ(n) ([5] page 198), or Θ(gn) time if gaps of length g are allowed, or Θ(kn) if k mismatches are allowed ([5] page 201).
  • Find all z tandem repeats in O(n log n + z), and k-mismatch tandem repeats in O(kn log(n/k) + z) ([5] page 204).
  • Find the longest substrings common to at least k strings in D for k = 2..K in Θ(n) time ([5] page 205).

Uses

Suffix trees are often used in bioinformatics applications, where they are used for searching for patterns in DNA or protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. It is also used in data compression: on the one hand it is used to find repeated data, and on the other hand it can be used for the sorting stage of the Burrows-Wheeler transform. Variants of the LZ family of compression schemes use it (for instance LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines.

Implementation

If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The total length of the edges in the tree is O(n²), but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words. The worst-case space usage of a suffix tree is seen with a Fibonacci string, giving the full 2n nodes.

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has a pointer to its first child, and to the next node in the child list it is a part of. Hash maps, sorted/unsorted arrays (with array doubling), and balanced search trees may also be used, giving different running time properties. We are interested in:

  • The cost of finding the child on a given character.
  • The cost of inserting a child.
  • The cost of enlisting all children of a node (divided by the number of children in the table below).

Let σ be the size of the alphabet. Then you have the following costs:

                                    Lookup     Insertion  Traversal
  Sibling lists / unsorted arrays   O(σ)       Θ(1)       Θ(1)
  Hash maps                         Θ(1)       Θ(1)       O(σ)
  Balanced search trees             O(log σ)   O(log σ)   O(1)
  Sorted arrays                     O(log σ)   O(σ)       O(1)
  Hash maps + sibling lists         O(1)       O(1)       O(1)

Note that the insertion cost is amortised, and that the costs for hashing assume perfect hashing.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

References

  1. P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory: 1–11.
  2. Edward M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262–272.
  3. E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249–260.
  4. R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331–353.
  5. Gusfield, Dan (1999). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. ISBN 0-521-58519-8.
  6. Martin Farach (1997). "Optimal suffix tree construction with large alphabets". Foundations of Computer Science, 38th Annual Symposium on: 137–143.
  7. Ricardo A. Baeza-Yates and Gaston H. Gonnet (1996). "Fast text searching for regular expressions or automaton searching on tries". Journal of the ACM 43: 915–936. DOI:10.1145/235809.235810. ISSN 0004-5411.

