23.PHP內核探索:翻譯一篇HashTables文章

In case you ever heard me talking about PHP internals I certainly mentioned something along the lines of "Everything in PHP is a HashTable" that's not true, but next to a zval the HashTable is one of the most important structures in PHP. While preparing my "PHP Under The Hood" talk for the Dutch PHP Conference there was a question on IRC about extension_loaded() being faster thanfunction_exists(), which might be strange as both of them are simple hash lookups and a hash lookup is said to be O(1). I started to write some slides for it but figured out that I won't have the time to go through it during that presentation, so I'm doing this now, here:

如果你以前看過我寫的PHP內核相關的內容,那麼你一定記得我說過PHP中的任何東西都是哈希表。這不一定是真的,但是除了zval,哈希表確實是PHP中的最重要的結構了。在爲Dutch PHP 大會準備"PHP Under The Hood" 演講時,IRC上有個關於extension_loaded()比function_exists()快的問題,這讓人感到疑惑,它們底層都是簡單的哈希查找,而哈希查找的時間複雜度爲O(1)。我寫了一些相關的幻燈片,但是那個大會上已經沒有時間了,所以放在了這裏。

You all should know PHP arrays. They allow you to create a list of elements where every element may be identified by a key. This key may either be an integer or a string. Now we need a way to store this association in an effective way in memory. An efficient way to store a collection of data in memory is a "real" array, an array of elements indexed by a sequence of numbers, starting at 0, up to n. As memory is essentially a sequence of numbered storage blocks (this obviously is simplified, there might be segments and offsets, there might be virtual pages, etc.) we can efficiently access an element by knowing the address of the first element, the size of the elements (we assume all have the same size, which is the case here), and the index: The address of the n-th element is start_address + (n * size_of_element). That's basically what C arrays are.

大家都應該知道PHP數組,它可以保存一系列元素,並且可以用一個key來標識每個元素,這個key可以是整數或字符串。因而我們需要在內存中有一個高效的方案來保存這種關聯關係。其中一個高效的方案是把內存中的這些數據存儲爲”真正“的數組,一個從索引0開始到n的連續序列。內存從本質上講就是一個連續的存儲塊(這麼說是簡化了的結果,其實這裏有段、偏移以及虛擬頁等概念)。我們可以快速地訪問一個元素:只要我們知道第一個元素的地址,元素的大小(這裏假設每個元素大小一致)和待查找元素的索引值。第n個元素的地址計算公式爲:第一個元素地址 + (n * 元素大小),C語言中的數組就是這麼實現的。

Now we're not dealing with C and C arrays but also want to use string offsets so we need a function to calculate a numeric value, a hash, for each key. An hash function you most likely know is MD5, MD5 is creating a 128 bit numeric value which is often represented using 32 hexadecimal characters. For our purpose 128 bit is a bit much and MD5 is too slow so the PHP developers have chosen the "DJBX33A (Daniel J. Bernstein, Times 33 with Addition)" algorithm. This hash function is relatively fast and gives us an integer of the value, the trouble with this algorithm is that it is more likely to produce conflicts, that means to string values which create the same numeric value.

現在我們處理的不是C語言和它的數組,我們想用字符串偏移量,所以我們需要一個哈希函數,爲每個鍵值計算出一個數值。大家最熟悉的哈希函數可能是MD5,它產生一個128比特的數值,且通常用32個十六進制的字符表示。對於我們的目的而言,128比特的數值太大了,MD5計算過程緩慢,因而PHP開發者選擇了"DJBX33A (Daniel J. Bernstein, Times 33 with Addition)"算法。這個哈希函數相對快地從一個字符串計算出一個整數,缺點是衝突的概率增大了,即不同的字符串產生了相同的數值。

Now back to our C array: For being able to safely read any element, to see whether it is used, we need to pre-allocate the entire array with space for all possible elements. Given our hash function returning a system dependent (32 bit or 64 bit) integer this is quite a lot (size of an element multiplied wit the max int value), so PHP does another trick: It throws some digits away. For this a table mask is calculated. The a table mask is a number which is a power of two and then one subtracted and ideally higher than the number elements in the hash table. If one looks at this in binary representation this is a number where all bits are set. Doing a binary AND operation of our hash value and the table this gives us a number which is smaller than our table mask. Let's look at an example: The hash value of "foobar" equals, in decimal digits, to 3127933054. We assume a table mask of 2047 (2¹¹-1).

回到C數組:爲了安全地讀到任何一個元素,不管它是否用到,我們需要爲所有可能元素預先分配空間。假設我們的哈希函數返回一個系統相關(32位或64位)的整數,數值非常大,這時PHP需要另外一個技巧:拋棄一些比特。這是通過表掩碼實現的,表掩碼是這樣一個數,不小於哈希表的元素個數值且爲2的次方減一。如果查看它的二進制表現形式,即各個位都設置爲1的數。把哈希值和表掩碼做二進制與操作,得到一個小於表掩碼的數值。如:"foobar"的十進制格式的哈希值爲3127933054,假設表掩碼爲2047 (2¹¹-1)。

  3127933054    10111010011100000111100001111110
& 2047 00000000000000000000011111111111
=        126    00000000000000000000000001111110

Wonderful - we have an array Index, 126, for this string and can set the value in memory!

好——我們得到這個字符串的數組索引值126,就可以在內存中操作了。

If life were that easy. Remember: We used a hashing function which is by far not collision free and then dropped almost two thirds of the binary digits by using the table mask. This makes it rather likely that some collisions appear. These collisions are handled by storing values with the same hash in a linked list. So for accessing the an element one has to

  1. Calculate the hash
  2. Apply the table mask
  3. locate the memory offset
  4. check whether this is the element we need, traverse the linked list.

生活並不如此簡單,我們使用的哈希函數不能避免碰撞,並通過表掩碼拋棄了幾乎三分之二的位,這更增大了碰撞的機率。解決辦法是把相同哈希值的元素保存在鏈表中,所以訪問一個元素的時候:

  1. 計算哈希值
  2. 應用表掩碼
  3. 找到內存位置
  4. 遍歷鏈表,檢查是否是查找的元素

Now the question initially asked was why extension_loaded() might be faster than function_exists() and we can tell: For many random reasons and since you have chosen a value which probably conflicts in the first, not in the second. So now the question is how often such collisions happen for this I did a quick analysis: Given the function table of a random PHP build of mine with 1106 functions listed I have 634 unique hash values and 210 hash values calculated from different functions. Worst is the value of 471 which represents 6 functions.

現在可以回答文章開頭提出的爲什麼extension_loaded()可能比function_exists() 快的問題了:它可以在第一次查找時即找到對應元素。現在的問題是這種衝突的概率有多大,我做了一個快速的分析:我的PHP構建版本的函數表中有1106個函數,計算得到634個唯一的哈希值,210個有衝突的哈希值,最差的哈希值471對應了6個函數。

Full results are online but please mind: These results are very specific to my environment. Also mind that the code actually works on a copy of my function table so the table mask might be different, which changes results. Also note that the given PHP code won't work for you as I added special functions exporting the table mask and hash function to mz version of PHP. And then yet another note: doing performance optimizations on this level is the by far worst place as to many unknown factors go in. And you don't have any measurable performance win. Mind readability. Clear code is worth way more than the 2 CPU cycles you probably gain! But you may have learned how hash tables work.

全部的結果在這裏,不過請注意,這個結果是在我的環境裏測試的,而且即使是我這個版本的代碼拷貝,也會因爲表掩碼不同而導致結果不同。另外要注意的是,這些代碼在你的環境中是無法工作的,因爲我在PHP源碼中添加了一些表掩碼和哈希函數相關的代碼。另外一個提示:由於各種未知的因素,想在這個地方進行優化並不合適,你很難獲得明顯的性能提升。清晰可讀的代碼遠比兩個CPU週期重要,不過通過它你可以學習哈希表是如何工作的。

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章