DJBX33A (Daniel J. Bernstein, Times 33 with Addition): the default hash algorithm in APR

A classic is something that has stood the test of time.

APR_DECLARE_NONSTD(unsigned int) apr_hashfunc_default(const char *char_key,
                                                      apr_ssize_t *klen)
{
    unsigned int hash = 0;
    const unsigned char *key = (const unsigned char *)char_key;
    const unsigned char *p;
    apr_ssize_t i;

    /*
     * This is the popular `times 33' hash algorithm which is used by
     * perl and also appears in Berkeley DB. This is one of the best
     * known hash functions for strings because it is both computed
     * very fast and distributes very well.
     *
     * The originator may be Dan Bernstein but the code in Berkeley DB
     * cites Chris Torek as the source. The best citation I have found
     * is "Chris Torek, Hash function for text in C, Usenet message
     * <27038@mimsy.umd.edu> in comp.lang.c , October, 1990." in Rich
     * Salz's USENIX 1992 paper about INN which can be found at
     * <http://citeseer.nj.nec.com/salz92internetnews.html>.
     *
     * The magic of number 33, i.e. why it works better than many other
     * constants, prime or not, has never been adequately explained by
     * anyone. So I try an explanation: if one experimentally tests all
     * multipliers between 1 and 256 (as I did while writing a low-level
     * data structure library some time ago) one detects that even
     * numbers are not useable at all. The remaining 128 odd numbers
     * (except for the number 1) work more or less all equally well.
     * They all distribute in an acceptable way and this way fill a hash
     * table with an average percent of approx. 86%.
     *
     * If one compares the chi^2 values of the variants (see
     * Bob Jenkins ``Hashing Frequently Asked Questions'' at
     * http://burtleburtle.net/bob/hash/hashfaq.html for a description
     * of chi^2), the number 33 not even has the best value. But the
     * number 33 and a few other equally good numbers like 17, 31, 63,
     * 127 and 129 have nevertheless a great advantage to the remaining
     * numbers in the large set of possible multipliers: their multiply
     * operation can be replaced by a faster operation based on just one
     * shift plus either a single addition or subtraction operation. And
     * because a hash function has to both distribute good _and_ has to
     * be very fast to compute, those few numbers should be preferred.
     *
     * -- Ralf S. Engelschall <rse@engelschall.com>
     */

    if (*klen == APR_HASH_KEY_STRING) {
        for (p = key; *p; p++) {
            hash = hash * 33 + *p;
        }
        *klen = p - key;
    }
    else {
        for (p = key, i = *klen; i; i--, p++) {
            hash = hash * 33 + *p;
        }
    }

    return hash;
}
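
To try the function without linking against APR, a minimal standalone sketch may help. The names djbx33a and KEY_STRING are made up for this example, ptrdiff_t stands in for apr_ssize_t, and the value -1 for KEY_STRING is an assumption that mirrors APR_HASH_KEY_STRING. As in the function above, passing the sentinel hashes up to the terminating NUL and writes the measured length back through klen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for APR_HASH_KEY_STRING (assumed to be -1). */
#define KEY_STRING ((ptrdiff_t)-1)

/* Standalone sketch of the times-33 hash; not the APR function itself. */
static unsigned int djbx33a(const char *char_key, ptrdiff_t *klen)
{
    const unsigned char *key = (const unsigned char *)char_key;
    const unsigned char *p;
    unsigned int hash = 0;
    ptrdiff_t i;

    assert(char_key != NULL && klen != NULL);

    if (*klen == KEY_STRING) {
        /* NUL-terminated key: hash until '\0', then report the length. */
        for (p = key; *p; p++) {
            hash = hash * 33 + *p;
        }
        *klen = p - key;
    }
    else {
        /* Counted key: hash exactly *klen bytes, embedded NULs included. */
        for (p = key, i = *klen; i; i--, p++) {
            hash = hash * 33 + *p;
        }
    }
    return hash;
}
```

On an ASCII system, hashing "ab" with klen set to KEY_STRING gives 97 * 33 + 98 = 3299 and leaves 2 in klen; hashing the same two bytes with klen set to 2 gives the same value.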


A translation of the comment in the function:
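This is the well-known "times 33" hash algorithm, which is used by Perl and also appears in Berkeley DB. It is one of the best known hash functions for string keys because it is both very fast to compute and distributes very well. The algorithm may have originated with Dan Bernstein, but the code in Berkeley DB cites Chris Torek as its source. The best citation I have found is "Chris Torek, Hash function for text in C, Usenet message <27038@mimsy.umd.edu> in comp.lang.c, October, 1990.", which appears in Rich Salz's USENIX 1992 paper about INN. That paper can be found at <http://citeseer.nj.nec.com/salz92internetnews.html>.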

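The magic of the number 33, that is, why it works better than many other constants, prime or not, has never been adequately explained by anyone. So let me try an explanation. If one experimentally tests every multiplier between 1 and 256 (as I did some time ago while writing a low-level data structure library), one finds that even numbers are not usable at all. The remaining 128 odd numbers (except for 1) all work about equally well: each distributes acceptably and fills a hash table to about 86% on average.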
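
Why even multipliers fail can be checked with a small experiment. Each multiplication by an even number pushes the bits of earlier characters upward, so the low bits that select a bucket end up depending only on the last few characters of the key. The sketch below is an illustration under assumed conditions, not the experiment described in the comment: the helper names, the "key%d" test keys, the 1024-bucket table and the masking with the low bits are all choices made for this example, so it will not reproduce the 86% figure.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NBUCKETS 1024u  /* power of two, chosen for this illustration */
#define NKEYS    10000

/* Times-M hash with an arbitrary multiplier (hypothetical helper). */
static unsigned int hash_mult(const char *s, unsigned int mult)
{
    unsigned int hash = 0;

    assert(s != NULL);
    for (; *s; s++) {
        hash = hash * mult + (unsigned char)*s;
    }
    return hash;
}

/* Hash "key0" .. "key9999" and count how many buckets receive a key. */
static unsigned int filled_buckets(unsigned int mult)
{
    static unsigned char used[NBUCKETS];
    char key[32];
    unsigned int filled = 0;
    int i;

    memset(used, 0, sizeof(used));
    for (i = 0; i < NKEYS; i++) {
        unsigned int b;

        snprintf(key, sizeof(key), "key%d", i);
        b = hash_mult(key, mult) & (NBUCKETS - 1);  /* keep the low 10 bits */
        if (!used[b]) {
            used[b] = 1;
            filled++;
        }
    }
    return filled;
}
```

With multiplier 32 and 1024 buckets, 32 * 32 is already a multiple of 1024, so the bucket depends only on the last two characters of the key. These test keys end in at most 110 distinct two-character combinations, so filled_buckets(32) can never exceed 110, while filled_buckets(33) lets every character influence the bucket and reaches far more.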

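If one compares the chi^2 values of the variants (gibbon: chi^2, the chi-squared statistic, measures how far the observed bucket counts deviate from the counts an even distribution would give; see the description of chi^2 in Bob Jenkins' "Hashing Frequently Asked Questions" at http://burtleburtle.net/bob/hash/hashfaq.html), the number 33 does not even have the best value. (gibbon: the comment does not say how the chi^2 values were computed. As I understand it, a smaller chi^2 means bucket counts closer to uniform, so "best" here presumably means smallest, but I cannot tell for certain.) But 33 and a few other equally good numbers such as 17, 31, 63, 127 and 129 still have a great advantage over the rest of the large set of possible multipliers: their multiplication can be replaced by a faster operation, namely one shift plus a single addition or subtraction. Because a hash function has to both distribute well and be very fast to compute, those few numbers should be preferred.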