JDK Source Code: Miscellaneous Notes on Hash

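My first real encounter with hashing was a table-sharding job. The company's user table had grown to tens of millions of rows and queries had become slow, so the table needed to be split. The system already had some sharded data, handled in a simple way without any middleware: hash the merchant ID (a 32-character string), take the result modulo the number of tables to get the table number, then prepend a prefix to get the final table name.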
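The routing rule described above can be sketched in a few lines. The prefix `user_` and the shard count of 16 are assumptions made for illustration only; the post does not give the real values.

```java
// Hypothetical sketch: route a merchant ID to a sharded table name.
class ShardNameDemo {
    // Assumed values; neither the prefix nor the shard count comes from the post.
    static final String TABLE_PREFIX = "user_";
    static final int SHARD_COUNT = 16;

    static String tableNameFor(String merchantId) {
        // String.hashCode() may be negative, so floorMod is used instead of %
        // to keep the index within [0, SHARD_COUNT).
        int index = Math.floorMod(merchantId.hashCode(), SHARD_COUNT);
        return TABLE_PREFIX + index;
    }

    public static void main(String[] args) {
        String merchantId = "0123456789abcdef0123456789abcdef"; // a 32-character ID
        System.out.println(merchantId + " -> " + tableNameFor(merchantId));
    }
}
```

The same ID always maps to the same table, which is what makes this scheme usable for both writes and reads.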

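Recently, while studying ZooKeeper distributed locks, I noticed that the fix for the herd effect in one of the implementations follows an idea similar to consistent hashing. So I read up on hashing and wrote a simple implementation in Java.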

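Introduction to Hashing

A hash algorithm maps binary data of arbitrary length to a shorter, fixed-length binary value, called the hash value. A hash value is an extremely compact numeric representation of a piece of data. It is not strictly unique, because infinitely many inputs map to a finite set of outputs, but with a good hash function, changing even a single letter of the plaintext yields a different hash value with overwhelming probability. For a cryptographic hash function, finding two different inputs that hash to the same value is computationally infeasible, which is why the hash of a piece of data can be used to verify its integrity. Hashing is commonly used for fast lookup and in cryptography.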
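To make the "change one letter, get a different hash" property concrete, here is a minimal sketch using SHA-256 from the JDK's `MessageDigest`. The class name and the sample strings are illustrative choices, not from the original post.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch: a one-letter change produces a completely different digest.
class DigestDemo {
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The two inputs differ only in their last letter.
        System.out.println(sha256Hex("hello world"));
        System.out.println(sha256Hex("hello worle"));
    }
}
```

Whatever the input length, the digest is always 256 bits (64 hex characters).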

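A Simple Hash Application

Taking a scenario like the one at the start of this post, let's simulate it in code based on these properties of hashing.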

import java.util.HashMap;
import java.util.Map;

/**
 * A table; this is where the data is actually stored.
 * Created by childe on 2017/5/14.
 */
public class Table {
    private final String name;
    private final Map<String, Merchant> merchantMap;

    Table(String name) {
        this.name = name;
        merchantMap = new HashMap<>();
    }

    public void insert(Merchant merchant) {
        merchantMap.put(merchant.getId(), merchant);
    }

    public Merchant select(String id) {
        return merchantMap.get(id);
    }
}

/**
 * A merchant.
 * Created by childe on 2017/5/14.
 */
public class Merchant {
    private String id;

    public Merchant(String id) {
        this.id = id;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }
}

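/**
 * Table router.
 * Created by childe on 2017/5/14.
 */
public class TableRoute {
    private static final int TABLE_SIZE_MAX = 512;

    private final Table[] tables = new Table[TABLE_SIZE_MAX];

    private int size = 0;

    public void insert(Merchant merchant) {
        // The merchant ID is the key, so it must not be null or empty.
        // With no tables registered there is nowhere to route to.
        if (merchant == null || isEmpty(merchant.getId()) || size == 0) {
            return;
        }
        tables[indexFor(merchant.getId())].insert(merchant);
    }

    public Merchant select(String id) {
        if (isEmpty(id) || size == 0) {
            return null;
        }
        return tables[indexFor(id)].select(id);
    }

    public void addTable(Table table) {
        if (table == null || size >= TABLE_SIZE_MAX) {
            return;
        }
        tables[size++] = table;
    }

    // hashCode() can be negative, so floorMod keeps the index within [0, size).
    private int indexFor(String id) {
        return Math.floorMod(id.hashCode(), size);
    }

    // Stands in for StringUtils.isEmpty so the class needs no extra library.
    private static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }
}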

/**
 * Created by childe on 2017/5/14.
 */
public class Main {

    static int tableNum = 3;
    static int merchantNum = 100;

    public static void main(String[] args) {
        // Initialize the tables
        TableRoute tableRoute = createTableRoute(tableNum);

        // Insert data
        for (int i = 0; i < merchantNum; i++) {
            Merchant merchant = new Merchant(String.valueOf(i));
            tableRoute.insert(merchant);
        }

        // Count the records that can still be found
        validCount(tableRoute);

        // Add one more table
        tableRoute.addTable(new Table("merchant_100"));

        System.out.println("after add a table");

        // Count the records that can still be found
        validCount(tableRoute);
    }

    private static void validCount(TableRoute tableRoute) {
        int validNum = 0;
        // Look up every merchant that was inserted
        for (int i = 0; i < merchantNum; i++) {
            Merchant merchant = tableRoute.select(String.valueOf(i));
            if (merchant != null) {
                validNum++;
            }
        }

        System.out.println("valid merchant : " + validNum + ", total merchant : " + merchantNum);
    }

    public static TableRoute createTableRoute(int tableNum) {
        TableRoute tableRoute = new TableRoute();
        for (int i = 0; i < tableNum; i++) {
            tableRoute.addTable(new Table("merchant_" + i));
        }
        return tableRoute;
    }
}

The code above simulates inserting into and looking up from sharded tables. The final output is:

valid merchant : 100, total merchant : 100
after add a table
valid merchant : 24, total merchant : 100

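Main counts the valid records twice, and the two results differ considerably. Why does adding a single table invalidate so much data? The reason is simple: the index is the hash modulo the number of tables, so once the table count changes from 3 to 4, most keys map to a different table than the one they were stored in, and lookups miss them.

Back to the sharding example from the beginning: if the merchant data keeps growing quickly, the merchants' user data grows even faster (merchant : user = 1 : n). When a shard once again reaches tens of millions of rows, it may have to be split again, and that means migrating data; otherwise we end up in the situation simulated above. Judging from the experiment, the proportion of invalidated data is high, so the migration is painful. In practice we should forecast business growth to decide the initial number of shards, but large-scale data migration is still hard to avoid.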
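The size of the loss can be estimated without any tables at all. The sketch below is an illustration that assumes the hash values are simply the integers 0, 1, 2, ... (a stand-in for evenly spread hash codes); it counts how many keys keep the same index when the modulus changes.

```java
// Illustrative sketch: how many keys keep their table when the table count changes.
// Assumption: the keys' hash values are the integers 0..keyCount-1.
class ModuloRemapDemo {
    static int countUnmoved(int keyCount, int oldTables, int newTables) {
        int unmoved = 0;
        for (int hash = 0; hash < keyCount; hash++) {
            // A key is still found only if both moduli give the same index.
            if (hash % oldTables == hash % newTables) {
                unmoved++;
            }
        }
        return unmoved;
    }

    public static void main(String[] args) {
        int keys = 1200;
        System.out.println(countUnmoved(keys, 3, 4) + " of " + keys + " keys keep their table");
    }
}
```

Going from 3 to 4 tables, an index survives only when the hash modulo 12 is 0, 1 or 2, so about a quarter of the keys stay put. That is in line with the 24 out of 100 seen in the simulation.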

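Consistent Hashing

As we saw above, once the number of tables increases, a large share of the data becomes unreachable and has to be migrated, which is hard to accept.

Other scenarios in applications are similar, caching for example (assuming cache entries are placed in the way described above). A cache exists to take load off the backend services. If one cache machine goes down, or we need to add one, the backend suddenly faces the load of a large number of cache misses, which can even cause an avalanche.

Consistent hashing solves this problem well. So what is consistent hashing?
Consistent hashing organizes the whole hash value space into a virtual ring. Everything placed on the ring, both data keys and storage nodes, gets its position from the same hash algorithm. Once a key lands on the ring, it is stored on the first storage node found moving clockwise from its position. Because there may be only a few storage nodes, data can end up unevenly distributed among them, so virtual storage nodes are introduced. For example, suppose machines A and B provide storage, and we compute each machine's hash from its IP as usual. If the hash values of A and B are close together, the data skews toward one of them. To keep the distribution as even as possible, we add one more layer of mapping and place 4 storage nodes (A#0, A#1, B#0, B#1), or more, on the ring; the more virtual nodes there are, the more even the distribution tends to be.

Let's simulate a simple implementation of consistent hashing: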

import java.util.HashMap;
import java.util.Map;

/**
 * Simulates a cache machine.
 * Created by childe on 2017/5/14.
 */
public class Server {
    private final String name;
    private final Map<String, Entry> entries;

    Server(String name) {
        this.name = name;
        entries = new HashMap<>();
    }

    public void put(Entry e) {
        entries.put(e.getKey(), e);
    }

    public Entry get(String key) {
        return entries.get(key);
    }

    // The server's position on the ring is derived from its name.
    @Override
    public int hashCode() {
        return name.hashCode();
    }

}

import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Cache cluster.
 * Created by childe on 2017/5/14.
 */
public class Cluster {
    private static final int SERVER_SIZE_MAX = 1024;

    // The ring: server positions (hash values) in ascending order.
    private final SortedMap<Integer, Server> servers = new TreeMap<>();

    public void put(Entry e) {
        Server server = routeServer(e.getKey().hashCode());
        if (server != null) {
            server.put(e);
        }
    }

    public Entry get(String key) {
        Server server = routeServer(key.hashCode());
        return server == null ? null : server.get(key);
    }

    private Server routeServer(int hash) {
        if (servers.isEmpty()) {
            return null;
        }

        // Walk clockwise to the nearest slot (server) at or after this hash.
        // tailMap includes the hash itself; past the largest position we
        // wrap around to the first server on the ring.
        SortedMap<Integer, Server> tailMap = servers.tailMap(hash);
        int slot = tailMap.isEmpty() ? servers.firstKey() : tailMap.firstKey();
        return servers.get(slot);
    }

    public boolean addServer(Server s) {
        if (s == null || servers.size() >= SERVER_SIZE_MAX) {
            return false;
        }

        servers.put(s.hashCode(), s);
        return true;
    }
}

/**
 * A cache entry.
 * Created by childe on 2017/5/14.
 */
public class Entry {
    private String key;

    Entry(String key) {
        this.key = key;
    }

    public String getKey() {
        return key;
    }

    public void setKey(String key) {
        this.key = key;
    }
}

/**
 * Created by childe on 2017/5/5.
 */
public class Main {

    static int entryNum = 100;

    public static void main(String[] args) {
        // Create the cache cluster
        Cluster cluster = createCluster();

        // Write the cache entries
        for (int i = 0; i < entryNum; i++) {
            cluster.put(new Entry(String.valueOf(i)));
        }

        // Count the entries that can still be found
        validCount(cluster);

        // Add a cache node
        cluster.addServer(new Server("C"));

        System.out.println("after add a server");

        // Count the entries that can still be found
        validCount(cluster);

    }

    private static Cluster createCluster() {
        Cluster c = new Cluster();
        c.addServer(new Server("A#1"));
        c.addServer(new Server("A#2"));
        c.addServer(new Server("B#1"));
        c.addServer(new Server("B#2"));
        return c;
    }

    private static void validCount(Cluster cluster) {
        int validNum = 0;
        for (int i = 0; i < entryNum; i++) {
            Entry entry = cluster.get(String.valueOf(i));
            if (entry != null) {
                validNum++;
            }
        }
        System.out.println("valid entry : " + validNum + ", total entry : " + entryNum);
    }
}

Output:
valid entry : 100, total entry : 100
after add a server
valid entry : 90, total entry : 100

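The output shows that the invalidation rate is clearly lower. As far as I know, Memcached setups also rely on consistent hashing, usually implemented on the client side.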
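In the simulation, A#1, A#2, B#1 and B#2 are registered as independent servers. A minimal sketch of how virtual nodes could instead map back to their physical machines follows. The class name, the replica count and the choice of MD5 as the ring hash are all assumptions for illustration, not part of the original code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a hash ring whose virtual nodes point at physical machines.
class VirtualNodeRing {
    // Ring position -> physical machine name.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    VirtualNodeRing(int replicas) {
        this.replicas = replicas;
    }

    // Places `replicas` virtual nodes (machine#0, machine#1, ...) on the ring.
    void addMachine(String machine) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(machine + "#" + i), machine);
        }
    }

    // Finds the first virtual node clockwise from the key, wrapping around.
    String machineFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no machines on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Assumed ring hash: the first 8 bytes of MD5, which spreads short,
    // similar strings far better than String.hashCode().
    static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }

    public static void main(String[] args) {
        VirtualNodeRing ring = new VirtualNodeRing(100);
        ring.addMachine("A");
        ring.addMachine("B");
        System.out.println("key-42 -> " + ring.machineFor("key-42"));
    }
}
```

When a machine is added to such a ring, a key either stays where it was or moves to the new machine; it never moves between two existing machines.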

HashMap

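The HashMap we use all the time in the JDK is also built on hashing. Before JDK 1.8 it organized its data with an array plus linked lists; 1.8 introduced red-black trees to optimize the linked-list part. Why does HashMap need linked lists and red-black trees? The hash code of a key has to be mapped to a concrete bucket, and at any given moment the number of buckets is finite, so different keys inevitably land in the same bucket. A linked list chains those entries together, which is also why HashMap lookups slow down when collisions are severe. JDK 1.8 handles collisions with linked lists plus red-black trees: when a bucket's list grows beyond 8 nodes (and the table has at least 64 buckets; a smaller table is resized instead), the list is converted into a red-black tree to speed up lookups.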
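How two different keys end up in the same bucket can be shown directly. The sketch below mirrors the hash-spreading step and index calculation used by JDK 8's HashMap (`h ^ (h >>> 16)` followed by `& (capacity - 1)`); the class and method names are made up for this example.

```java
// Illustrative sketch of how JDK 8's HashMap picks a bucket for a key.
class BucketIndexDemo {
    // Mixes the high 16 bits into the low 16 bits, as HashMap.hash() does.
    static int spread(Object key) {
        if (key == null) {
            return 0;
        }
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    // Capacity is always a power of two, so a bit mask replaces the modulo.
    static int bucketIndex(Object key, int capacity) {
        return spread(key) & (capacity - 1);
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are different strings with the same hashCode (2112).
        System.out.println("Aa -> bucket " + bucketIndex("Aa", 16));
        System.out.println("BB -> bucket " + bucketIndex("BB", 16));
    }
}
```

Both keys land in the same bucket, so the map has to chain them and fall back on `equals` to tell them apart.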

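JDK 1.8 also optimized HashMap resizing. Before 1.8, a resize recomputed the bucket index of every entry (and in some cases the hash itself) and re-inserted the entries one by one, which made it an expensive operation. Because the capacity is always a power of two and doubles on each resize, 1.8 only needs to test one extra bit of the stored hash: an entry either stays at its old index or moves to old index + old capacity. This avoids recomputing the hash and speeds up resizing.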
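The one-bit test is easy to check in isolation. This is a small sketch of the idea rather than the JDK's actual resize code; the names are hypothetical.

```java
// Illustrative sketch of the JDK 1.8 resize rule: when capacity doubles,
// an entry stays at its index or moves exactly oldCapacity slots up.
class ResizeSplitDemo {
    static int indexAfterResize(int hash, int oldCapacity) {
        int oldIndex = hash & (oldCapacity - 1);
        // oldCapacity is a power of two, so this tests the single new bit
        // that the doubled table's mask takes into account.
        return (hash & oldCapacity) == 0 ? oldIndex : oldIndex + oldCapacity;
    }

    public static void main(String[] args) {
        int oldCapacity = 16;
        int[] hashes = {5, 21, 37, 53};
        for (int hash : hashes) {
            int direct = hash & (2 * oldCapacity - 1); // index computed from scratch
            System.out.println("hash " + hash + ": " + indexAfterResize(hash, oldCapacity)
                    + " (direct: " + direct + ")");
        }
    }
}
```

The shortcut always agrees with computing the index from scratch against the doubled capacity.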

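Whether you are on JDK 1.7 or 1.8, it is best to estimate the capacity you need when using a HashMap, so that resizes are avoided as far as possible. For a deeper look at the HashMap optimizations in JDK 1.8, see the Meituan-Dianping tech team's blog post on the subject.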
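One way to act on this advice is to derive the initial capacity from the expected number of entries and the default load factor of 0.75. The helper below is a sketch with a made-up name, not a JDK method.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: size a HashMap up front so that inserting the
// expected number of entries does not trigger a resize.
class PresizeDemo {
    static final float DEFAULT_LOAD_FACTOR = 0.75f; // HashMap's default

    // A resize happens once size exceeds capacity * loadFactor, so the
    // capacity must be at least expectedEntries / loadFactor.
    static int initialCapacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / (double) DEFAULT_LOAD_FACTOR);
    }

    public static void main(String[] args) {
        int expected = 100;
        Map<String, Integer> counts = new HashMap<>(initialCapacityFor(expected));
        for (int i = 0; i < expected; i++) {
            counts.put("key-" + i, i);
        }
        System.out.println("stored " + counts.size()
                + " entries with initial capacity " + initialCapacityFor(expected));
    }
}
```

HashMap rounds the requested capacity up to the next power of two internally. On Java 19 and later, `HashMap.newHashMap(int numMappings)` performs this calculation for you.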
