HanLP: Chinese Characters to Pinyin, Simplified/Traditional Conversion (Java)


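When converting Chinese characters to pinyin, HanLP can resolve polyphonic characters and output the tone, initial (聲母), and final (韻母) of each syllable. It does this with a model trained from a corpus.
The code in this article is from HanLP-1.7.5, the version that accompanies the book 《自然語言處理入門》 (Introduction to Natural Language Processing).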

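In HanLP, both pinyin conversion and Simplified/Traditional Chinese conversion are built on the double-array trie (Double-array Trie) and on the Aho-Corasick DoubleArrayTrie algorithm (ACDAT, an AC automaton built on a double-array trie). It helps to be familiar with both before reading on.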

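Converting 重載不是重任 to pinyin produces the output below. The labels are, in order: original text, pinyin with numeric tones, pinyin with tone marks, pinyin without tones, tone, initial (聲母), final (韻母), and input-method head.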

原文:重載不是重任
拼音(數字音調):chong2,zai3,bu2,shi4,zhong4,ren4,
拼音(符號音調):chóng,zǎi,bú,shì,zhòng,rèn,
拼音(無音調):chong,zai,bu,shi,zhong,ren,
聲調:2,3,2,4,4,4,
聲母:ch,z,b,sh,zh,r,
韻母:ong,ai,u,i,ong,en,
輸入法頭:ch,z,b,sh,zh,r,

Corpus

pinyin.txt

一丁點兒=yi1,ding1,dian3,er5
一不小心=yi1,bu4,xiao3,xin1
一丘之貉=yi1,qiu1,zhi1,he2
一絲不差=yi4,si1,bu4,cha1
一絲不苟=yi1,si1,bu4,gou3
一個=yi1,ge4
一個半個=yi1,ge4,ban4,ge4
一個巴掌拍不響=yi1,ge4,ba1,zhang3,pai1,bu4,xiang3
一個蘿蔔一個坑=yi1,ge4,luo2,bo5,yi1,ge4,keng1
一舉兩得=yi1,ju3,liang3,de2
一之爲甚=yi1,zhi1,wei2,shen4


Training the model

Training produces pinyin.txt.bin.

Loading the corpus

HanLP-1.7.5\src\main\java\com\hankcs\hanlp\corpus\dictionary\SimpleDictionary.java
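The corpus is read line by line; each line is split on = and the entry is put into the dictionary trie.
For each syllable on the right of =, Pinyin.valueOf("yi1") returns the enum constant, which carries the initial, the final, the tone, the string form with a tone mark, and the string form without a tone.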

public enum Pinyin
{
    a1(Shengmu.none, Yunmu.a, 1, "ā", "a", Head.a, 'a'),
    a2(Shengmu.none, Yunmu.a, 2, "á", "a", Head.a, 'a'),
    a3(Shengmu.none, Yunmu.a, 3, "ǎ", "a", Head.a, 'a'),
    a4(Shengmu.none, Yunmu.a, 4, "à", "a", Head.a, 'a'),
    a5(Shengmu.none, Yunmu.a, 5, "a", "a", Head.a, 'a'),
    ai1(Shengmu.none, Yunmu.ai, 1, "āi", "ai", Head.a, 'a'),
    ai2(Shengmu.none, Yunmu.ai, 2, "ái", "ai", Head.a, 'a'),
    ai3(Shengmu.none, Yunmu.ai, 3, "ǎi", "ai", Head.a, 'a'),
    ai4(Shengmu.none, Yunmu.ai, 4, "ài", "ai", Head.a, 'a'),
    ......
}
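
To make the loading step concrete, here is a minimal sketch of parsing corpus lines in the `word=py1,py2,...` format shown earlier. It is not HanLP's code: `CorpusLineSketch`, `MiniPinyin`, `addLine`, and `demo` are hypothetical names, and the tiny enum stands in for the real `Pinyin` enum with only three constants.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class CorpusLineSketch {
    // Hypothetical stand-in for HanLP's Pinyin enum, with only three constants.
    enum MiniPinyin {
        yi1("y", "i", 1),
        ge4("g", "e", 4),
        ban4("b", "an", 4);

        final String shengmu; // initial
        final String yunmu;   // final
        final int tone;

        MiniPinyin(String shengmu, String yunmu, int tone) {
            this.shengmu = shengmu;
            this.yunmu = yunmu;
            this.tone = tone;
        }
    }

    // Split "word=py1,py2,..." on '=' and ',' and resolve each syllable by enum name.
    static void addLine(String line, Map<String, MiniPinyin[]> map) {
        int eq = line.indexOf('=');
        if (eq < 0) return; // skip lines without a separator
        String word = line.substring(0, eq);
        String[] parts = line.substring(eq + 1).split(",");
        MiniPinyin[] value = new MiniPinyin[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // valueOf throws IllegalArgumentException for a name that is not a constant
            value[i] = MiniPinyin.valueOf(parts[i]);
        }
        map.put(word, value);
    }

    // Load two lines taken from the corpus excerpt into a sorted map.
    static Map<String, MiniPinyin[]> demo() {
        Map<String, MiniPinyin[]> map = new TreeMap<>();
        addLine("一個=yi1,ge4", map);
        addLine("一個半個=yi1,ge4,ban4,ge4", map);
        return map;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, MiniPinyin[]> e : demo().entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

A sorted map is used here because the `build` method shown in the training step takes a `TreeMap`.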

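Training the model

Build the map into a double-array trie with `trie.build(map)`. For background, see the companion HanLP article on how the double-array trie (Double-array Trie) is implemented, which walks through the code with diagrams.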

public void build(TreeMap<String, V> map)
{
    // Keep the values
    v = (V[]) map.values().toArray();
    l = new int[v.length];
    Set<String> keySet = map.keySet();
    // Build the binary-search trie
    addAllKeyword(keySet);
    // Build the double-array trie on top of the binary-search trie
    buildDoubleArrayTrie(keySet);
    used = null;
    // Build the failure table and merge the output table
    constructFailureStates();
    rootState = null;
    loseWeight();
}
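
The trie step and the failure-table step of `build` can be sketched without the double-array compression. This is a minimal illustration under stated assumptions, not HanLP's implementation: `AcSketch`, its `Node` type, `demo`, and the `begin:word` hit format are made up for this example, and children live in a `TreeMap` instead of base/check arrays.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AcSketch {
    // One trie node: children by character, failure link, keywords ending here.
    static final class Node {
        final Map<Character, Node> next = new TreeMap<>();
        Node fail;
        final List<String> output = new ArrayList<>();
    }

    final Node root = new Node();

    // Step 1: insert a keyword into the plain trie.
    void addKeyword(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.output.add(word);
    }

    // Step 2: breadth-first pass that sets failure links and merges outputs.
    void constructFailureStates() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> e : current.next.entrySet()) {
                Node child = e.getValue();
                Node f = current.fail;
                // Follow failure links until some state accepts this character.
                while (f != null && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(e.getKey());
                // Keywords that end at the failure target also end here.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    // Scan the text once and report each occurrence as "begin:word".
    List<String> parseText(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String word : node.output) {
                hits.add((i + 1 - word.length()) + ":" + word);
            }
        }
        return hits;
    }

    // Classic textbook keyword set, chosen because it exercises the merged outputs.
    static List<String> demo() {
        AcSketch ac = new AcSketch();
        for (String w : new String[]{"he", "she", "his", "hers"}) {
            ac.addKeyword(w);
        }
        ac.constructFailureStates();
        return ac.parseText("ushers");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because outputs are merged along failure links, reaching the state for "she" also reports "he" without rescanning the text. That is the property that lets a single pass find every dictionary word in a sentence.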

Saving the model

The call saveDat(path, trie, map.entrySet()); writes the model file.

static boolean saveDat(String path, AhoCorasickDoubleArrayTrie<Pinyin[]> trie, Set<Map.Entry<String, Pinyin[]>> entrySet)
{
    try
    {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT)));
        out.writeInt(entrySet.size());
        for (Map.Entry<String, Pinyin[]> entry : entrySet)
        {
            Pinyin[] value = entry.getValue();
            out.writeInt(value.length);
            for (Pinyin pinyin : value)
            {
                out.writeInt(pinyin.ordinal());
            }
        }
        trie.save(out);
        out.close();
    }
    catch (Exception e)
    {
        logger.warning("緩存值dat" + path + "失敗");
        return false;
    }
    return true;
}

// AhoCorasickDoubleArrayTrie.save, called above as trie.save(out)
/**
 * Persist the trie.
 *
 * @param out a DataOutputStream
 * @throws Exception possible I/O errors, etc.
 */
public void save(DataOutputStream out) throws Exception
{
    out.writeInt(size);
    for (int i = 0; i < size; i++)
    {
        out.writeInt(base[i]);
        out.writeInt(check[i]);
        out.writeInt(fail[i]);
        int output[] = this.output[i];
        if (output == null)
        {
            out.writeInt(0);
        }
        else
        {
            out.writeInt(output.length);
            for (int o : output)
            {
                out.writeInt(o);
            }
        }
    }
    out.writeInt(l.length);
    for (int length : l)
    {
        out.writeInt(length);
    }
}

Prediction

Loading the model

// path = data/dictionary/pinyin/pinyin.txt
static boolean loadDat(String path)
{
    ByteArray byteArray = ByteArray.createByteArray(path + Predefine.BIN_EXT);
    if (byteArray == null) return false;
    int size = byteArray.nextInt();
    Pinyin[][] valueArray = new Pinyin[size][];
    for (int i = 0; i < valueArray.length; ++i)
    {
        int length = byteArray.nextInt();
        valueArray[i] = new Pinyin[length];
        for (int j = 0; j < length; ++j)
        {
            valueArray[i][j] = pinyins[byteArray.nextInt()];
        }
    }
    if (!trie.load(byteArray, valueArray)) return false;
    return true;
}

public boolean load(ByteArray byteArray, V[] value)
{
    if (byteArray == null) return false;
    size = byteArray.nextInt();
    base = new int[size + 65535];   // leave extra room to guard against out-of-bounds access
    check = new int[size + 65535];
    fail = new int[size + 65535];
    output = new int[size + 65535][];
    int length;
    for (int i = 0; i < size; ++i)
    {
        base[i] = byteArray.nextInt();
        check[i] = byteArray.nextInt();
        fail[i] = byteArray.nextInt();
        length = byteArray.nextInt();
        if (length == 0) continue;
        output[i] = new int[length];
        for (int j = 0; j < output[i].length; ++j)
        {
            output[i][j] = byteArray.nextInt();
        }
    }
    length = byteArray.nextInt();
    l = new int[length];
    for (int i = 0; i < l.length; ++i)
    {
        l[i] = byteArray.nextInt();
    }
    v = value;
    return true;
}
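
The value section that saveDat writes and loadDat reads back (a count, then a length plus enum ordinals per entry) can be tried in isolation. The sketch below round-trips it through an in-memory buffer. `OrdinalRoundTripSketch`, `MiniPinyin`, `sample`, and the `save`/`load` helpers are hypothetical names for illustration, and the trie arrays (base, check, fail, output) are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class OrdinalRoundTripSketch {
    // Hypothetical stand-in for the Pinyin enum.
    enum MiniPinyin { yi1, ge4, ban4 }

    // Write: entry count, then for each entry its length and the ordinals.
    static byte[] save(MiniPinyin[][] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (MiniPinyin[] value : values) {
                out.writeInt(value.length);
                for (MiniPinyin p : value) {
                    out.writeInt(p.ordinal());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the same layout back, mapping each ordinal to its enum constant.
    static MiniPinyin[][] load(byte[] data) {
        MiniPinyin[] all = MiniPinyin.values();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            MiniPinyin[][] values = new MiniPinyin[in.readInt()][];
            for (int i = 0; i < values.length; i++) {
                values[i] = new MiniPinyin[in.readInt()];
                for (int j = 0; j < values[i].length; j++) {
                    values[i][j] = all[in.readInt()];
                }
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Two sample entries: a two-syllable and a four-syllable value.
    static MiniPinyin[][] sample() {
        return new MiniPinyin[][]{
            {MiniPinyin.yi1, MiniPinyin.ge4},
            {MiniPinyin.yi1, MiniPinyin.ge4, MiniPinyin.ban4, MiniPinyin.ge4}
        };
    }

    public static void main(String[] args) {
        byte[] data = save(sample());
        System.out.println(data.length + " bytes, round trip ok: "
                + Arrays.deepEquals(sample(), load(data)));
    }
}
```

One consequence of storing ordinals: the file only decodes correctly against an enum whose constants are declared in the same order as when it was written, so a .bin file and the enum that produced it need to stay in sync.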

Computation

The pinyin for the input text is looked up with the Aho-Corasick DoubleArrayTrie algorithm (ACDAT), an AC automaton built on a double-array trie.

// HanLP-1.7.5\src\main\java\com\hankcs\hanlp\dictionary\py\PinyinDictionary.java
protected static List<Pinyin> segLongest(char[] charArray, AhoCorasickDoubleArrayTrie<Pinyin[]> trie, boolean remainNone)
{
    final Pinyin[][] wordNet = new Pinyin[charArray.length][];
    trie.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit<Pinyin[]>()
    {
        @Override
        public void hit(int begin, int end, Pinyin[] value)
        {
            int length = end - begin;
            if (wordNet[begin] == null || length > wordNet[begin].length)
            {
                wordNet[begin] = length == 1 ? new Pinyin[]{value[0]} : value;
            }
        }
    });
    List<Pinyin> pinyinList = new ArrayList<Pinyin>(charArray.length);
    for (int offset = 0; offset < wordNet.length; )
    {
        if (wordNet[offset] == null)
        {
            if (remainNone)
            {
                pinyinList.add(Pinyin.none5);
            }
            ++offset;
            continue;
        }
        for (Pinyin pinyin : wordNet[offset])
        {
            pinyinList.add(pinyin);
        }
        offset += wordNet[offset].length;
    }
    return pinyinList;
}
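
The longest-match logic of segLongest is easier to follow with the trie taken out. The sketch below swaps `trie.parseText` for a brute-force substring lookup in a `HashMap` and uses plain strings instead of the `Pinyin` enum. `SegLongestSketch`, `DICT`, and `demo` are hypothetical, and the dictionary entries are toy assumptions rather than lines copied from pinyin.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SegLongestSketch {
    // Toy dictionary (assumed entries). A single character may list several
    // readings; the first one is treated as the default.
    static final Map<String, String[]> DICT = new HashMap<>();
    static {
        DICT.put("重", new String[]{"zhong4", "chong2"});
        DICT.put("載", new String[]{"zai4", "zai3"});
        DICT.put("重載", new String[]{"chong2", "zai3"});
        DICT.put("不是", new String[]{"bu2", "shi4"});
        DICT.put("重任", new String[]{"zhong4", "ren4"});
    }

    static List<String> segLongest(String text, boolean remainNone) {
        String[][] wordNet = new String[text.length()][];
        // Brute-force stand-in for trie.parseText: test every substring.
        for (int begin = 0; begin < text.length(); begin++) {
            for (int end = begin + 1; end <= text.length(); end++) {
                String[] value = DICT.get(text.substring(begin, end));
                if (value == null) continue;
                int length = end - begin;
                // Keep only the longest word that starts at this position.
                if (wordNet[begin] == null || length > wordNet[begin].length) {
                    wordNet[begin] = length == 1 ? new String[]{value[0]} : value;
                }
            }
        }
        List<String> result = new ArrayList<>(text.length());
        for (int offset = 0; offset < wordNet.length; ) {
            if (wordNet[offset] == null) {
                if (remainNone) result.add("none5"); // placeholder for unknown characters
                offset++;
                continue;
            }
            result.addAll(Arrays.asList(wordNet[offset]));
            offset += wordNet[offset].length; // jump past the whole matched word
        }
        return result;
    }

    // Run the article's example sentence through the toy dictionary.
    static String demo() {
        return String.join(",", segLongest("重載不是重任", true));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With these toy entries, the two-character words 重載 and 重任 win over the single-character default for 重, which is how the same character ends up as chong2 in one place and zhong4 in the other.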


Usage

public static void main(String[] args)
{
    String text = "重載不是重任";
    List<Pinyin> pinyinList = HanLP.convertToPinyinList(text);
    System.out.print("原文:");
    for (char c : text.toCharArray())
    {
        System.out.printf("%c", c);
    }
    System.out.println();

    System.out.print("拼音(數字音調):");
    for (Pinyin pinyin : pinyinList)
    {
        System.out.printf("%s,", pinyin);
    }
    System.out.println();

    System.out.print("拼音(符號音調):");
    for (Pinyin pinyin : pinyinList)
    {
        System.out.printf("%s,", pinyin.getPinyinWithToneMark());
    }
    System.out.println();

    System.out.print("拼音(無音調):");
    for (Pinyin pinyin : pinyinList)
    {
        System.out.printf("%s,", pinyin.getPinyinWithoutTone());
    }
    System.out.println();

    System.out.print("聲調:");
    for (Pinyin pinyin : pinyinList)
    {
        System.out.printf("%s,", pinyin.getTone());
    }
    System.out.println();

    System.out.print("聲母:");
    for (Pinyin pinyin : pinyinList)
    {
        System.out.printf("%s,", pinyin.getShengmu());
    }
    System.out.println();

    System.out.print("韻母:");
    for (Pinyin pinyin : pinyinList)
    {
        System.out.printf("%s,", pinyin.getYunmu());
    }
    System.out.println();

    System.out.print("輸入法頭:");
    for (Pinyin pinyin : pinyinList)
    {
        System.out.printf("%s,", pinyin.getHead());
    }
    System.out.println();
}

Output:

原文:重載不是重任
拼音(數字音調):chong2,zai3,bu2,shi4,zhong4,ren4,
拼音(符號音調):chóng,zǎi,bú,shì,zhòng,rèn,
拼音(無音調):chong,zai,bu,shi,zhong,ren,
聲調:2,3,2,4,4,4,
聲母:ch,z,b,sh,zh,r,
韻母:ong,ai,u,i,ong,en,
輸入法頭:ch,z,b,sh,zh,r,

Data download: http://download.hanlp.com/data-for-1.7.5.zip
