Implementing a Chinese Character Dictionary in Java

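Environment: Eclipse, JDK 1.6. No third-party libraries are used; everything comes from the JDK.

Note: all source files in the project are UTF-8 encoded. With any other encoding, the Chinese comments in the code will show up garbled.

To set UTF-8 in Eclipse: Window -> Preferences -> General -> Workspace, then under Text file encoding choose Other: UTF-8.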

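Project structure:

1. Dictionary file

dic.txt download: http://download.csdn.net/detail/wssiqi/5056993

Only an excerpt is shown here; the full file contains 20,902 characters. Each line holds, in order: the Unicode code point (decimal), the character, its radical, its stroke count, its stroke order, its Wubi code, its Zhengma code, and then one or more pinyin pairs (ASCII with a tone number, followed by the tone-marked form).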


19968,一,一,1,1,GGLL,A,yi1,yī  

19969,丁,一,2,12,SGH,AI,ding1,dīng,zheng1,zhēng  

19970,丂,一,2,15,GNV,AZVV,kao3,kǎo,qiao3,qiǎo,yu2,yú  

19971,七,一,2,15,AGN,HD,qi1,qī  

19972,丄,一,2,21,HGD,IAVV,shang4,shàng  

19973,丅,一,2,12,GHK,AIAA,xia4,xià  

19974,丆,一,2,13,DGT,GDAA,han3,hǎn  

19975,万,一,3,153,DNV,,wan4,wàn,mo4,mò  

19976,丈,一,3,134,DYI,AOS,zhang4,zhàng  

19977,三,一,3,111,DGGG,CD,san1,sān  

19978,上,一,3,211,HHGG,IDA,shang3,shǎng,shang4,shàng  

19979,下,一,3,124,GHI,AID,xia4,xià  

19980,丌,一,3,132,GJK,AND,ji1,jī,qi2,qí  

19981,不,一,4,1324,GII,GI,fou3,fǒu,bu4,bù  

19982,与,一,3,151,GNGD,AZA,yu4,yù,yu3,yǔ,yu2,yú  

19983,丏,一,4,1255,GHNN,AIZY,mian3,miǎn  

19984,丐,一,4,1215,GHNV,AIZ,gai4,gài  

19985,丑,一,4,5211,NFD,XED,chou3,chǒu  

19986,丒,一,4,5341,VYGF,YDSA,chou3,chǒu
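To make the record layout concrete, here is a minimal sketch of a parser for one dic.txt line. The class `DicRecord` and its field names are hypothetical and not part of the original project; the field order follows the excerpt above. Non-ASCII characters are written as `\u` escapes only to keep the snippet independent of the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;

/** One parsed line of dic.txt (hypothetical helper, not part of the original project). */
final class DicRecord {
    final int codePoint;       // field 0: decimal Unicode code point
    final String character;    // field 1: the character itself
    final String radical;      // field 2
    final int strokeCount;     // field 3
    final String strokeOrder;  // field 4
    final String wubi;         // field 5
    final String zhengma;      // field 6, may be empty
    final List<String> pinyinAscii = new ArrayList<String>();   // e.g. "ding1"
    final List<String> pinyinMarked = new ArrayList<String>();  // tone-marked form

    private DicRecord(String[] f) {
        codePoint = Integer.parseInt(f[0]);
        character = f[1];
        radical = f[2];
        strokeCount = Integer.parseInt(f[3]);
        strokeOrder = f[4];
        wubi = f[5];
        zhengma = f[6];
        // Readings come in pairs: ASCII with tone number, then tone-marked
        for (int i = 7; i + 1 < f.length; i += 2) {
            pinyinAscii.add(f[i]);
            pinyinMarked.add(f[i + 1]);
        }
    }

    /** Parses one record; throws IllegalArgumentException if it is malformed. */
    static DicRecord parse(String line) {
        // trim() drops trailing spaces; limit -1 keeps empty fields
        // such as a missing Zhengma code
        String[] f = line.trim().split(",", -1);
        if (f.length < 7) {
            throw new IllegalArgumentException("Malformed record: " + line);
        }
        return new DicRecord(f);
    }
}
```

A character with several readings, such as code point 19969 in the excerpt, ends up with two entries in each pinyin list.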

2.Dic.java


package com.siqi.dict;  

import java.io.BufferedReader;  

import java.io.ByteArrayInputStream;  

import java.io.File;  

import java.io.FileInputStream;  

import java.io.InputStreamReader;  

import java.nio.charset.Charset;  

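/**
 * Local Chinese character dictionary.<br/>
 * The data comes from <a href="http://www.zdic.net/search/?c=2">zdic.net (漢典)</a>.
 * Covers common lookups: pinyin, Wubi code, pinyin first letter, stroke count and stroke order.
 *
 * @author siqi
 */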

public class Dic {  

    /** Whether to print debug output. */
    private static boolean DEBUG = true;

    /** Default charset of the dictionary file. */
    public static final Charset DEFAULT_CHARSET = Charset.forName("UTF-8");

    /** Lowest code point treated as a Chinese character (U+4E00). */
    public static final int CN_U16_CODE_MIN = 0x4e00;

    /** Highest code point treated as a Chinese character (U+9FA5). */
    public static final int CN_U16_CODE_MAX = 0x9fa5;

    /** File name of the local dictionary. */
    public static final String DIC_FILENAME = "dic.txt";

    /** Raw dictionary data. */
    public static byte[] bytes = new byte[0];

    /** Number of characters in the dictionary. */
    public static int count = 0;

    /*
     * Field positions within one comma-separated record, for example
     * "25171,打,扌,5,12112,RSH,DAI,da3,dǎ,da2,dá".
     */

    /** Position of the decimal Unicode code point. */
    public static final int INDEX_UNICODE = 0;

    /** Position of the character itself. */
    public static final int INDEX_CHARACTER = 1;

    /** Position of the radical. */
    public static final int INDEX_BUSHOU = 2;

    /** Position of the stroke count. */
    public static final int INDEX_BIHUA = 3;

    /** Position of the stroke order. */
    public static final int INDEX_BISHUN = 4;

    /** Position of the Wubi code. */
    public static final int INDEX_WUBI = 5;

    /** Position of the Zhengma code. */
    public static final int INDEX_ZHENGMA = 6;

    /** Position of the first pinyin in ASCII form with a tone number, e.g. "da3". */
    public static final int INDEX_PINYIN_EN = 7;

    /** Position of the first pinyin in tone-marked form, e.g. "dǎ". */
    public static final int INDEX_PINYIN_CN = 8;

    /** Loads the dictionary when the class is first used. */

    static {  

        long time = System.currentTimeMillis();  

        try {  

            LoadDictionary();  

            count = count();  

            if (DEBUG) {  

                System.out.println("成功載入字典" + new File(DIC_FILENAME).getCanonicalPath() + " ，用時："  

                        + (System.currentTimeMillis() - time) + "毫秒，載入字符數"+count);  

            }  

        } catch (Exception e) {  

            try {  

                System.out.println("載入字典失敗" + new File(DIC_FILENAME).getCanonicalPath()+"\r\n");  

            } catch (Exception e1) {  

            }  

            e.printStackTrace();  

        }  

    }  

    /**
     * Returns the Unicode code point of a character.
     *
     * @param ch a Chinese character
     * @return the decimal code point as stored in the dictionary
     * @throws Exception if ch is not a Chinese character or its record is malformed
     */
    public static String GetUnicode(Character ch) throws Exception {
        return GetCharInfo(ch, INDEX_UNICODE);
    }

    /**
     * Returns the pinyin of a single character in ASCII form, e.g. "大" -> "da4".
     */
    public static String GetPinyinEn(Character ch) throws Exception {
        return GetCharInfo(ch, INDEX_PINYIN_EN);
    }

    /**
     * Returns the ASCII pinyin of a string.
     *
     * @param str a string containing Chinese characters
     * @return str with every Chinese character replaced by its pinyin followed by a space;
     *         other characters are kept unchanged. Only the first reading in each record
     *         is used, so the result may be wrong for characters with several readings.
     */
    public static String GetPinyinEn(String str) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            if (isChineseChar(ch)) {
                sb.append(GetPinyinEn(ch)).append(' ');
            } else {
                sb.append(ch);
            }
        }
        return sb.toString().trim();
    }

    /**
     * Returns the pinyin of a single character in tone-marked form, e.g. "打" -> "dǎ".
     */
    public static String GetPinyinCn(Character ch) throws Exception {
        return GetCharInfo(ch, INDEX_PINYIN_CN);
    }

    /**
     * Returns the tone-marked pinyin of a string.
     *
     * @param str a string containing Chinese characters
     * @return str with every Chinese character replaced by its pinyin followed by a space;
     *         other characters are kept unchanged. Only the first reading in each record
     *         is used, so the result may be wrong for characters with several readings.
     */
    public static String GetPinyinCn(String str) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            if (isChineseChar(ch)) {
                sb.append(GetPinyinCn(ch)).append(' ');
            } else {
                sb.append(ch);
            }
        }
        return sb.toString().trim();
    }

    /**
     * Returns the first letter of a character's pinyin, or "" if ch is not a Chinese character.
     */
    public static String GetFirstLetter(Character ch) throws Exception {
        if (isChineseChar(ch)) {
            return GetPinyinEn(ch).substring(0, 1);
        } else {
            return "";
        }
    }

    /**
     * Returns the pinyin first letters of a string. Non-Chinese characters are ignored.
     */
    public static String GetFirstLetter(String str) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            if (isChineseChar(ch)) {
                sb.append(GetFirstLetter(ch));
            }
        }
        return sb.toString().trim();
    }

    /** Returns the radical of a character. */
    public static String GetBushou(Character ch) throws Exception {
        return GetCharInfo(ch, INDEX_BUSHOU);
    }

    /** Returns the stroke count of a character. */
    public static String GetBihua(Character ch) throws Exception {
        return GetCharInfo(ch, INDEX_BIHUA);
    }

    /** Returns the stroke order of a character. */
    public static String GetBishun(Character ch) throws Exception {
        return GetCharInfo(ch, INDEX_BISHUN);
    }

    /** Returns the Wubi code of a character. */
    public static String GetWubi(Character ch) throws Exception {
        return GetCharInfo(ch, INDEX_WUBI);
    }

    /** Returns the Zhengma code of a character. */
    public static String GetZhengma(Character ch) throws Exception {
        return GetCharInfo(ch, INDEX_ZHENGMA);
    }

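    /**
     * Looks up the full record of a character in the dictionary.
     *
     * @param ch the character to look up
     * @return the record, e.g. "25171,打,扌,5,12112,RSH,DAI,da3,dǎ,da2,dá",
     *         or "" if the character is not in the dictionary.<br/>
     *         Field 1: Unicode code point<br/>
     *         Field 2: the character<br/>
     *         Field 3: radical<br/>
     *         Field 4: stroke count<br/>
     *         Field 5: stroke order (the digits 1-5 stand for 橫, 豎, 撇, 捺, 折)<br/>
     *         Field 6: Wubi code<br/>
     *         Field 7: Zhengma code<br/>
     *         Field 8 onward: pinyin (ASCII form, then tone-marked form)<br/>
     * @throws Exception if ch is not a Chinese character
     */
    public static String GetCharInfo(Character ch) throws Exception {
        if (!isChineseChar(ch)) {
            throw new Exception("'" + ch + "' 不是一個漢字！");
        }
        String result = "";
        // Match the whole first field, including the comma that ends it
        String prefix = (int) ch.charValue() + ",";
        // Decode explicitly as UTF-8 instead of relying on the platform default charset
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), DEFAULT_CHARSET));
        try {
            String strWord;
            while ((strWord = br.readLine()) != null) {
                if (strWord.startsWith(prefix)) {
                    result = strWord;
                    break;
                }
            }
        } finally {
            br.close();
        }
        return result;
    }

    /**
     * Returns one field of a character's record.
     *
     * @param ch    a Chinese character
     * @param index position of the field within the record
     * @throws Exception if ch is not a Chinese character or the record lacks that field
     */
    private static String GetCharInfo(Character ch, int index) throws Exception {
        if (!isChineseChar(ch)) {
            throw new Exception("'" + ch + "' 不是一個漢字！");
        }
        // Fetch the full record
        String charInfo = GetCharInfo(ch);
        String result = "";
        try {
            result = charInfo.split(",")[index];
        } catch (Exception e) {
            throw new Exception("請查看字典中" + ch + "漢字記錄是否正確！");
        }
        return result;
    }

    /**
     * Loads the dictionary file into memory.
     *
     * @throws Exception if the file cannot be read
     */
    private static void LoadDictionary() throws Exception {
        File file = new File(DIC_FILENAME);
        bytes = new byte[(int) file.length()];
        FileInputStream fis = new FileInputStream(file);
        try {
            // A single read() may return fewer bytes than requested, so loop until full
            int off = 0;
            while (off < bytes.length) {
                int n = fis.read(bytes, off, bytes.length - off);
                if (n < 0) {
                    break;
                }
                off += n;
            }
        } finally {
            fis.close();
        }
    }

    /**
     * Tests whether a character is a Chinese character. The original version compared
     * ch.hashCode() with the range limits; that works because Character.hashCode() is
     * specified to return the char value, and reading the value directly states the intent.
     * Only the range U+4E00..U+9FA5 covered by the dictionary is accepted.
     *
     * @param ch the character to test
     * @return true if ch is a Chinese character, false otherwise
     */
    public static boolean isChineseChar(Character ch) {
        int code = ch.charValue();
        return code >= CN_U16_CODE_MIN && code <= CN_U16_CODE_MAX;
    }

    /**
     * @return the number of characters (lines) in the dictionary
     * @throws Exception if the data cannot be read
     */
    private static int count() throws Exception {
        int cnt = 0;
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), DEFAULT_CHARSET));
        try {
            while (br.readLine() != null) {
                cnt++;
            }
        } finally {
            br.close();
        }
        return cnt;
    }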

}
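Dic.java scans the whole byte array line by line for every lookup. One possible alternative, sketched below under the assumption that the whole file fits comfortably in memory (it has about 20,000 short lines), is to index the records by code point once. `DicIndex` is a hypothetical class name; it takes already-decoded lines so that it makes no assumption about how the file is read.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory index over dic.txt records (a sketch; the class name is hypothetical). */
final class DicIndex {
    private final Map<Integer, String[]> records = new HashMap<Integer, String[]>();

    /** Splits every record once and keys it by its code point (field 0). */
    DicIndex(Iterable<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(",", -1);
            try {
                records.put(Integer.parseInt(fields[0]), fields);
            } catch (NumberFormatException e) {
                // skip lines whose first field is not a number
            }
        }
    }

    /** Returns the field at index for ch, or null if the character or field is absent. */
    String field(char ch, int index) {
        String[] fields = records.get((int) ch);
        if (fields == null || index < 0 || index >= fields.length) {
            return null;
        }
        return fields[index];
    }

    /** Number of indexed characters. */
    int size() {
        return records.size();
    }
}
```

With such an index, a call like `index.field(ch, 7)` would play the role of `Dic.GetPinyinEn(ch)`, with each lookup costing one hash probe instead of a scan of the file.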

3.Sample.java

How to use the dictionary:


package com.siqi.dict;  

/**
 * Two examples showing how to get the pinyin and other data of Chinese characters.
 *
 * @author siqi
 */
public class Sample {

    /**
     * Dictionary usage example.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        try {
            long time = System.currentTimeMillis();
            char ch = '打';
            // A single character
            System.out.println("====打字信息開始====");
            System.out.println("首字母："+Dic.GetFirstLetter(ch));
            System.out.println("拼音（中）："+Dic.GetPinyinCn(ch));
            System.out.println("拼音（英）："+Dic.GetPinyinEn(ch));
            System.out.println("部首："+Dic.GetBushou(ch));
            System.out.println("筆畫數目："+Dic.GetBihua(ch));
            System.out.println("筆畫："+Dic.GetBishun(ch));
            System.out.println("五筆："+Dic.GetWubi(ch));
            System.out.println("====打字信息結束====");
            // A string of characters
            System.out.println("\r\n====漢字字符串====");
            System.out.println(Dic.GetPinyinEn("返回漢字字符串的拼音。"));
            System.out.println(Dic.GetPinyinCn("返回漢字字符串的拼音。"));
            System.out.println(Dic.GetFirstLetter("返回漢字字符串的拼音。"));
            System.out.println("====漢字字符串====\r\n");
            System.out.println("用時："+(System.currentTimeMillis()-time)+"毫秒");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}
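The string methods used in this sample walk the input one `char` at a time and test each against the fixed range U+4E00..U+9FA5. For the more general question of whether a piece of text is a Han character at all, the JDK offers `Character.UnicodeScript` (Java 7 and later, so newer than the JDK 1.6 used in this post). The sketch below uses the hypothetical class name `HanCheck`. It does not replace the range check for dictionary lookups, because dic.txt only contains the characters in that range.

```java
/** Code-point based Han checks (a sketch; the class name is hypothetical, requires Java 7+). */
final class HanCheck {

    /** True if the code point belongs to the Han script, including the extension blocks. */
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    /** Counts Han characters, iterating by code point so that a surrogate pair counts once. */
    static int countHan(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isHan(cp)) {
                n++;
            }
            i += Character.charCount(cp);
        }
        return n;
    }
}
```

Iterating with `codePointAt` matters only for characters outside the Basic Multilingual Plane, which a plain `charAt` loop would see as two separate surrogate values.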

4. Output


====打字信息開始====  

成功載入字典C:\workspaces\01_java\DictLocal\dic.txt ，用時：15毫秒，載入字符數20902  

首字母：d  

拼音（中）：dǎ  

拼音（英）：da3  

部首：扌  

筆畫數目：5  

筆畫：12112  

五筆：RSH  

====打字信息結束====  

====漢字字符串====  

fan3 hui2 han4 zi4 zi4 fu2 chuan4 di2 pin1 yin1 。  

fǎn huí hàn zì zì fú chuàn dí pīn yīn 。  

fhhzzfcdpy  

====漢字字符串====  

Memory(Used/Total) : 1539/15872 KB  

用時：218毫秒

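I will post how the dictionary file was built later. The data was collected by crawling pages from http://www.zdic.net/zd/.

============= Supplement: how to obtain the character data ================

============= All data was collected from the zdic.net (漢典) website =========

Directory layout: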

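Packages:

com.siqi.http

Httpclient.java is a simple class I wrote for fetching web pages.

com.siqi.dict

DictMain.java downloads the character pages, extracts the pinyin from them, and saves the result to data.txt (renamed to data.dat afterwards).

DownloadThread.java downloads the pages (multi-threaded).

com.siqi.pinyin

PinYin.java: running DictMain.java produces the data file. Copy it into the com.siqi.pinyin package as data.dat, and the functions in PinYin.java will return the pinyin of Chinese characters.

PinYinEle.java is a model holding character -> pinyin -> Unicode.

Source code:

Httpclient.java fetches a web page and exposes its content, its charset and its header. It is a simplified implementation.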


package com.siqi.http;  

import java.io.IOException;  

import java.io.InputStream;  

import java.net.Socket;  

import java.net.URLEncoder;  

import java.util.regex.Matcher;  

import java.util.regex.Pattern;  

/**
 * Simple HTTP GET and POST implemented on top of a raw Socket.
 *
 * @author siqi
 */
public class Httpclient {

    /** processUrl argument: HTTP GET. */
    public static final int METHOD_GET = 0;

    /** processUrl argument: HTTP POST. */
    public static final int METHOD_POST = 1;

    /** Minimal HTTP GET request header. */
    public static final String HEADER_GET = "GET %s HTTP/1.0\r\nHOST: %s\r\n\r\n";

    /** Minimal HTTP POST request header. */
    public static final String HEADER_POST = "POST %s HTTP/1.0\r\nHOST: %s\r\nContent-Length: 0\r\n\r\n";

    /** Separator between the response header and the body. */
    public static final String CONTENT_SEPARATOR = "\r\n\r\n";

    /** Raw bytes of the response. */
    private byte[] bytes = new byte[0];

    /** Response header. */
    private String header = "";

    /** Response body. */
    private String content = "";

    /** Default page charset: UTF-8. */
    public static final String CHARSET_DEFAULT = "UTF-8";

    /** Charset of the page. */
    private String charset = CHARSET_DEFAULT;

    /**
     * Example of using Httpclient.
     *
     * @param args unused
     * @throws Exception if a request fails
     */
    public static void main(String[] args) throws Exception {
        Httpclient httpclient = new Httpclient();
        // Request the Baidu home page (mobile version)
        httpclient.processUrl("http://m.baidu.com/");
        System.out.println("獲取網頁http://m.baidu.com/");
        System.out.println("報頭爲：\r\n" + httpclient.getHeader());
        System.out.println("內容爲：\r\n" + httpclient.getContent());
        System.out.println("編碼爲：\r\n" + httpclient.getCharset());
        System.out.println("************************************");
        // Search Baidu (mobile version) for "中國".
        // This is the markup of the mobile search box:
        // <input id="word" type="text" size="20" maxlength="64" name="word">
        String url = String.format("http://m.baidu.com/s?word=%s",
                URLEncoder.encode("中國", CHARSET_DEFAULT));
        httpclient.processUrl(url, METHOD_POST);
        System.out.println("獲取網頁http://m.baidu.com/s?word=中國");
        System.out.println("報頭爲：\r\n" + httpclient.getHeader());
        System.out.println("內容爲：\r\n" + httpclient.getContent());
        System.out.println("編碼爲：\r\n" + httpclient.getCharset());
    }

    /** Resets all fields to their default values. */
    private void init() {
        this.bytes = new byte[0];
        this.charset = CHARSET_DEFAULT;
        this.header = "";
        this.content = "";
    }

    /** @return the response header */
    public String getHeader() {
        return header;
    }

    /** @return the response body */
    public String getContent() {
        return content;
    }

    /** @return the charset of the page */
    public String getCharset() {
        return charset;
    }

    /**
     * Requests a page with HTTP GET.
     *
     * @param url address of the page
     * @throws Exception if the request fails
     */
    public void processUrl(String url) throws Exception {
        processUrl(url, METHOD_GET);
    }

    /**
     * Requests a page over a Socket.<br/>
     * For example, processUrl("http://www.baidu.com/", METHOD_GET) fetches the Baidu home page.<br/>
     *
     * @param url    address of the page or resource
     * @param method request method: METHOD_GET or METHOD_POST
     * @throws Exception if the request fails
     */
    public void processUrl(String url, int method) throws Exception {
        init();
        // Normalize the URL: "http://www.baidu.com" becomes "http://www.baidu.com/"
        Matcher mat = Pattern.compile("https?://[^/]+").matcher(url);
        if (mat.find() && mat.group().equals(url)) {
            url += "/";
        }
        // Connect to the server on port 80 (plain HTTP only)
        Socket socket = new Socket(getHostUrl(url), 80);
        try {
            socket.setSoTimeout(3000); // read timeout: 3 seconds
            // Build the request; see the HTTP specification (RFC 2616)
            String request;
            if (method == METHOD_POST) {
                request = String.format(HEADER_POST, getSubUrl(url), getHostUrl(url));
            } else {
                request = String.format(HEADER_GET, getSubUrl(url), getHostUrl(url));
            }
            socket.getOutputStream().write(request.getBytes("ISO-8859-1")); // send the request
            this.bytes = InputStream2ByteArray(socket.getInputStream()); // read the response
        } finally {
            socket.close(); // close the socket even if the request fails
        }
        // Detect the page charset. Only the first 4096 bytes are searched,
        // which is normally where the charset declaration is found.
        String temp = new String(this.bytes, 0, Math.min(bytes.length, 4096), "ISO-8859-1");
        mat = Pattern.compile("(?<=<meta.{0,100}?charset=[\"']?)[a-z0-9][a-z0-9-]*",
                Pattern.CASE_INSENSITIVE).matcher(temp);
        if (mat.find() && java.nio.charset.Charset.isSupported(mat.group())) {
            this.charset = mat.group();
        } else {
            this.charset = CHARSET_DEFAULT;
        }
        // Decode with the detected charset and split header from body
        temp = new String(this.bytes, this.charset);
        int headerEnd = temp.indexOf(CONTENT_SEPARATOR);
        if (headerEnd < 0) {
            // No blank line found: treat the whole response as the header
            this.header = temp;
            this.content = "";
            return;
        }
        this.header = temp.substring(0, headerEnd);
        this.content = temp.substring(headerEnd + CONTENT_SEPARATOR.length());
    }

    /**
     * Extracts the server host from a URL.<br/>
     * Example:<br/>
     * http://m.weathercn.com/common/province.jsp
     * <p>
     * returns:<br/>
     * m.weathercn.com
     *
     * @param url the URL
     * @return the host, or "" if none was found
     */
    public static String getHostUrl(String url) {
        String host = "";
        Matcher mat = Pattern.compile("(?<=https?://).+?(?=/)").matcher(url);
        if (mat.find()) {
            host = mat.group();
        }
        return host;
    }

    /**
     * Extracts the path from a URL.<br/>
     * Example:<br/>
     * http://m.weathercn.com/common/province.jsp
     * <p>
     * returns:<br/>
     * /common/province.jsp
     *
     * @param url the URL
     * @return the path, or "" if none was found
     */
    public static String getSubUrl(String url) {
        String subUrl = "";
        Matcher mat = Pattern.compile("https?://.+?(?=/)").matcher(url);
        if (mat.find()) {
            subUrl = url.substring(mat.group().length());
        }
        return subUrl;
    }

    /**
     * Concatenates two byte arrays: result = b1 + b2.
     */
    public static byte[] ByteArrayCat(byte[] b1, byte[] b2) {
        byte[] b = new byte[b1.length + b2.length];
        System.arraycopy(b1, 0, b, 0, b1.length);
        System.arraycopy(b2, 0, b, b1.length, b2.length);
        return b;
    }

    /**
     * Reads an input stream into a byte array. A string is not returned because the
     * charset of the stream is not known yet, and decoding with the wrong charset
     * would garble the text.
     *
     * @param is the input stream
     * @return all bytes read from the stream
     * @throws IOException if reading fails
     */
    public static byte[] InputStream2ByteArray(InputStream is)
            throws IOException {
        // ByteArrayOutputStream grows its buffer as needed, which avoids
        // re-copying everything read so far on each iteration
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        byte[] bb = new byte[4096]; // read buffer
        int len;
        while ((len = is.read(bb)) != -1) {
            out.write(bb, 0, len);
        }
        return out.toByteArray();
    }

}
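`getHostUrl` and `getSubUrl` pull the host and path out of the URL with regular expressions, and `processUrl` always connects to port 80. As a point of comparison, the sketch below does the same splitting with `java.net.URI` from the JDK and also honours an explicit port. `HttpHelpers` and its method names are hypothetical; `UncheckedIOException` needs Java 8, which is newer than the JDK 1.6 used in this post.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;

/** URL splitting and stream reading helpers (a sketch; names are hypothetical). */
final class HttpHelpers {

    /** Host part of the URL, e.g. "m.weathercn.com". */
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    /** Explicit port if the URL has one, otherwise 443 for https and 80 for everything else. */
    static int portOf(String url) {
        URI uri = URI.create(url);
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    /** Path plus query as it would appear on the request line; "/" when the URL has no path. */
    static String requestTarget(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery();
        return query == null ? path : path + "?" + query;
    }

    /** Reads the stream to its end into a byte array. */
    static byte[] readFully(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int len;
            while ((len = in.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The raw (still percent-encoded) path and query are used on purpose, so that an already encoded search term is passed through to the server unchanged.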

DictMain.java


package com.siqi.dict;  

import java.io.File;  

import java.io.FileReader;  

import java.io.FileWriter;  

import java.io.IOException;  

import java.util.regex.Matcher;  

import java.util.regex.Pattern;  

/**
 * Downloads character pages from zdic.net and extracts the pinyin from them.
 *
 * @author siqi
 */
public class DictMain {

    /** Directory where downloaded pages are saved. */
    public static final String SAVEPATH = "dict/pages/";

    /** File name pattern of a downloaded character page. */
    public static final String FILEPATH = SAVEPATH + "%s.html";

    /** Name of the generated data file. */
    public static final String DATA_FILENAME = "data.txt";

    /** Lowest code point to download. */
    public static final int UNICODE_MIN = 0x4E00;

    /** Highest code point to download. */
    public static final int UNICODE_MAX = 0x9FFF;

    /**
     * Preparation steps:
     * 1. Download the page of every character from zdic.net. Do not open the page
     *    folder in Eclipse: with one page per character there are 20,000+ files,
     *    which can freeze Eclipse.
     * 2. Extract the pinyin from each page and write data.txt.
     * 3. Rename data.txt to data.dat and copy it into com.siqi.pinyin.
     * 4. com.siqi.pinyin.PinYin.java is then ready to use.
     */
    private static void prepare() {
        // Download the pages
        java.util.List<Thread> threads = new java.util.ArrayList<Thread>();
        for (int i = UNICODE_MIN; i <= UNICODE_MAX; i++) {
            // Skip pages that were already downloaded
            String filePath = String.format(FILEPATH, i);
            File file = new File(filePath);
            if (!file.exists()) {
                Thread t = new DownloadThread(i);
                t.start();
                threads.add(t);
            }
        }
        // Wait for the outstanding downloads so that parsing sees complete files
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        // Parse the pages and collect the pinyin
        StringBuilder sb = new StringBuilder();
        for (int i = UNICODE_MIN; i <= UNICODE_MAX; i++) {
            String word = new String(Character.toChars(i));
            String pinyin = getPinYinFromWebpageFile(String.format(FILEPATH, i));
            String str = String.format("%s,%s,%s\r\n", i, word, pinyin);
            System.out.print(str);
            sb.append(str);
        }
        // Save to data.txt
        try {
            FileWriter fw = new FileWriter(DATA_FILENAME);
            try {
                fw.write(sb.toString());
            } finally {
                fw.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args){
        prepare();
        System.out.println("All prepared!");
    }

    /**
     * Extracts the pinyin from a downloaded page.
     *
     * @param file path of the page file
     * @return the pinyin, or "" if none was found or the file could not be read
     */
    private static String getPinYinFromWebpageFile(String file) {
        try {
            char[] buff = new char[(int) new File(file).length()];
            FileReader reader = new FileReader(file);
            try {
                reader.read(buff);
            } finally {
                reader.close();
            }
            String content = new String(buff);
            // First form: spf("yi1")
            Matcher mat = Pattern.compile("(?<=spf\\(\")[a-z1-4]{0,100}",
                    Pattern.CASE_INSENSITIVE).matcher(content);
            if (mat.find()) {
                return mat.group();
            }
            // Second form: <span class="dicpy">cal</span>
            mat = Pattern.compile("(?<=class=\"dicpy\">)[a-z1-4]{0,100}",
                    Pattern.CASE_INSENSITIVE).matcher(content);
            if (mat.find()) {
                return mat.group();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "";
    }

}
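The extraction step in `getPinYinFromWebpageFile` can be tried in isolation, without any downloaded page. The sketch below assumes the two page fragments quoted in the comments of DictMain.java (`spf("yi1")` and `<span class="dicpy">cal</span>`); whether zdic.net still serves that markup is not verified here. `PinyinExtractor` is a hypothetical name. It differs from the original in two small ways: the patterns are compiled once, and an empty `spf("")` no longer stops the search before the second pattern is tried.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Pulls the first pinyin out of a character page (a sketch; the class name is hypothetical). */
final class PinyinExtractor {
    // Matches spf("yi1") and captures yi1
    private static final Pattern SPF =
            Pattern.compile("spf\\(\"([a-z1-4]+)\"", Pattern.CASE_INSENSITIVE);
    // Matches <span class="dicpy">cal</span> and captures cal
    private static final Pattern DICPY =
            Pattern.compile("class=\"dicpy\">([a-z1-4]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the pinyin found in the page text, or "" if neither pattern matches. */
    static String extract(String html) {
        Matcher m = SPF.matcher(html);
        if (m.find()) {
            return m.group(1);
        }
        m = DICPY.matcher(html);
        if (m.find()) {
            return m.group(1);
        }
        return "";
    }
}
```

Using a capture group instead of a lookbehind keeps the patterns a little easier to read; the matched text is the same.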

DownloadThread.java


package com.siqi.dict;  

import java.io.File;  

import java.io.FileWriter;  

import java.net.URLEncoder;  

import java.util.regex.Matcher;  

import java.util.regex.Pattern;  

import com.siqi.http.Httpclient;  

/**
 * Downloads character pages from zdic.net and stores them locally.
 * http://www.zdic.net/search/?c=2
 *
 * @author siqi
 */
public class DownloadThread extends Thread{

    /** Maximum number of concurrent threads. */
    public static int THREAD_MAX = 10;

    /** Maximum number of download attempts. */
    public static int RETRY_MAX = 5;

    /** zdic.net search URL. */
    public static String SEARCH_URL = "http://www.zdic.net/search/?q=%s";

    /** Number of threads currently active. */
    private static int threadCnt = 0;

    /** Code point of the character handled by this thread. */
    private int unicode = 0;

    /** Creates the save directory if it does not exist yet. */
    static{
        try {
            File file = new File(DictMain.SAVEPATH);
            if (!file.exists()) {
                file.mkdirs();
            }
        } catch (Exception e) {
        }
    }

    /**
     * Adjusts and returns the current thread count.
     *
     * @param i amount added to the count (threadCnt += i); pass 0 to only read it
     * @return the thread count after the change
     */
    public static synchronized int threadCnt(int i){
        threadCnt += i;
        return threadCnt;
    }

    /**
     * Creates a thread that downloads the page of the character with the given code point.
     * Blocks until fewer than THREAD_MAX threads are active.
     *
     * @param unicode code point of the character
     */
    public DownloadThread(int unicode){
        // Wait until the number of active threads drops below THREAD_MAX
        while(threadCnt(0)>=THREAD_MAX){
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
            }
        }
        threadCnt(1);   // one more active thread
        this.unicode = unicode;
    }

    @Override
    public void run() {
        long t1 = System.currentTimeMillis(); // start time
        String filePath = String.format(DictMain.FILEPATH, unicode); // file name
        String word = new String(Character.toChars(unicode)); // code point -> character
        boolean downloaded = false;
        int retryCnt = 0; // number of failed attempts
        while (!downloaded && retryCnt < RETRY_MAX) {
            try {
                String content = DownloadPage(word);
                SaveToFile(filePath, content);
                downloaded = true;
                threadCnt(-1);
                System.out.println(String.format("%s, %s, 下載成功！線程數目：%s 用時：%s",
                        unicode, word, threadCnt(0), System.currentTimeMillis()
                                - t1));
                return;
            } catch (Exception e) {
                retryCnt++;
            }
        }
        threadCnt(-1);
        System.err.println(String.format("%s, %s, 下載失敗！線程數目：%s 用時：%s", unicode,
                word, threadCnt(0), System.currentTimeMillis() - t1));
    }

    /**
     * Searches zdic.net for a character and returns the content of its dictionary page.
     *
     * @param word the character to look up
     * @return the page content
     * @throws Exception if the request fails
     */
    public String DownloadPage(String word) throws Exception{
        // Search for the character
        Httpclient httpclient = new Httpclient();
        String url = String.format(SEARCH_URL, URLEncoder.encode(word, "UTF-8"));
        httpclient.processUrl(url, Httpclient.METHOD_POST);
        // The response is a redirect page; follow the link it contains
        Matcher mat = Pattern.compile("(?<=HREF=\")[^\"]+").matcher(httpclient.getContent());
        if(mat.find()){
            httpclient.processUrl(mat.group());
        }
        return httpclient.getContent();
    }

    /**
     * Writes content to a file.
     *
     * @param file    path of the target file
     * @param content text to write
     * @throws Exception if writing fails, so that run() counts the attempt as failed and retries
     */
    public void SaveToFile(String file, String content) throws Exception {
        FileWriter fw = new FileWriter(file);
        try {
            fw.write(content);
        } finally {
            fw.close();
        }
    }

}
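DownloadThread limits concurrency by hand: a shared counter, a sleep loop in the constructor, and one thread object per character. The JDK's `ExecutorService` offers the same bound with less bookkeeping. The sketch below is one way it could look, not what the original project does; `BoundedRunner` and `runAll` are hypothetical names, and the lambda and `IntConsumer` need Java 8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntConsumer;

/** Runs one task per integer on a fixed-size pool (a sketch; names are hypothetical). */
final class BoundedRunner {

    /**
     * Calls task.accept(i) for every i in [from, to] using at most `threads`
     * worker threads, waits for all of them, and returns how many tasks failed.
     */
    static int runAll(int from, int to, int threads, IntConsumer task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<Future<?>>();
        try {
            for (int i = from; i <= to; i++) {
                final int value = i;
                futures.add(pool.submit(() -> task.accept(value)));
            }
            int failed = 0;
            for (Future<?> f : futures) {
                try {
                    f.get(); // blocks until this task has finished
                } catch (ExecutionException e) {
                    failed++; // the task threw an exception
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failed++;
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

In the setting of this post, the task would download and save the page for one code point, and the call would look like `BoundedRunner.runAll(0x4E00, 0x9FFF, 10, cp -> ...)`. Because `runAll` returns only after every task has finished, parsing can start safely afterwards.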

PinYin.java


package com.siqi.pinyin;  

import java.io.BufferedReader;  

import java.io.InputStreamReader;  

import java.util.HashMap;  

import java.util.Map;  

public class PinYin {  

    private static Map<Integer, PinYinEle> map = new HashMap<Integer, PinYinEle>();  

    /**
     * Loads the pinyin data file.
     */
    static {
        try {
            // Decode explicitly as UTF-8 (assumes data.dat is UTF-8 like the rest of
            // the project) instead of relying on the platform default charset
            BufferedReader bReader = new BufferedReader(new InputStreamReader(
                    PinYin.class.getResourceAsStream("data.dat"), "UTF-8"));
            String aLine = null;
            while ((aLine = bReader.readLine()) != null) {
                PinYinEle ele = new PinYinEle(aLine);
                map.put(ele.getUnicode(), ele);
            }
            bReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Simple test.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        System.out.println("　包含聲調：" + PinYin.getPinYin("大家haome12345"));
        System.out.println("不包含聲調：" + PinYin.getPinYin("大家haome12345", false));
    }

    /**
     * Returns the pinyin of a string.
     *
     * @param str            the input string
     * @param containsNumber true = keep the tone digits 1-4, false = drop them
     * @return str with every known Chinese character replaced by its pinyin
     */
    public static String getPinYin(String str, boolean containsNumber) {
        StringBuilder sb = new StringBuilder();
        for (Character ch : str.toCharArray()) {
            sb.append(getPinYin(ch, containsNumber));
        }
        return sb.toString();
    }

    /**
     * Returns the pinyin of a string, including tone digits.
     */
    public static String getPinYin(String str) {
        StringBuilder sb = new StringBuilder();
        for (Character ch : str.toCharArray()) {
            sb.append(getPinYin(ch));
        }
        return sb.toString();
    }

    /**
     * Returns the pinyin of a single character, including the tone digit.
     */
    public static String getPinYin(Character ch) {
        return getPinYin(ch, true);
    }

    /**
     * Returns the pinyin of a single character.
     *
     * @param ch             a Chinese character. A character that is not in the data
     *                       is returned unchanged; null yields an empty string.
     * @param containsNumber true = keep the tone digit, false = drop it
     * @return the pinyin
     */
    public static String getPinYin(Character ch, boolean containsNumber) {
        if (ch != null) {
            int code = ch.charValue();
            if (map.containsKey(code)) {
                if (containsNumber) {
                    return map.get(code).getPinyin();
                } else {
                    return map.get(code).getPinyin().replaceAll("[0-9]", "");
                }
            } else {
                return ch.toString();
            }
        }
        return "";
    }

}

PinYinEle.java


package com.siqi.pinyin;  

public class PinYinEle {  

    private int unicode;  

    private String ch;  

    private String pinyin;  

    public PinYinEle(){}  

    public PinYinEle(String str){  

        if(str!=null){  

            String[] strs = str.split(",");  

            if(strs.length == 3){  

                try{  

                this.unicode = Integer.parseInt(strs[0]);  

                }catch(Exception e){  

                }  

                this.ch = strs[1];  

                this.pinyin = strs[2];  

            }  

        }  

    }  

    public int getUnicode() {  

        return unicode;  

    }  

    public void setUnicode(int unicode) {  

        this.unicode = unicode;  

    }  

    public String getCh() {  

        return ch;  

    }  

    public void setCh(String ch) {  

        this.ch = ch;  

    }  

    public String getPinyin() {  

        return pinyin;  

    }  

    public void setPinyin(String pinyin) {  

        this.pinyin = pinyin;  

    }  

}

An excerpt of the generated data.dat:


﻿19968,一,yi1  

19969,丁,ding1  

19970,丂,kao3  

19971,七,qi1  

19972,丄,shang4  

19973,丅,xia4  

19974,丆,han3  

19975,万,wan4  

19976,丈,zhang4  

19977,三,san1  

19978,上,shang4  

19979,下,xia4  

19980,丌,qi2  

19981,不,bu4
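`PinYinEle` silently ignores a line whose code point cannot be parsed. One case where that could matter is a data file saved with a UTF-8 byte order mark: the mark decodes to U+FEFF in front of the first number, and `Integer.parseInt` then rejects that line. The sketch below shows a more defensive parse of the three-field format above; `PinyinLine` is a hypothetical name, and whether a given data.dat actually starts with a byte order mark depends on how it was saved.

```java
/** One "code,char,pinyin" line of data.dat (a sketch; the class name is hypothetical). */
final class PinyinLine {
    final int codePoint;
    final String pinyin;

    private PinyinLine(int codePoint, String pinyin) {
        this.codePoint = codePoint;
        this.pinyin = pinyin;
    }

    /** Parses one line; returns null for lines that do not have the expected shape. */
    static PinyinLine parse(String line) {
        if (line == null) {
            return null;
        }
        // A UTF-8 byte order mark decodes to U+FEFF and would break Integer.parseInt
        if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
            line = line.substring(1);
        }
        String[] parts = line.trim().split(",", -1);
        if (parts.length != 3) {
            return null;
        }
        try {
            return new PinyinLine(Integer.parseInt(parts[0]), parts[2]);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Pinyin without its tone digit, e.g. "da4" becomes "da". */
    String withoutTone() {
        return pinyin.replaceAll("[0-9]", "");
    }
}
```

Returning null for a bad line lets the caller decide whether to skip it or report it, instead of storing an entry under code point 0.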

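Running DictMain.java

The run can take anywhere from tens of minutes to several hours and downloads 200+ MB of pages (20,000+ pages). Each run first checks which pages have already been downloaded, so stopping the program partway does no harm.

When "All prepared!" is printed, everything is ready. Refresh the project folder to see the pages saved under dict/pages. Opening that folder in Eclipse is not recommended: it holds more than 20,000 files and can freeze Eclipse.

A data.txt file is generated as well; rename it to data.dat and copy it into the pinyin package folder.

Running PinYin.java

It prints the pinyin of "大家haome12345":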


包含聲調：da4jia1haome12345  

不包含聲調：dajiahaome12345  

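The above only shows how to obtain the pinyin. Stroke counts and the other fields can be obtained in a similar way, so they are not demonstrated here.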
