Detecting File Encoding

Original

_狼_

2019-02-02 19:18

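Overview

Commonly used file encodings:

ansi
unicode
utf8
gb2312

This post focuses on how to verify a file's encoding, not on how the encodings are defined.

Creating the files

Create four files in different encodings, named unicode.txt, ansi.txt (GB2312), utf8.txt, and utf8bom.txt. Each contains the single character "一"; use Notepad++ to convert each file to the corresponding encoding.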
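If Notepad++ is not at hand, the same four files can probably be generated from Java as well. The following is only a minimal sketch, not the procedure used for this post: the class name `SampleFiles` and its `encode` and `create` helpers are made up for illustration, and it assumes that "ANSI" means GB2312 (as on a Simplified Chinese Windows system) and that the JDK ships the GB2312 charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not from the original post): writes the four sample files.
class SampleFiles {

    // Concatenates an optional byte order mark and the encoded text.
    static byte[] encode(String text, Charset charset, int... bom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b : bom) {
            out.write(b);
        }
        byte[] body = text.getBytes(charset);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    static void create(Path dir, String text) throws IOException {
        // "Unicode" in Windows tools: UTF-16 little-endian preceded by the BOM FF FE.
        Files.write(dir.resolve("unicode.txt"), encode(text, StandardCharsets.UTF_16LE, 0xFF, 0xFE));
        // Assumption: "ANSI" is GB2312 here.
        Files.write(dir.resolve("ansi.txt"), encode(text, Charset.forName("GB2312")));
        Files.write(dir.resolve("utf8.txt"), encode(text, StandardCharsets.UTF_8));
        Files.write(dir.resolve("utf8bom.txt"), encode(text, StandardCharsets.UTF_8, 0xEF, 0xBB, 0xBF));
    }

    public static void main(String[] args) throws IOException {
        // U+4E00 is the character used throughout this post.
        create(Paths.get("."), "\u4e00");
    }
}
```

The BOM is written by hand because Java's UTF_16LE and UTF_8 charsets do not emit one on their own.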

Reading the raw bytes

import org.apache.commons.io.FileUtils;

import java.io.File;
import java.io.IOException;

public class FileLearning {

    /**
     * Reads a file as raw bytes and prints them as a hex string.
     */
    public void binary(String filepath) throws IOException {
        File file = new File(filepath);
        byte[] data = FileUtils.readFileToByteArray(file);
        System.out.println(bytesToHex(data));
    }

    /**
     * Converts a byte array to a hexadecimal string.
     *
     * @param bytes the byte array to convert
     * @return the lowercase hex string, two digits per byte
     */
    public static String bytesToHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask with 0xFF so that negative bytes are treated as unsigned
            String hex = Integer.toHexString(b & 0xFF);
            if (hex.length() < 2) {
                sb.append('0');
            }
            sb.append(hex);
        }
        return sb.toString();
    }
}

Unit test

fileLearning.binary("unicode.txt");
fileLearning.binary("ansi.txt");
fileLearning.binary("utf8.txt");
fileLearning.binary("utf8bom.txt");

The output is:

fffe004e
d2bb
e4b880
efbbbfe4b880

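Analysis

1. ANSI: the file consists of just the two bytes "D2 BB", which is exactly the ANSI (GB2312) encoding of "一". This also shows that ANSI stores the high byte first, in big-endian order.
2. Unicode: the file is four bytes, "FF FE 00 4E". The leading "FF FE" indicates little-endian storage, so the actual code is 4E00. If the file were saved as Big-Endian (Unicode big endian), it would be the four bytes "FE FF 4E 00", where "FE FF" indicates big-endian storage.
3. UTF-8 BOM: the file is six bytes, "EF BB BF E4 B8 80". The first three bytes "EF BB BF" mark the file as UTF-8, and the last three, "E4 B8 80", are the encoding of "一" itself. The storage order matches the encoding order.
4. UTF-8: the file does not carry the "EF BB BF" marker that identifies UTF-8; it contains only the three bytes "E4 B8 80".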
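The byte sequences in this analysis can also be reproduced without any files by encoding the character in memory. A small sketch follows; the class name `EncodeDemo` and its helpers are illustrative only, and GB2312 stands in for "ANSI" on the assumption of a Simplified Chinese system. Java's UTF_16LE and UTF_16BE charsets do not write a BOM, so only the payload bytes appear.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows how U+4E00 is laid out in each encoding discussed above.
class EncodeDemo {

    // Lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    static String encode(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String one = "\u4e00";
        System.out.println("GB2312   " + encode(one, Charset.forName("GB2312"))); // d2bb
        System.out.println("UTF-16LE " + encode(one, StandardCharsets.UTF_16LE)); // 004e
        System.out.println("UTF-16BE " + encode(one, StandardCharsets.UTF_16BE)); // 4e00
        System.out.println("UTF-8    " + encode(one, StandardCharsets.UTF_8));    // e4b880
    }
}
```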

For details on the encodings, see https://www.cnblogs.com/gavin-num1/p/5170247.html
Chinese Unicode code lookup: http://www.chi2ko.com/tool/CJK.htm

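Conclusion

A file that starts with FF FE or FE FF is a Unicode (UTF-16) file.
A file that starts with EF BB BF is a UTF-8 file with BOM.
GB2312 and UTF-8 without BOM cannot be identified from the file header.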
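These rules translate fairly directly into code. Below is a minimal sketch under stated assumptions: `BomSniffer`, `detect`, and `detectFile` are hypothetical names, only the three BOMs discussed in this post are checked (UTF-32, whose little-endian BOM also begins with FF FE, is ignored), and `null` stands for "no BOM, cannot tell from the header".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: names the encoding implied by a leading byte order mark.
class BomSniffer {

    // Returns "UTF-8", "UTF-16LE" or "UTF-16BE", or null when no known BOM is present.
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return null; // e.g. GB2312, or UTF-8 without BOM
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Reads at most the first three bytes of the file and inspects them.
    static String detectFile(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] head = new byte[3];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break; // file shorter than three bytes
                n += r;
            }
            return detect(Arrays.copyOf(head, n));
        }
    }
}
```

With the sample files from this post, `detectFile("utf8bom.txt")` would be expected to return "UTF-8", while both utf8.txt and ansi.txt would yield `null`.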

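Additional notes

A pitfall I hit during testing: on Windows, ansi.txt contained "一", but after saving and reopening the file, the content had turned into a UTF-8 "h".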
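A likely explanation, offered as a guess rather than something verified on the original machine: the two bytes D2 BB also happen to form a well-formed UTF-8 sequence, which decodes to U+04BB (Cyrillic small letter shha, "һ"), a glyph that looks very much like a Latin "h". An editor that guesses the encoding can therefore settle on UTF-8. The sketch below, with an illustrative class name, shows the decoding.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the GB2312 bytes of U+4E00 reinterpreted as UTF-8.
class AmbiguousBytes {

    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] gb2312One = {(byte) 0xD2, (byte) 0xBB};
        String text = decodeAsUtf8(gb2312One);
        // Print the code point rather than the glyph to avoid console encoding issues.
        System.out.printf("U+%04X%n", text.codePointAt(0)); // U+04BB
    }
}
```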

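When testing, avoid using a single character; the result is not reliable enough. After I replaced "一" with "大家好", the results were more accurate.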
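One reason longer samples help is that most multi-byte GB2312 text is not well-formed UTF-8, so a strict decoder can rule UTF-8 out. Below is a minimal sketch of that heuristic; the class `Utf8Check` is hypothetical, GB2312 is assumed to be what "ANSI" means here, and passing the check only means the bytes could be UTF-8, not that they are.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation as a rough encoding heuristic.
class Utf8Check {

    // True when the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumption: "ANSI" == GB2312
        byte[] single = "\u4e00".getBytes(gb2312);             // the one-character sample
        byte[] phrase = "\u5927\u5bb6\u597d".getBytes(gb2312); // the three-character sample
        System.out.println(isValidUtf8(single)); // true: D2 BB is also well-formed UTF-8
        System.out.println(isValidUtf8(phrase)); // false: the first byte B4 cannot start a UTF-8 sequence
    }
}
```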

To determine a file's encoding more accurately, use a third-party tool such as JUniversalCharDet.

Maven dependency

<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

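Detection method (adapted from code found online)

import org.mozilla.universalchardet.UniversalDetector;

public String getCharset(String filepath) throws FileNotFoundException {

    // Opened outside the try block so that a missing file still surfaces as FileNotFoundException
    InputStream is = new FileInputStream(new File(filepath));
    UniversalDetector detector = new UniversalDetector(null);
    // try-with-resources closes the stream once detection is finished
    try (InputStream in = is) {
        byte[] bytes = new byte[1024];
        int nread;
        // Keep feeding chunks until end of file or until the detector has reached a verdict
        while ((nread = in.read(bytes)) > 0 && !detector.isDone()) {
            detector.handleData(bytes, 0, nread);
        }
    } catch (IOException e) {
        // logger is assumed to be a field of the enclosing class
        logger.warn("Failed to read file for charset detection", e);
    }
    detector.dataEnd();
    String encode = detector.getDetectedCharset();
    // Fall back to UTF-8 when nothing could be detected
    if (StringUtils.isEmpty(encode)) {
        encode = "UTF-8";
    }
    detector.reset();
    return encode;
}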

Test method

@Test
public void test02() throws FileNotFoundException {

    String unicodeCharset = fileLearning.getCharset("E:\\unicode.txt");
    String ansiCharset = fileLearning.getCharset("E:\\ansi.txt");
    String utf8Charset = fileLearning.getCharset("E:\\utf8.txt");
    String utf8bomCharset = fileLearning.getCharset("E:\\utf8bom.txt");
    
    System.out.println("unicode charset: " + unicodeCharset);
    System.out.println("ansi charset: " + ansiCharset);
    System.out.println("utf8 charset: " + utf8Charset);
    System.out.println("utf8bom charset: " + utf8bomCharset);
}

Output

unicode charset: UTF-16LE
ansi charset: WINDOWS-1252
utf8 charset: UTF-8
utf8bom charset: UTF-8

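As the results show, detection is fairly reliable for the Unicode and UTF-8 files, although ansi.txt is reported as WINDOWS-1252 rather than GB2312.

Corrections and suggestions are welcome!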
