Java and Unicode

Original

iteye_19890

2018-08-27 19:27

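The story goes like this: the other day a colleague and I were discussing how to keep uploaded txt files from turning into garbled text, and the conversation raised the following questions:
[list]
[*]1. How can garbled text in uploaded files be prevented, whatever the language?
[*]2. When a String is built from a byte array as UTF-8, how does Java know how many bytes make up one Chinese character?
[*]3. What is the difference between UTF-8 and Unicode?
[*]4. How many chars and how many bytes does a UTF-8 string contain?
[/list]
Working through these questions gave me a deeper understanding of how Java, Unicode, and UTF-8 relate to one another.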

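[b]How can garbled text in uploaded files be prevented, whatever the language?[/b]
To support i18n, we have to require that uploaded files be encoded as UTF-8 or UTF-16 (the charset Java also accepts under the name "unicode"); otherwise not every language can be supported. A UTF-8 file may begin with the byte order mark EF BB BF, but the mark is optional and many UTF-8 files omit it, so its absence does not prove that a file is not UTF-8.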
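Since the EF BB BF mark cannot be relied on, one way to enforce the requirement is to decode the upload strictly and reject it on the first malformed byte. The following is a minimal sketch of that idea, not a complete upload handler; the class name UploadDecoder and the method decodeUtf8 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the class and method names are illustrative only.
final class UploadDecoder {

    /** Decodes the bytes as strict UTF-8, skipping a leading EF BB BF mark if present. */
    static String decodeUtf8(byte[] data) throws CharacterCodingException {
        int offset = 0;
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            offset = 3; // optional UTF-8 byte order mark
        }
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data, offset, data.length - offset))
                .toString();
    }

    public static void main(String[] args) {
        // U+4E2D in UTF-8, preceded by the byte order mark.
        byte[] utf8WithMark = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                               (byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        // The same character encoded as GBK, which is not valid UTF-8.
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0};
        for (byte[] upload : new byte[][] {utf8WithMark, gbk}) {
            try {
                System.out.println("accepted, chars: " + decodeUtf8(upload).length());
            } catch (CharacterCodingException e) {
                System.out.println("rejected: not valid UTF-8");
            }
        }
    }
}
```

Passing this check only shows that the bytes are well-formed UTF-8. Some files in other encodings, pure ASCII for instance, pass as well, which is harmless here because they decode to the same text.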

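[b]When a String is built from a byte array as UTF-8, how does Java know how many bytes make up one Chinese character?[/b]
UTF-8 is a variable-length encoding: some characters take one byte (ASCII, for example) and others take three (most Chinese characters, for example).
So when a String is constructed from a byte array, how does the JVM decide how many bytes to group into one character?
It turns out that UTF-8 marks this in the encoding itself. The high bits of the first byte of each character are 0*, 110*, 1110*, or 11110*, announcing a sequence of 1, 2, 3, or 4 bytes; for multi-byte sequences the number of leading 1 bits equals the number of bytes. Every byte after the first starts with 10*, which marks it as a continuation byte rather than the start of a character.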
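The prefix rule can be sketched in a few lines. This is only an illustration of reading the marker bits, not how the JDK decoder is actually written, and it does not check that a sequence is otherwise valid; the names Utf8LeadByte and sequenceLength are assumptions.

```java
// Hypothetical helper: names are illustrative only.
final class Utf8LeadByte {

    /**
     * Returns the sequence length announced by the high bits of a byte,
     * or -1 if the byte cannot start a character (for example a 10xxxxxx continuation byte).
     */
    static int sequenceLength(byte b) {
        int v = b & 0xFF;                 // treat the signed Java byte as 0..255
        if ((v & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        // 'a' followed by U+4E2D, encoded as UTF-8: 61 E4 B8 AD.
        byte[] utf8 = {0x61, (byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        int i = 0;
        while (i < utf8.length) {
            int len = sequenceLength(utf8[i]);
            if (len < 0) {
                throw new IllegalArgumentException("unexpected byte at index " + i);
            }
            System.out.println("character at byte " + i + " uses " + len + " byte(s)");
            i += len; // jump to the next lead byte
        }
    }
}
```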

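[b]What is the difference between UTF-8 and Unicode?[/b]
Unicode itself is a character set that assigns a code point to every character; it is not a byte encoding. What Java calls the "unicode" charset is UTF-16, which stores each character in the Basic Multilingual Plane as 2 bytes (and characters outside it as 4 bytes), so ASCII text wastes one byte per character. UTF-8 is a variable-length encoding of the same code points: ASCII takes only 1 byte, and a Chinese character typically takes 3 bytes.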
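The size difference is easy to check directly. The sketch below compares encoded lengths for an ASCII letter, a Chinese character, and a character outside the Basic Multilingual Plane; the class name EncodingSizes and the sample characters are chosen only for illustration, and UTF-16BE is used so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative only.
final class EncodingSizes {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE writes no byte order mark, so only the character data is counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";                 // U+0041
        String chinese = "\u4e2d";          // U+4E2D
        String emoji = "\uD83D\uDE00";      // U+1F600, outside the Basic Multilingual Plane
        System.out.println(utf8Length(ascii) + " vs " + utf16Length(ascii));     // 1 vs 2
        System.out.println(utf8Length(chinese) + " vs " + utf16Length(chinese)); // 3 vs 2
        System.out.println(utf8Length(emoji) + " vs " + utf16Length(emoji));     // 4 vs 4
    }
}
```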

[b]How many chars and how many bytes does a UTF-8 string contain?[/b]


    String s = "中國";
    byte[] b = s.getBytes("utf-8");                       // encode to UTF-8 bytes
    String s_utf8 = new String(b,"utf-8");                // decode back into a String
    System.out.println(s_utf8.getBytes("utf-8").length);  // number of UTF-8 bytes
    System.out.println(s_utf8.toCharArray().length);      // number of chars

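The result is:
6
2
Judging from this result, one char appears to be 3 bytes, yet a Java char is 2 bytes. Why?
In fact a Java String has no charset of its own. Whatever encoding the bytes came from, the String is a sequence of UTF-16 code units, and each char is one 2-byte code unit. The 6 is only the size of the UTF-8 encoding that getBytes("utf-8") produces on request. Characters outside the Basic Multilingual Plane occupy two chars.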
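One consequence worth seeing in code is that the number of chars is not always the number of characters. A small sketch, with the class name CharVsCodePoint and the sample string made up for illustration:

```java
// Hypothetical helper: names are illustrative only.
final class CharVsCodePoint {

    /** Counts Unicode code points rather than 16-bit chars. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+4E2D followed by U+1F600; the second needs a surrogate pair in UTF-16.
        String s = "\u4e2d\uD83D\uDE00";
        System.out.println(s.length());     // 3 chars (UTF-16 code units)
        System.out.println(codePoints(s));  // 2 characters (code points)
    }
}
```

For text limited to the Basic Multilingual Plane, such as the strings used in this post, the two counts are the same.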



import java.io.UnsupportedEncodingException;

public class TestUtf8File {

  public static void main(String[] args) throws UnsupportedEncodingException {

    String s = "中國人";
    // Round trip through UTF-8: encode to bytes, then decode back into a String.
    byte[] b = s.getBytes("utf-8");
    String s_utf8 = new String(b,"utf-8");
    System.out.println(s_utf8.getBytes("utf-8").length);
    System.out.println("utf-8 bytes:");
    printByteArray(s_utf8.getBytes("utf-8"));
    System.out.println("chars:");
    printCharArray(s_utf8.toCharArray());

    // "unicode" is accepted by the JDK as a name for UTF-16; its encoder writes a leading byte order mark.
    byte[] unicodeb = s.getBytes("unicode");
    String s_unicode = new String(unicodeb,"unicode");
    System.out.println("unicode bytes:");
    printByteArray(s_unicode.getBytes("unicode"));

  }

  // Prints each byte in hex, one per line. Java bytes are signed,
  // so values above 0x7f print as negatives (0xe4 prints as -1c).
  private static void printByteArray(byte[] b){
    for(int i = 0;i < b.length; i++){
      System.out.println((Integer.toString(b[i],16)));
    }
  }

  // Prints the high byte and then the low byte of each 16-bit char,
  // using the same signed formatting as printByteArray.
  private static void printCharArray(char[] c){
    for(int i = 0;i < c.length; i++){
      System.out.println(Integer.toString((byte)(c[i]>>8),16));
      System.out.println(Integer.toString((byte)(c[i]&0xff),16));
    }
  }

}

output:
9
utf-8 bytes:
-1c
-48
-53
-1b
-65
-43
-1c
-46
-46
chars:
4e
2d
56
-3
4e
-46
unicode bytes:
-2
-1
4e
2d
56
-3
4e
-46

The bytes are printed as signed values, so -2 -1 is FE FF, the big-endian byte order mark of the "unicode" (UTF-16) output.
FE FF: big endian
FF FE: little endian
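The signed values in the output are hard to read, and the mark only appears with some charset names. As a rough sketch, the helper below prints bytes as unsigned two-digit hex and compares Java's three standard UTF-16 charsets; the class name HexDump and the method toHex are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative only.
final class HexDump {

    /** Formats bytes as unsigned two-digit hex values separated by spaces. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF)); // mask to 0..255 before formatting
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // U+4E2D
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16)));   // fe ff 4e 2d
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE))); // 4e 2d
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16LE))); // 2d 4e
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));    // e4 b8 ad
    }
}
```

Only the plain UTF-16 charset writes the mark when encoding; UTF-16BE and UTF-16LE fix the byte order by name and emit no mark.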
