Java string 字符集編碼以及轉換

基本概念

關於字符集的種類常用的有utf-8, unicode，gbk，gbk2312等，詳細的字符集列表可以查看java.nio.charset.Charset類。

關鍵的字符集處理方法介紹如下：

String.getBytes()	獲取當前string表示的字符，在使用系統默認的字符集(關於系統默認的字符集後面詳細討論)時，所映射的二進制數據。不同的默認字符集生成的二進制數據是不同的，解碼時只有使用與該默認字符集相同的字符集才能獲得正確的字符。
String.getBytes(String charset)	獲取當前string表示的字符，在使用指定字符集時，所映射的二進制數據。傳入不同的字符集生成的二進制數據是不同，解碼時只有使用相同的字符集才能獲得正確的字符。
new String(byte[] data)	使用系統默認的字符集，將給定的二進制數據映射爲相應的字符，並構造成一個string。如果系統默認的字符集不變那麼可以通過String.getBytes()或String.getBytes(Charset.defaultCharset().displayName())來還原原來的二進制數據。
new String(byte[] data, String charset)	使用給定的字符集，將給定的二進制數據映射爲相應的字符，並構造成一個string。只有通過傳入相同的字符集才能還原原來的二進制數據，String.getBytes("charset")。

日常使用場景及解決方案

場景1——修改JVM系統字符集

系統默認的字符集是指，JVM運行時調用java.nio.Charset.defaultCharset().displayName()所顯示的字符集。我們有如下幾種方式更改JVM在運行時的系統字符集：

方法1

Properties pps=System. getProperties();

pps.put("file.encoding","<your-charset>");

System.setProperties(pps);

方法2

System.setProperty("file.encoding","<your-charset>");

方法3

java -D file.encoding=<your-charset>

上表中尖括弧斜體部分應該替換爲你想要的字符集。

需要注意的是，如果是在運行時更改了字符集，那麼再調用java.nio.Charset.defaultCharset().diaplayName()可能並不會變，因爲Charset源碼對default charset做了內容緩存，具體可查看Charset源碼：

private static volatile Charset defaultCharset;

/**
* Returns the default charset of this Java virtual machine.
*
* <p> The default charset is determined during virtual-machine startup and
* typically depends upon the locale and charset of the underlying
* operating system.
*
* @return A charset object for the default charset
*
* @since 1.5
*/
public static Charset defaultCharset() {
if (defaultCharset == null) {
synchronized (Charset.class) {
String csn = AccessController.doPrivileged(
new GetPropertyAction("file.encoding"));
Charset cs = lookup(csn);
if (cs != null)
defaultCharset = cs;
else
defaultCharset = forName("UTF-8");
}
}
return defaultCharset;
}

因此要獲取更改後的字符集編碼，使用方法1或方法2後，需要使用System.getProperty("file.encoding")獲取最新的字符集。在實際項目中，很多庫都會默認調用String.getBytes(), 而該方法中使用的是java.nio.Charset.defaultCharset().displayName()來獲取默認字符集。具體可查看String類和StringCoding類源碼：

String類

/**
* Encodes this {@code String} into a sequence of bytes using the
* platform's default charset, storing the result into a new byte array.
*
* <p> The behavior of this method when this string cannot be encoded in
* the default charset is unspecified. The {@link
* java.nio.charset.CharsetEncoder} class should be used when more control
* over the encoding process is required.
*
* @return The resultant byte array
*
* @since JDK1.1
*/
public byte[] getBytes() {
return StringCoding.encode(value, 0, value.length);
}

StringCoding類

static byte[] encode(char[] ca, int off, int len) {
String csn = Charset.defaultCharset().name();
try {
// use charset name encode() variant which provides caching.
return encode(csn, ca, off, len);
} catch (UnsupportedEncodingException x) {
warnUnsupportedCharset(csn);
}
try {
return encode("ISO-8859-1", ca, off, len);
} catch (UnsupportedEncodingException x) {
// If this code is hit during VM initialization, MessageUtils is
// the only way we will be able to get any kind of error message.
MessageUtils.err("ISO-8859-1 charset not available: "
+ x.toString());
// If we can not find ISO-8859-1 (a required encoding) then things
// are seriously wrong with the installation.
System.exit(1);
return null;
}
}

綜上所述，如果需要完美的修改系統默認的字符集，方法3最好。

場景2——網絡通信的字符集編碼

在開發中，前後端通信時出現亂碼問題，特別是中文更易出現亂碼。一般情況下，都是前後端默認的字符集和業務上需求的字符集不匹配。下面基於一個常見的具體情況，來做說明。

假設服務端默認的字符集編碼爲GBK，前端默認的字符集編碼爲GB2312。

如果雙方都是明文通信，並且http請求和響應的header中都正確指定了字符集(正確指定是指，內容的編碼與header中指定的字符集是匹配的)，那麼通信雙方，都從對方發過來的header中拿到charset，然後調用new String(byte[] httpBodyRawData, charset)即可。

如果雙方都是明文通信，但是並沒有正確指定header中的字符集，那麼就需要約定的字符集，如果約定的字符集爲UTF-8，那麼調用new String(byte[] httpBodyRawData, "UTF-8")即可。

如果雙方使用密文通信，約定的字符集爲UTF-8。假設需要加密的字符串爲toEncryStr, 解密後的原始數據爲byte[] decryData。加密分爲如下幾種情況：

如果加密算法接收byte[]類型，那麼需要：toEncryStr.getBytes("UTF-8")。

如果加密算法接收string類型，加密時調用String.getBytes("UTF-8")，那麼需要做的轉換是： new String(toEncryStr.getBytes("UTF-8"), "UTF-8")；

該轉換的意思是，首先將當前字符通過UTF-8字符集映射爲二進制數據(此時只有用UTF-8解碼才能正常顯示)，然後將二進制數據通過UTF-8字符集映射爲相應字符(一定是正常字符)。之後加密時，調用String.getBytes("UTF-8")時會將轉換後的字符(正常)還原爲UTF-8編碼的二進制數據。

解碼比較簡單，只需要：new String(decryData, "UTF-8")。// 約定的字符集爲UTF-8

Java string 字符集編碼以及轉換

[轉帖]使用NMT和pmap解決JVM資源泄漏問題原創

Python實現大麥網搶票的四大關鍵技術點解析

Python 安裝庫指令大全

salesforce零基礎學習（一百三十八）零碎知識點小總結（十）

一款開源的.NET程序集反編譯、編輯和調試神器

關於接口協議，你必須要知道這些！

基於 Milvus + LlamaIndex 實現高級 RAG

【2024-05-21】以茶會友

DownloadManager不好用？試試ZlsamDownloadService

軟件質量模型

狀態機的一般實現

如何進行代碼review

一課OO設計模式：抽象工廠

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結