What is the difference between UTF-8 and ISO-8859-1?

This article is translated from: What is the difference between UTF-8 and ISO-8859-1?

What is the difference between UTF-8 and ISO-8859-1?


Reply #1

Reference: https://stackoom.com/question/TZhR/UTF-和ISO-有什麼區別


Reply #2

ISO-8859-1 is a legacy standard from the 1980s. It can represent only 256 characters, so it is suitable only for some Western languages. Even for many of the languages it supports, some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters into it, you will see weird results. In other words, don't use it. Unicode has taken over the world, and UTF-8 is pretty much the standard these days unless you have legacy reasons (such as HTTP headers, which need to be compatible with everything).
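A minimal Python sketch of the "weird results" mentioned above. The sample string is an arbitrary choice for illustration; the point is only that ISO-8859-1 has no slot for these characters, so they are either rejected or silently replaced.

```python
text = "\u4e2d\u6587"  # two Chinese characters, chosen only for illustration

# UTF-8 can encode any Unicode character.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)

# ISO-8859-1 cannot represent these characters, so strict encoding fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)

# With a lossy error handler, each unencodable character becomes "?".
lossy = text.encode("iso-8859-1", errors="replace")
print(lossy)
```

What an editor actually does in this situation varies from tool to tool; the replacement with "?" shown here is just one common outcome.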


Reply #3

UTF

UTF is a family of multi-byte encoding schemes that represent Unicode code points, which can in principle stand for up to 2^31 [roughly 2 billion] characters. UTF-8 is a flexible encoding system that uses between 1 and 4 bytes to represent the first 2^21 [roughly 2 million] code points.

Long story short: any character with a code point/ordinal value of 127 or below, aka 7-bit-safe ASCII, is represented by the same 1-byte sequence as in most single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes, with the particulars of the encoding best explained here.
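The variable-length behaviour described above can be checked with a short Python sketch. The four sample characters are arbitrary picks, one for each encoded length.

```python
# One sample character per UTF-8 length (illustrative choices only).
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    # Print the code point, the byte count, and the bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Only the first character, which is plain ASCII, fits in a single byte; the others take 2, 3, and 4 bytes respectively.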

ISO-8859

ISO-8859 is a family of single-byte encoding schemes used to represent alphabets that fit within the byte range 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of which is likely ISO-8859-1, aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of the encoding family used.
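To illustrate how the "parts" differ, here is a small Python sketch that decodes the same byte under three ISO-8859 parts. The byte 0xE9 and the three parts are arbitrary choices for the example.

```python
raw = bytes([0xE9])  # a single byte in the upper half of the 8-bit range

# The same byte maps to a different letter in each part of ISO-8859.
decoded = {
    name: raw.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}
for name, ch in decoded.items():
    print(f"{name}: U+{ord(ch):04X} {ch}")

# Bytes in the ASCII range decode identically in every part.
ascii_same = {b"A".decode(name) for name in decoded} == {"A"}
print(ascii_same)
```

The byte alone does not say which alphabet was meant, which is why text in these encodings has to carry its charset label out of band.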

The drawback of this encoding scheme is its inability to accommodate languages composed of more than 128 symbols, or to safely display more than one family of symbols at a time. ISO-8859 encodings have also fallen out of favor with the rise of UTF. The ISO "Working Group" in charge of them disbanded in 2004, leaving maintenance to its parent subcommittee.


Reply #4

My reason for researching this question was to find out in what way the two are compatible. The Latin1 charset (ISO-8859-1) is 100% compatible with being stored in a UTF-8 datastore, because every Latin1 character has a UTF-8 representation. ASCII characters are stored as a single byte; the extended characters (128 to 255) take two bytes each in UTF-8.

Going the other way, from UTF-8 to the Latin1 charset, may or may not work. If there are any characters with code points above 255, they cannot be stored in a Latin1 datastore.
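A rough Python sketch of both directions. The helper name `fits_latin1` and the sample strings are made up for illustration. Note that a character such as é is one byte in ISO-8859-1 but two bytes in UTF-8, so "compatible" here means "representable", not "byte-identical".

```python
latin1_text = "caf\u00e9"  # "café", every character is in ISO-8859-1

as_latin1 = latin1_text.encode("iso-8859-1")
as_utf8 = latin1_text.encode("utf-8")
print(len(as_latin1), len(as_utf8))  # byte counts differ because of the é

# Latin1 -> UTF-8 always works: decode with one codec, encode with the other.
transcoded = as_latin1.decode("iso-8859-1").encode("utf-8")
print(transcoded == as_utf8)


def fits_latin1(s: str) -> bool:
    """Hypothetical helper: True if every character of s exists in ISO-8859-1."""
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 -> Latin1 works only when no code point exceeds 255.
print(fits_latin1("caf\u00e9"))  # all code points <= 255
print(fits_latin1("\u20ac5"))    # the euro sign is U+20AC, beyond 255
```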


Reply #5

From another perspective, files that both Unicode (UTF-8) and ASCII decoders fail to read because they contain the byte 0xc0 seem to be read properly as iso-8859-1. The caveat, of course, is that the file must not actually contain multi-byte Unicode characters.
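The observation above can be reproduced with a short Python sketch. The byte string and the helper `try_decode` are invented for the example; 0xC0 is never valid in UTF-8 and lies outside the ASCII range, while ISO-8859-1 assigns a character to every byte value.

```python
data = b"\xc0 la carte"  # starts with the byte 0xC0


def try_decode(raw: bytes, encoding: str):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


results = {enc: try_decode(data, enc) for enc in ("ascii", "utf-8", "iso-8859-1")}
for enc, text in results.items():
    print(f"{enc}: {text!r}")
```

Because ISO-8859-1 never fails to decode, a successful read proves nothing about the file's real encoding; it only yields sensible text if the file was in fact written as ISO-8859-1.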


Reply #6

  • ASCII: 7 bits. 128 code points.

  • ISO-8859-1: 8 bits. 256 code points.

  • UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.
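The figures in the list above can be reproduced with a little arithmetic, sketched here in Python. The UTF-8 count is the Unicode range U+0000 to U+10FFFF minus the surrogate range U+D800 to U+DFFF, which UTF-8 does not encode.

```python
ascii_points = 2 ** 7   # 7-bit code: 128
latin1_points = 2 ** 8  # 8-bit code: 256

unicode_range = 0x10FFFF + 1       # all code points U+0000..U+10FFFF
surrogates = 0xDFFF - 0xD800 + 1   # reserved for UTF-16, excluded from UTF-8
utf8_points = unicode_range - surrogates

print(ascii_points, latin1_points, utf8_points)
```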

Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:

#!/usr/bin/env python3

c = chr(0xa9)
print(c)
print(c.encode('utf-8'))
print(c.encode('iso-8859-1'))

Output:

©
b'\xc2\xa9'
b'\xa9'
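
As a follow-up sketch (not part of the original answer), the same incompatibility shows up when decoding: bytes written in one encoding and read in the other either turn into the wrong characters or fail outright.

```python
c = chr(0xA9)  # the copyright sign from the example above

utf8_bytes = c.encode("utf-8")         # two bytes
latin1_bytes = c.encode("iso-8859-1")  # one byte

# UTF-8 bytes read as ISO-8859-1: no error, but two wrong characters appear.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# ISO-8859-1 bytes read as UTF-8: a lone 0xA9 is not a valid sequence.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```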