5 HTML Document Representation

In this chapter, we discuss how HTML documents are represented on a computer and over the Internet.

The section on the document character set addresses the issue of what abstract characters may be part of an HTML document. Characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.

The section on character encodings addresses the issue of how those characters may be represented in a file or when transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.

Since there are a great number of characters throughout human languages, and a great variety of ways to represent those characters, proper care must be taken so that documents may be understood by user agents around the world.

5.1 The Document Character Set

To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of:

A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.

Each SGML document (including each HTML document) is a sequence of characters from the repertoire. Computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.
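
As a minimal sketch of the code-position idea, the Python snippet below maps the three ASCII code positions from the example above to their characters and back. Python's built-in `chr` and `ord` are used only as a convenient illustration, and the variable names are assumptions, not part of the specification.

```python
# Code positions are plain integers; chr() maps a position to its character.
for position in (65, 66, 67):
    print(position, chr(position))

# ord() performs the reverse lookup, from character to code position.
code_positions = [ord(ch) for ch in "ABC"]
print(code_positions)
```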

The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.

The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, "[ISO10646]" is used to refer to the document character set while "[UNICODE]" is reserved for references to the Unicode bidirectional text algorithm.

The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.

5.2 Character encodings

What this specification calls a character encoding is known by different names in other specifications (which may cause some confusion). However, the concept is largely the same across the Internet. Also, protocol headers, attributes, and parameters referring to character encodings share the same name -- "charset" -- and use the same values from the [IANA] registry (see [CHARSETS] for a complete list).

The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.

A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as [ISO10646]. There are several different encodings of parts of [ISO10646] in addition to encodings of the entire character set (such as UCS-4).
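
The following Python sketch illustrates two points from this section: the same bytes produce different characters depending on the assumed "charset", and one byte per character cannot cover a repertoire as large as ISO 10646. It relies on Python's standard codecs; `utf-32-be` stands in for UCS-4 here, which is an assumption made for illustration, and all variable names are hypothetical.

```python
# The same bytes yield different characters under different "charset" values.
raw = b"\xc3\xa9"
as_utf8 = raw.decode("utf-8")          # one character, U+00E9
as_latin1 = raw.decode("iso-8859-1")   # two characters, U+00C3 U+00A9
print(len(as_utf8), len(as_latin1))

# Bytes needed per character in three encodings of ISO 10646.
byte_counts = {}
for ch in ("A", "\u6c34"):  # Latin letter A, Chinese character for "water"
    byte_counts[ch] = {
        name: len(ch.encode(name))
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }
    print(f"U+{ord(ch):04X}", byte_counts[ch])

# A one-byte-per-character encoding cannot represent U+6C34 at all.
try:
    "\u6c34".encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False
print("ISO-8859-1 can encode U+6C34:", latin1_can_encode)
```

Under UTF-8 the Latin letter takes one byte and the Chinese character three, which is the variable-length behavior described later for that encoding.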

5.2.1 Choosing an encoding

Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding.
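
A short Python sketch of how an occasional character outside the chosen encoding can be carried as a numeric character reference. It uses the standard `xmlcharrefreplace` error handler and `html.unescape`; the sample text and variable names are assumptions chosen for illustration.

```python
import html

text = "water: \u6c34"  # U+6C34 is outside ISO-8859-1

# Characters the encoding cannot represent are written as numeric character
# references, whose number is the code position in the document character set.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)

# A reader resolves the reference against the document character set,
# independently of the byte encoding used for transport.
decoded = html.unescape(encoded.decode("iso-8859-1"))
print(decoded == text)
```

The reference number 27700 is the decimal form of code position 6C34 (hexadecimal), so it would be the same whichever byte encoding carried the document.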

Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2616], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.
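
Transcoding can be sketched in Python as a decode followed by an encode. The function name `transcode` and its parameters are hypothetical, and falling back to numeric character references for unrepresentable characters is an assumption of this sketch, not a behavior the specification requires of servers or proxies.

```python
def transcode(body: bytes, source: str, target: str) -> bytes:
    """Re-encode a document body from `source` to `target`.

    Characters the target encoding cannot represent become numeric
    character references (an assumption of this sketch).
    """
    return body.decode(source).encode(target, errors="xmlcharrefreplace")


original = "caf\u00e9 \u6c34".encode("utf-8")
latin1_body = transcode(original, "utf-8", "iso-8859-1")
print(latin1_body)
```

A real transcoder would also need to update any encoding label sent with the document and to avoid rewriting content where character references are not interpreted; both are left out of this sketch.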

Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
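
Python's codec registry also treats encoding names case-insensitively, which makes it a convenient way to illustrate the equivalence described above. This sketch shows Python's behavior only; it is not the IANA registry, and the variable names are assumptions.

```python
import codecs

labels = ["SHIFT_JIS", "Shift_JIS", "shift_jis"]

# codecs.lookup() normalizes the label, so all three spellings resolve
# to the same codec.
canonical = {codecs.lookup(label).name for label in labels}
print(canonical)
```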

This specification does not mandate which character encodings a user agent must support.

Conforming user agents must correctly map to ISO 10646 all characters in any character encodings that they recognize (or they must behave as if they did).

Notes on specific encodings

When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", high-order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], clause C3, page 3-1.

Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal FFFE as the first bytes of a text would know that bytes have to be reversed for the remainder of the text.
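
The byte-order check described above can be sketched in Python as follows. The helper name `decode_utf16` is hypothetical, and defaulting to big-endian when no BOM is present is an assumption based on the network-byte-order recommendation in this section.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE: byte-reversed
        return data[2:].decode("utf-16-le")
    # No BOM: assume network byte order (big-endian).
    return data.decode("utf-16-be")


big = codecs.BOM_UTF16_BE + "A\u6c34".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "A\u6c34".encode("utf-16-le")
print(decode_utf16(big) == decode_utf16(little))
```

Both inputs decode to the same two characters, because the BOM tells the reader which of the two byte orders the sender used.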

The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used. For information about ISO 8859-8 and the bidirectional algorithm, please consult the section on bidirectionality and character encoding.
