Unicode (Part 1): What Is Unicode?


Preface

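As the title suggests, this note tries to sort out some basic questions about Unicode. For accuracy, I quote explanatory text written by the Unicode Consortium itself or by people close to it (any bold or underlined emphasis in the quotations is mine). In the future I hope to cover further topics such as big-endian versus little-endian byte order.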

The Origins of Unicode

Let's start with a basic picture of how computers handle text: “Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different systems, called character encodings, for assigning these numbers.”[1]

Figure 1

Figure 1 shows many different encodings, such as Big5 for Traditional Chinese and Shift-JIS for Japanese. “In the past, different organizations have assembled different sets of characters and created encodings for them – one set may cover just Latin-based Western European languages (excluding EU countries such as Bulgaria or Greece), another may cover a particular Far Eastern language (such as Japanese), others may be one of many sets devised in a rather ad hoc way for representing another language somewhere in the world.”[2]

These early encodings, however, caused problems for software development. Many early internet users will remember opening a web page only to see garbled text, and having to switch the encoding before it displayed correctly (as in Figure 1). “The pre-existing legacy character encodings were both inconsistent and incomplete—two encodings could use the same codes for two different characters and use different codes for the same characters, while none of the encodings handled any more than a small fraction of the world's languages. Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption. Programs often were written only to support particular encodings, making development of international versions expensive.”[3]
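The conflict described in this quotation can be reproduced with a small Python sketch. The sample character and the choice of the big5 and shift_jis codecs are my own illustrative assumptions, not something taken from the cited sources.

```python
# Illustrative sketch: the sample character and codec names are assumptions.
text = "中"

# "Different codes for the same character": each legacy encoding
# assigns its own byte sequence to the same character.
big5_bytes = text.encode("big5")
sjis_bytes = text.encode("shift_jis")
print(big5_bytes.hex(), sjis_bytes.hex())

# "The same codes for two different characters": reading Big5 bytes
# as if they were Shift-JIS yields different text (garbled output).
misread = big5_bytes.decode("shift_jis", errors="replace")
print(repr(misread))

# Decoding with the encoding that produced the bytes recovers the text.
roundtrip = big5_bytes.decode("big5")
print(roundtrip == text)
```

This is the same situation as the garbled web page: the bytes are unchanged, and only the encoding used to interpret them differs.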

This is how Unicode came about: “The Unicode Standard began with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard.”[3]

The Difference Between Character and Glyph

To better understand Unicode's role, let's first look at the difference between a character and a glyph: “The mark made on screen or paper, called a glyph, is a visual representation of the character. . . . Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. . . . Glyphs represent the shapes that characters can have when they are rendered or displayed.”[3]

Figure 2 (screenshot from [3])

As for what a font is: “A font represents an organized collection of glyphs in which the various glyph representations will share a common look or styling such that, when a string of characters is rendered together, the result is highly legible, conveys a particular artistic style and provides consistent inter-character alignment and spacing.”[4]

Back to Unicode's role: Unicode deals with characters, not glyphs. “The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered. . . . The Unicode Standard does not specify the precise shape, size, or orientation of on-screen characters. . . . Glyph shape and methods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard.”[3]

Codespace and Code Point

“A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).”[2] “The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.”[3]

Back to Unicode. Unicode writes a code point as follows: “an individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.”[3] For example, U+0041, which appears in Figure 2, is the code point of LATIN CAPITAL LETTER A. As for Unicode's codespace: “In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.”[3] (In the original, the 10FFFF in this quotation carries a subscript 16; the vocus editor simply has no subscript support.) 10FFFF is hexadecimal notation; in ordinary decimal it is 1,114,111, so counting from 0 gives the 1,114,112 code points mentioned in the quotation.
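The U+n notation and the size of the codespace can be checked with a short Python sketch. The helper name u_plus and the constant MAX_CODE_POINT are hypothetical names I introduce for illustration; ord, chr, and unicodedata.name are standard Python.

```python
import unicodedata

# Upper end of the Unicode codespace, as stated in the quotation.
MAX_CODE_POINT = 0x10FFFF


def u_plus(code_point: int) -> str:
    """Format a code point as U+n: at least four uppercase hex digits."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode codespace")
    return f"U+{code_point:04X}"


# The examples from the quotation, padded to four digits only when needed.
for cp in (0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345):
    print(u_plus(cp))

# U+0041 is LATIN CAPITAL LETTER A.
print(u_plus(ord("A")), unicodedata.name("A"))

# 10FFFF in hexadecimal is 1114111 in decimal; including 0,
# that makes 1114112 code points.
print(MAX_CODE_POINT, MAX_CODE_POINT + 1)
```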

Figure 3

UTF-8, UTF-16, and UTF-32

UTF-8, which appears in Figure 1, is what is called an encoding form: “Actual implementations in computer systems represent integers in specific code units of particular size—usually 8-bit (= byte), 16-bit, or 32-bit. In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. . . . All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; . . . Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data.”[3]

Figure 4 (screenshot from [3])

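Figure 4 shows the code unit values of four characters under the different encoding forms. From left to right, the code points of the four characters are U+0041, U+03A9, U+8A9E, and U+10384. The code unit values in Figure 4 are written in hexadecimal. Converted to binary (see Figure 5 for hexadecimal-to-binary conversion), they confirm what the preceding quotation says: the code units of UTF-8, UTF-16, and UTF-32 are 8, 16, and 32 bits wide, respectively. For example, hexadecimal 41 is 01000001 in binary, 8 bits in total.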
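The code unit values for these four code points can be reproduced with Python's built-in codecs. This is only a sketch: the helper code_units is a hypothetical name, and I use the big-endian codec variants purely so that each 16-bit or 32-bit unit prints in its natural digit order without a byte order mark. Byte order itself is outside the scope of this note.

```python
# Width of one code unit, in bytes, for each encoding form.
UNIT_BYTES = {"utf-8": 1, "utf-16": 2, "utf-32": 4}


def code_units(ch: str, form: str) -> list[str]:
    """Return the code units of `ch` in `form` as uppercase hex strings."""
    width = UNIT_BYTES[form]
    # Big-endian variants are an assumption made for display purposes only.
    codec = form if width == 1 else form + "-be"
    data = ch.encode(codec)
    return [data[i:i + width].hex().upper() for i in range(0, len(data), width)]


for cp in (0x0041, 0x03A9, 0x8A9E, 0x10384):
    ch = chr(cp)
    print(f"U+{cp:04X}",
          code_units(ch, "utf-8"),
          code_units(ch, "utf-16"),
          code_units(ch, "utf-32"))

# Hexadecimal 41 written as an 8-bit binary number.
print(format(0x41, "08b"))  # 01000001
```

For U+10384, for instance, this gives four 8-bit units in UTF-8 (F0 90 8E 84), two 16-bit units in UTF-16 (D800 DF84), and one 32-bit unit in UTF-32 (00010384).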

Figure 5 (screenshot from [5])

This article does not go into how an encoding form converts a code point into code unit values; it only briefly describes some differences among UTF-8, UTF-16, and UTF-32. In Figure 4, “the UTF-32 line shows that each example character can be expressed with one 32-bit code unit. Those code units have the same values as the code point for the character. . . . In UTF-8, a character may be expressed with one, two, three, or four bytes, and the relationship between those byte values and the code point value is more complex. . . . The value of each UTF-32 code unit corresponds exactly to the Unicode code point value. This situation differs significantly from that for UTF-16 and especially UTF-8, where the code unit values often change unrecognizably from the code point value.”[3] Let's look at another example:

Figure 6 (screenshot from [3])

UTF-8, UTF-16, and UTF-32 each suit different situations. For example, given the properties of UTF-32 described above, it is easy to see why: “UTF-32 may be a preferred encoding form where memory or disk storage space for characters is not a particular concern, but where fixed-width, single code unit access to characters is desired.”[3]
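The trade-off in this quotation can be sketched in Python. The two helper functions are hypothetical and written only for illustration; the sample text is the four characters of Figure 4, and the big-endian UTF-32 codec is an assumption made to avoid a byte order mark.

```python
# The four characters from Figure 4: U+0041, U+03A9, U+8A9E, U+10384.
text = "\u0041\u03A9\u8A9E\U00010384"

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-be")  # big-endian chosen only to skip the BOM


def char_at_utf32(data: bytes, index: int) -> str:
    """Fixed width: jump straight to the 4-byte code unit at `index`."""
    unit = data[4 * index:4 * index + 4]
    # In UTF-32 the code unit value is the code point itself.
    return chr(int.from_bytes(unit, "big"))


def char_at_utf8(data: bytes, index: int) -> str:
    """Variable width: scan for lead bytes to find where each character starts."""
    # Continuation bytes have the bit pattern 10xxxxxx; all others start a character.
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    begin = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(data)
    return data[begin:end].decode("utf-8")


print(char_at_utf32(utf32, 2), char_at_utf8(utf8, 2))

# Storage cost of the same four characters in each form.
print(len(utf8), len(utf32))  # 10 bytes versus 16 bytes
```

UTF-32 reaches the character with one multiplication, at the price of more storage; UTF-8 is more compact here but has to scan to find character boundaries.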

A Historical Note

When reading the literature on character encodings, it pays to check the publication date. For example, older books (such as Standard C++ IOStreams and Locales by Angelika Langer and Klaus Kreft) state that Unicode is a 16-bit encoding. The Unicode website explains: “The first version of Unicode was a 16-bit encoding, from 1991 to 1995, but starting with Unicode 2.0 (July, 1996), it has not been a 16-bit encoding. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.”[6]
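The numbers in this quotation are easy to confirm in Python. This is a minimal sketch; unit_counts is a hypothetical helper, and the sample code points are the ones from Figure 4 plus the top of the codespace.

```python
MAX_CODE_POINT = 0x10FFFF

# The largest code point needs 21 bits, so 16 bits cannot cover the codespace.
print(MAX_CODE_POINT.bit_length())   # 21
print(2 ** 16 < MAX_CODE_POINT + 1)  # True


def unit_counts(code_point: int) -> tuple[int, int, int]:
    """Number of code units in (UTF-8, UTF-16, UTF-32) for one code point."""
    ch = chr(code_point)
    return (
        len(ch.encode("utf-8")),           # 8-bit units
        len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    )


for cp in (0x0041, 0x03A9, 0x8A9E, 0x10384, MAX_CODE_POINT):
    print(f"U+{cp:04X}", unit_counts(cp))
```

The counts stay within the ranges the FAQ gives: one to four bytes in UTF-8, one or two units in UTF-16, and always a single unit in UTF-32.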



[1] https://www.unicode.org/standard/WhatIsUnicode.html

[2] Ishida R. Character encodings: Essential concepts. https://www.w3.org/International/articles/definitions-characters/

[3] Unicode Consortium. The Unicode Standard (Version 14.0).

[4] https://www.w3.org/TR/SVG11/intro.html

[5] Stroustrup B. Programming: Principles and practice using C++ (2nd Edition).

[6] https://www.unicode.org/faq/utf_bom.html

Josh Yao's Salon