Updated 2024/12/01 · Reading time: about 18 minutes

Unicode (1): What Is Unicode?

Preface

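As the title suggests, this note tries to sort out some basic questions about Unicode. To keep the information accurate, I quote explanatory text written by the Unicode Consortium itself or by people close to it (bold and underlining in the quotations are my additions). In the future I hope to cover further topics such as big-endian/little-endian.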

The Origins of Unicode

Let us start with a simple picture of how computers handle text: “Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different systems, called character encodings, for assigning these numbers.”[1]
Figure 1
Figure 1 shows many different encodings, for example Big5, used for Traditional Chinese, and Shift-JIS, used for Japanese. “In the past, different organizations have assembled different sets of characters and created encodings for them – one set may cover just Latin-based Western European languages (excluding EU countries such as Bulgaria or Greece), another may cover a particular Far Eastern language (such as Japanese), others may be one of many sets devised in a rather ad hoc way for representing another language somewhere in the world.”[2]
These early encodings, however, caused problems for software development. Many early Internet users will remember opening a web page, seeing garbled text, and having to switch the encoding before the page displayed correctly (as in Figure 1). “The pre-existing legacy character encodings were both inconsistent and incomplete—two encodings could use the same codes for two different characters and use different codes for the same characters, while none of the encodings handled any more than a small fraction of the world's languages. Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption. Programs often were written only to support particular encodings, making development of international versions expensive.”[3]
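The inconsistency described in the quotation can be reproduced in a few lines. This is only a sketch using Python's built-in codecs; the byte 0xA4 and the sample string are my own choices for illustration and do not come from the quoted sources.

```python
# One byte value, two legacy encodings, two different characters.
raw = bytes([0xA4])
as_latin1 = raw.decode("latin-1")      # ISO 8859-1
as_latin9 = raw.decode("iso8859_15")   # ISO 8859-15

# Text saved as Big5 but read back as Shift-JIS does not survive.
original = "中文"
big5_bytes = original.encode("big5")
misread = big5_bytes.decode("shift_jis", errors="replace")  # wrong codec
roundtrip = big5_bytes.decode("big5")                       # right codec

print(repr(as_latin1), repr(as_latin9))
print(repr(misread), repr(roundtrip))
```

The bytes themselves never change; only the table used to interpret them does, which is why switching the encoding in the browser used to "fix" a garbled page.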
Hence Unicode was born: “The Unicode Standard began with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard.”[3]

The Difference Between Character and Glyph

To understand Unicode's role better, let us first look at the difference between a character and a glyph: “The mark made on screen or paper, called a glyph, is a visual representation of the character. . . . Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. . . . Glyphs represent the shapes that characters can have when they are rendered or displayed.”[3]
Figure 2 (screenshot from [3])
As for what is called a font: “A font represents an organized collection of glyphs in which the various glyph representations will share a common look or styling such that, when a string of characters is rendered together, the result is highly legible, conveys a particular artistic style and provides consistent inter-character alignment and spacing.”[4]
Returning to Unicode's role, Unicode deals with characters, not glyphs: “The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered. . . . The Unicode Standard does not specify the precise shape, size, or orientation of on-screen characters. . . . Glyph shape and methods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard.”[3]

Codespace and Code Point

“A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).”[2]“The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.”[3]
Returning to Unicode, its notation for a code point is: “an individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.”[3] For example, U+0041, which appears in Figure 2, is the code point of LATIN CAPITAL LETTER A. As for the Unicode codespace: “In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.”[3] (In the original, the 10FFFF in the preceding quotation carries a subscript 16; the vocus editor simply has no subscript feature.) 10FFFF is hexadecimal notation; in the more familiar decimal notation it is 1114111, so the range from 0 through 1114111 contains 1,114,112 code points.
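The U+n notation and the size of the codespace can be checked with a short sketch. The helper name `u_plus` and the constant `MAX_CODE_POINT` are hypothetical names introduced here for illustration; `ord` and `unicodedata.name` are from the Python standard library.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # last integer in the Unicode codespace

def u_plus(code_point: int) -> str:
    """Format an integer as U+n, padded to at least four hex digits (hypothetical helper)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode codespace")
    return f"U+{code_point:04X}"

# The character from Figure 2: its code point and its character name.
print(u_plus(ord("A")), unicodedata.name("A"))

# The examples listed in the quotation.
for n in (0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345):
    print(u_plus(n))

# Largest code point in decimal, and the number of code points (0 is included).
print(MAX_CODE_POINT, MAX_CODE_POINT + 1)
```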
Figure 3

UTF-8, UTF-16, and UTF-32

As for UTF-8, which appears in Figure 1, it is what is called an encoding form: “Actual implementations in computer systems represent integers in specific code units of particular size—usually 8-bit (= byte), 16-bit, or 32-bit. In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. . . . All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; . . . Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data.”[3]
Figure 4 (screenshot from [3])
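Figure 4 shows the code unit values of four characters under the different encoding forms; from left to right, the code points of the four characters are U+0041, U+03A9, U+8A9E, and U+10384. The code unit values in Figure 4 are written in hexadecimal. Converting them to binary (see Figure 5 for the conversion between hexadecimal and binary) confirms what the quotation above says: the code units of UTF-8, UTF-16, and UTF-32 are 8-bit, 16-bit, and 32-bit respectively (for example, hexadecimal 41 is 01000001 in binary, 8 bits in total).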
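The code unit values for these four characters can be recomputed with Python's built-in codecs. This is a minimal sketch: `code_units` and `UNIT_BYTES` are hypothetical names of my own, and big-endian byte order is an assumption made only so that each group of bytes reads directly as one number and no byte order mark is added.

```python
UNIT_BYTES = {"utf-8": 1, "utf-16": 2, "utf-32": 4}  # bytes per code unit

def code_units(text: str, encoding_form: str) -> list:
    """Return the code unit values of text in the given encoding form."""
    width = UNIT_BYTES[encoding_form]
    # Assumption: big-endian, so each group of bytes reads as one integer
    # and no byte order mark is prepended.
    codec = "utf-8" if width == 1 else encoding_form + "-be"
    data = text.encode(codec)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]

# The four characters discussed above: U+0041, U+03A9, U+8A9E, U+10384.
for ch in "A\u03A9\u8A9E\U00010384":
    print(f"U+{ord(ch):04X}")
    for form in ("utf-32", "utf-16", "utf-8"):
        digits = 2 * UNIT_BYTES[form]
        units = " ".join(f"{u:0{digits}X}" for u in code_units(ch, form))
        print(f"  {form}: {units}")
```

In every case the single UTF-32 code unit equals the code point itself, while U+10384 needs two UTF-16 code units and four UTF-8 code units.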
Figure 5 (screenshot from [5])
This article does not detail how an encoding form turns a code point into code unit values; it only briefly describes some differences among UTF-8, UTF-16, and UTF-32. In Figure 4, “the UTF-32 line shows that each example character can be expressed with one 32-bit code unit. Those code units have the same values as the code point for the character. . . . In UTF-8, a character may be expressed with one, two, three, or four bytes, and the relationship between those byte values and the code point value is more complex. . . . The value of each UTF-32 code unit corresponds exactly to the Unicode code point value. This situation differs significantly from that for UTF-16 and especially UTF-8, where the code unit values often change unrecognizably from the code point value.”[3] Let us look at another example:
Figure 6 (screenshot from [3])
UTF-8, UTF-16, and UTF-32 each suit different settings. For example, given the property of UTF-32 described in the previous paragraph, it is easy to see why: “UTF-32 may be a preferred encoding form where memory or disk storage space for characters is not a particular concern, but where fixed-width, single code unit access to characters is desired.”[3]
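The trade-off in this quotation can be illustrated with a rough sketch. The two lookup functions below are hypothetical helpers written for this note, not part of any library. The UTF-8 version relies on the fact that continuation bytes have the bit pattern 10xxxxxx, and both assume well-formed input.

```python
text = "A\u03A9\u8A9E\U00010384"  # U+0041, U+03A9, U+8A9E, U+10384

utf32 = text.encode("utf-32-be")  # big-endian chosen to avoid a byte order mark
utf8 = text.encode("utf-8")

def nth_code_point_utf32(data: bytes, n: int) -> int:
    """Direct lookup: every character occupies exactly 4 bytes."""
    return int.from_bytes(data[4 * n:4 * n + 4], "big")

def nth_code_point_utf8(data: bytes, n: int) -> int:
    """Scan for character starts by skipping continuation bytes (10xxxxxx)."""
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return ord(data[starts[n]:end].decode("utf-8"))

print(len(utf32), len(utf8))                # storage in bytes for the same text
print(hex(nth_code_point_utf32(utf32, 3)))  # found by arithmetic on the index
print(hex(nth_code_point_utf8(utf8, 3)))    # found only after scanning the bytes
```

UTF-32 spends more bytes on the same text, but the position of the n-th character follows from simple multiplication; in UTF-8 the position depends on the widths of all the characters before it.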

A Historical Note

When reading literature on character encoding, it pays to note when the work was published. For example, older books (such as Standard C++ IOStreams and Locales by Angelika Langer and Klaus Kreft) state that Unicode is a 16-bit encoding. The Unicode website explains: “The first version of Unicode was a 16-bit encoding, from 1991 to 1995, but starting with Unicode 2.0 (July, 1996), it has not been a 16-bit encoding. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.”[6]
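The figures in this quotation can be verified with a small sketch. The function `unit_counts` is a hypothetical helper, and the sample code points are boundary values I picked for illustration; surrogate code points (U+D800 to U+DFFF) are deliberately left out because Python will not encode them.

```python
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT.bit_length())  # number of bits needed for the largest code point

def unit_counts(code_point: int) -> tuple:
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units) for one code point."""
    ch = chr(code_point)
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")) // 2,  # 2 bytes per 16-bit code unit
        len(ch.encode("utf-32-be")) // 4,  # 4 bytes per 32-bit code unit
    )

for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, MAX_CODE_POINT):
    print(f"U+{cp:04X}", unit_counts(cp))
```

Code points above U+FFFF do not fit in a single 16-bit code unit, which is why a description of Unicode as a 16-bit encoding no longer holds.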

[2] Ishida R. Character encodings: Essential concepts. https://www.w3.org/International/articles/definitions-characters/
[3] Unicode Consortium. The Unicode Standard (Version 14.0).
[5] Stroustrup B. Programming: Principles and practice using C++ (2nd Edition).