Unicode (Part 1): What Is Unicode

Updated 2024/12/01

Preface

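As the title suggests, this note tries to sort out some basic questions about Unicode. To keep the information accurate, I quote directly from explanations written by the Unicode Consortium or by people involved with it (bold and underlining inside the quotations are my additions). In the future I hope to go on to topics such as big-endian/little-endian.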

The Origins of Unicode

Let us start with a simple picture of how computers handle text: “Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different systems, called character encodings, for assigning these numbers.”[1]
Figure 1
Figure 1 shows many different encodings, such as Big5, used for Traditional Chinese, and Shift-JIS, used for Japanese. “In the past, different organizations have assembled different sets of characters and created encodings for them – one set may cover just Latin-based Western European languages (excluding EU countries such as Bulgaria or Greece), another may cover a particular Far Eastern language (such as Japanese), others may be one of many sets devised in a rather ad hoc way for representing another language somewhere in the world.”[2]
These early encodings, however, caused problems for software development. Many early internet users probably remember opening a web page, seeing garbled text, and having to switch the encoding before it displayed properly (as in Figure 1). “The pre-existing legacy character encodings were both inconsistent and incomplete—two encodings could use the same codes for two different characters and use different codes for the same characters, while none of the encodings handled any more than a small fraction of the world's languages. Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption. Programs often were written only to support particular encodings, making development of international versions expensive.”[3]
This is why Unicode was created: “The Unicode Standard began with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard.”[3]
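To make the inconsistency among legacy encodings concrete, here is a minimal Python sketch. It assumes that Python's built-in `big5` and `shift_jis` codecs are reasonable stand-ins for the encodings of those names in Figure 1. The sample character and the variable names are my own choices for illustration and do not come from the quoted sources.

```python
# The same bytes, read under two different legacy encodings.
raw = "中".encode("big5")   # bytes for one Traditional Chinese character in Big5

as_big5 = raw.decode("big5")                          # read with the right encoding
as_sjis = raw.decode("shift_jis", errors="replace")   # read with the wrong one

# The two readings differ: this is the garbled-text effect described above.
print(raw.hex(), repr(as_big5), repr(as_sjis))

# With Unicode, the character has a single code point, and a Unicode
# encoding form such as UTF-8 round-trips it without ambiguity.
round_trip = "中".encode("utf-8").decode("utf-8")
print(round_trip == "中")
```

The `errors="replace"` argument is only there so the sketch keeps running even if some byte turns out to be invalid in the wrong encoding.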

The Difference Between Character and Glyph

To understand Unicode's role better, let us first look at the difference between a character and a glyph: “The mark made on screen or paper, called a glyph, is a visual representation of the character. . . . Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. . . . Glyphs represent the shapes that characters can have when they are rendered or displayed.”[3]
Figure 2 (screenshot from [3])
As for what a font is: “A font represents an organized collection of glyphs in which the various glyph representations will share a common look or styling such that, when a string of characters is rendered together, the result is highly legible, conveys a particular artistic style and provides consistent inter-character alignment and spacing.”[4]
Returning to Unicode's role, Unicode deals with characters, not glyphs: “The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered. . . . The Unicode Standard does not specify the precise shape, size, or orientation of on-screen characters. . . . Glyph shape and methods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard.”[3]

Codespace and Code Point

“A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).”[2] “The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.”[3]
Back to Unicode. Unicode writes a code point as follows: “an individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.”[3] For example, U+0041, which appears in Figure 2, is the code point of LATIN CAPITAL LETTER A. As for Unicode's codespace: “In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.”[3] (In the original, the 10FFFF in this quotation carries a subscript 16; the vocus editor simply has no subscript feature.) 10FFFF is hexadecimal notation; in the more familiar decimal notation it is 1114111. Because the range starts at 0, it contains 1114111 + 1 = 1,114,112 code points, which matches the count in the quotation.
Figure 3
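The U+n notation and the size of the codespace can be checked with a few lines of Python. This is only a sketch. The helper `u_plus` and the constant `MAX_CODE_POINT` are hypothetical names I introduce for illustration, while `ord` and `unicodedata.name` are standard Python.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF   # upper end of the Unicode codespace, in hexadecimal


def u_plus(code_point: int) -> str:
    """Format an integer as U+n, padding to at least four uppercase hex digits."""
    return f"U+{code_point:04X}"


print(u_plus(ord("A")))        # U+0041
print(unicodedata.name("A"))   # LATIN CAPITAL LETTER A
print(u_plus(0x1), u_plus(0x12345), u_plus(0x102345))

print(MAX_CODE_POINT)          # 1114111 in decimal
print(MAX_CODE_POINT + 1)      # 1114112 code points, because the range starts at 0
```

The `04X` format pads to four digits only when the value is shorter, which mirrors the rule in the quotation that leading zeros appear only for code points with fewer than four hexadecimal digits.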

UTF-8, UTF-16, and UTF-32

As for UTF-8, which appears in Figure 1, it is what is called an encoding form: “Actual implementations in computer systems represent integers in specific code units of particular size—usually 8-bit (= byte), 16-bit, or 32-bit. In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. . . . All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; . . . Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data.”[3]
Figure 4 (screenshot from [3])
Figure 4 shows the code unit values of four characters under the different encoding forms. From left to right, the code points of the four characters are U+0041, U+03A9, U+8A9E, and U+10384. The code unit values in Figure 4 are written in hexadecimal. Converted to binary (see Figure 5 for the conversion between hexadecimal and binary), they confirm what the quotation above says: the code units of UTF-8, UTF-16, and UTF-32 are 8-bit, 16-bit, and 32-bit respectively. For example, hexadecimal 41 is 01000001 in binary, 8 bits in total.
Figure 5 (screenshot from [5])
This article does not go into how an encoding form turns a code point into code unit values; it only describes a few differences among UTF-8, UTF-16, and UTF-32. In Figure 4, “the UTF-32 line shows that each example character can be expressed with one 32-bit code unit. Those code units have the same values as the code point for the character. . . . In UTF-8, a character may be expressed with one, two, three, or four bytes, and the relationship between those byte values and the code point value is more complex. . . . The value of each UTF-32 code unit corresponds exactly to the Unicode code point value. This situation differs significantly from that for UTF-16 and especially UTF-8, where the code unit values often change unrecognizably from the code point value.”[3] Let us look at another example:
Figure 6 (screenshot from [3])
UTF-8, UTF-16, and UTF-32 each suit different situations. For example, given the property of UTF-32 described in the previous paragraph, it is easy to see why: “UTF-32 may be a preferred encoding form where memory or disk storage space for characters is not a particular concern, but where fixed-width, single code unit access to characters is desired.”[3]
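The code unit values for the four code points named above can be reproduced with Python's built-in codecs. This is a sketch under a few assumptions of mine: the helper `code_units` is a hypothetical name, and I use the big-endian codec names (`utf-16-be`, `utf-32-be`) purely so that each code unit prints in its natural reading order and no byte order mark is prepended.

```python
def code_units(text: str, form: str) -> list[str]:
    """Return the code units of `text` in the given encoding form as hex strings."""
    width = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[form]   # bytes per code unit
    # UTF-8 code units are single bytes, so there is no byte order to choose.
    codec = form if width == 1 else form + "-be"
    data = text.encode(codec)
    return [data[i:i + width].hex().upper() for i in range(0, len(data), width)]


for cp in (0x0041, 0x03A9, 0x8A9E, 0x10384):
    ch = chr(cp)
    print(f"U+{cp:04X}",
          code_units(ch, "utf-8"),
          code_units(ch, "utf-16"),
          code_units(ch, "utf-32"))

# Hexadecimal 41 written as 8 binary digits.
print(format(0x41, "08b"))   # 01000001
```

Each hex string is one code unit, so its length (2, 4, or 8 hex digits) reflects the 8-bit, 16-bit, or 32-bit unit size.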
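Two of the claims quoted from [3] can be checked directly: a UTF-32 code unit has the same value as the code point, and text can move between encoding forms without loss. The following Python sketch reuses the four code points discussed for Figure 4. The variable names are mine, and the big-endian codecs are again just a convenience for reading the values.

```python
samples = [chr(cp) for cp in (0x0041, 0x03A9, 0x8A9E, 0x10384)]

utf8_lengths = []
for ch in samples:
    # Interpret the four UTF-32 bytes as one integer: it is the code point itself.
    utf32_unit = int.from_bytes(ch.encode("utf-32-be"), "big")
    assert utf32_unit == ord(ch)

    # UTF-8 needs between one and four bytes, depending on the code point.
    utf8_lengths.append(len(ch.encode("utf-8")))
    print(f"U+{ord(ch):04X}: UTF-32 unit {utf32_unit:08X}, UTF-8 bytes {utf8_lengths[-1]}")

# Going UTF-8 -> UTF-16 -> UTF-32 and back to a string loses nothing.
text = "".join(samples)
utf8 = text.encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-be")
utf32 = utf16.decode("utf-16-be").encode("utf-32-be")
restored = utf32.decode("utf-32-be")
print(restored == text)
```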
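The trade-off in this quotation, more storage in exchange for fixed-width access, can be sketched in Python as follows. The function `nth_char_utf32` is a hypothetical helper of mine, and "character" here means a single code point. The sketch assumes big-endian UTF-32 data without a byte order mark.

```python
def nth_char_utf32(data: bytes, n: int) -> str:
    """Fetch code point number n from big-endian UTF-32 data by offset arithmetic."""
    # Every code unit is exactly 4 bytes, so no scanning is needed.
    return data[4 * n:4 * n + 4].decode("utf-32-be")


text = "A\u03A9\u8A9E\U00010384"
utf32 = text.encode("utf-32-be")
utf8 = text.encode("utf-8")

print(len(utf32), len(utf8))        # 16 bytes in UTF-32 versus 10 bytes in UTF-8
print(nth_char_utf32(utf32, 2))     # the third character, U+8A9E
```

Doing the same lookup on the UTF-8 bytes would require walking through the data from the start, because each earlier character may occupy one to four bytes.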

A Historical Note

When reading literature on character encodings, it pays to check when it was published. For example, earlier books (such as Standard C++ IOStreams and Locales by Angelika Langer and Klaus Kreft) state that Unicode is a 16-bit encoding. The Unicode website addresses this as follows: “The first version of Unicode was a 16-bit encoding, from 1991 to 1995, but starting with Unicode 2.0 (July, 1996), it has not been a 16-bit encoding. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.”[6]
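The "21-bit code space" and the "one or two 16-bit code units" in this quotation can be verified with a short Python sketch. The names `MAX_CODE_POINT` and `utf16_unit_count` are hypothetical, chosen only for this illustration.

```python
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to write the largest code point in binary.
print(MAX_CODE_POINT.bit_length())   # 21
# A 16-bit value can reach at most 0xFFFF, far short of 0x10FFFF.
print((0xFFFF).bit_length())         # 16


def utf16_unit_count(ch: str) -> int:
    """Count the 16-bit code units UTF-16 uses for one character."""
    return len(ch.encode("utf-16-be")) // 2


print(utf16_unit_count("\u8A9E"))       # 1: the code point fits in 16 bits
print(utf16_unit_count("\U00010384"))   # 2: the code point is above U+FFFF
```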

[2] Ishida R. Character encodings: Essential concepts. https://www.w3.org/International/articles/definitions-characters/
[3] Unicode Consortium. The Unicode Standard (Version 14.0).
[5] Stroustrup B. Programming: Principles and practice using C++ (2nd Edition).