Unicode in a nutshell

Disclaimer: This post is not and does not aim to be a comprehensive guide to Unicode. Its purpose is to give the reader a minimal working knowledge of Unicode quickly.

I think it's no secret that many developers still don't fully understand what Unicode is and how it works. They have only a vague idea, something like "Well, Unicode is a new encoding used to display any character, and it uses 2 bytes to store one character." Such an understanding of Unicode is, to put it mildly, not quite correct.

So, what is Unicode?

Unicode is a character encoding standard that allows the representation of characters from virtually all written languages (it also covers control characters, e.g., newline, escape, form feed, etc.).

The two main components of Unicode are:

A universal character set
Encodings (UTF-8, UTF-16, etc.), which define the representation of a character from the universal set in memory.

The Universal Character Set

There is a single universal character set, and it is independent of encoding. Currently, there are approximately 155,000 characters in the universal set, while Unicode as a whole has room for 1,114,112 numbers (U+0000 to U+10FFFF), i.e., about 1.1 million. This upper limit exists to maintain compatibility with the UTF-16 encoding, which cannot address anything beyond U+10FFFF.
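The 1.1 million figure can be checked with a little arithmetic. The sketch below assumes the usual division of the Unicode range into 17 "planes" of 65,536 numbers each (a term not used above); the variable names are purely illustrative.

```python
# Unicode's number space is commonly described as 17 planes of 65,536 values.
PLANES = 17
PLANE_SIZE = 0x10000  # 65,536

total_code_points = PLANES * PLANE_SIZE
max_code_point = total_code_points - 1

print(total_code_points)        # 1114112
print(f"U+{max_code_point:X}")  # U+10FFFF

# The same limit seen from UTF-16: the 2-byte range, plus everything that
# 1,024 x 1,024 combinations of 4-byte sequences can address.
assert 0x10000 + 0x400 * 0x400 == total_code_points

# Python enforces the same upper bound.
chr(max_code_point)  # fine
try:
    chr(max_code_point + 1)
except ValueError as exc:
    print("rejected:", exc)
```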

Each character in the universal character set is assigned a number called a code point, written as U+XXXX, where U stands for Unicode, and XXXX is the hexadecimal ordinal number of the character in the universal set (at least four digits, up to six). The numbers of the first 128 characters (U+0000 to U+007F) coincide with the numbers of these characters in the ASCII encoding, so, for example, the English letter A (ASCII code 0x41) has the number U+0041 in the universal set, and the space is U+0020.
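To make the U+XXXX notation concrete, here is a minimal Python sketch. The helper name code_point_label is made up for this post; ord() and chr() are Python built-ins that convert between a character and its number.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X}"

print(code_point_label("A"))  # U+0041
print(code_point_label(" "))  # U+0020
print(code_point_label("€"))  # U+20AC

# chr() goes the other way: from number to character.
print(chr(0x41))  # A

# The first 128 numbers match ASCII byte values one to one.
assert all(chr(i).encode("ascii") == bytes([i]) for i in range(128))
```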

You can view the characters that make up the universal set at the following links:

http://www.tamasoft.co.jp/en/general-info/unicode.html

http://www.utf8-chartable.de/

At the beginning of the universal set (with numbers below U+10000, i.e., U+0000 to U+FFFF, known as the Basic Multilingual Plane) are the most important and frequently used characters. Rarer characters, such as letters of ancient scripts and some mathematical and musical symbols, are located further along in the universal set. This arrangement of characters in the table is made so that the most frequently used characters can be encoded with fewer bytes.

Unicode Encodings

Now let's consider the second component of Unicode - encodings.

The main encodings in Unicode are UTF-8, UTF-16, and UTF-32. UTF-8 and UTF-16 are more widespread than UTF-32 because UTF-32 encodes each character with 4 bytes, which is rather inefficient.
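The size difference is easy to see by encoding the same text three ways. This is only an illustration with an arbitrary ASCII string; the "-le" codec variants are used so that Python does not prepend a byte order mark to the output.

```python
text = "Hello, world"  # 12 ASCII characters

sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
```

For text that is mostly non-ASCII the gap between UTF-8 and UTF-16 narrows or reverses, but UTF-32 always spends 4 bytes per character.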

UTF-8 encodes each character using 1, 2, 3, or 4 bytes. The original UTF-8 design allowed sequences of up to 6 bytes, but the current definition (RFC 3629) limits UTF-8 to 4 bytes, which is enough to cover every number up to U+10FFFF. You can see the table used for encoding characters from the universal set in UTF-8 via this link:

http://www.utf8-chartable.de/

Characters with numbers less than U+0080 (128) are encoded with one byte (the value of which equals the character number in the universal set), characters with numbers from U+0080 to U+07FF are encoded with 2 bytes, those with numbers from U+0800 to U+FFFF with 3 bytes, and those with numbers from U+10000 to U+10FFFF with 4 bytes. The number of bytes used to encode a character is determined by the leading bits of its first byte: 0xxxxxxx is a one-byte character, 110xxxxx starts a 2-byte sequence, 1110xxxx a 3-byte sequence, and 11110xxx a 4-byte sequence. Every following byte of a multi-byte sequence has the form 10xxxxxx, so a continuation byte can never be mistaken for the start of a character.
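The UTF-8 rules can be sketched in a few lines of Python. This is a simplified illustration, not a production encoder: the function names are invented for this post, and for brevity the code does not reject surrogates or every invalid lead byte. The bit patterns used are the standard UTF-8 ones: 0xxxxxxx, 110xxxxx, 1110xxxx and 11110xxx for the first byte, and 10xxxxxx for each following byte.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative only)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


def utf8_sequence_length(lead: int) -> int:
    """Read the sequence length from the leading bits of the first byte."""
    if lead >> 7 == 0b0:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte (possibly a continuation byte)")


print(utf8_encode(0x20AC).hex(" "))  # e2 82 ac
print(utf8_sequence_length(0xE2))    # 3
```

Comparing the result with Python's built-in encoder, e.g. chr(0x20AC).encode("utf-8"), is a quick way to check the hand-written version.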

UTF-16 encodes each character using 2 or 4 bytes. Most characters in everyday use fall within the range encoded with two bytes, while the remaining characters (those located beyond U+FFFF, such as letters of ancient scripts like Egyptian hieroglyphs, symbols used to denote musical notes, most emoji, etc.) are encoded with 4 bytes.

Characters with numbers from U+0000 to U+D7FF and from U+E000 to U+FFFF are encoded in UTF-16 using 2 bytes, and their encoded values are simply the numbers from the universal character table. The numbers from U+D800 to U+DFFF (the so-called surrogates) are not assigned to any character and are used only when encoding a character in 4 bytes. That is, if 4 bytes are needed to encode a character in UTF-16, the first 2 bytes hold a value from 0xD800 to 0xDBFF (the high surrogate), and the second 2 bytes hold a value from 0xDC00 to 0xDFFF (the low surrogate). Thus, if the first 2 bytes of a character are in the range from 0xD800 to 0xDBFF, it means the character is encoded with 4 bytes.
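The 4-byte form can be worked out by hand. The sketch below uses the standard UTF-16 arithmetic: subtract 0x10000 from the character number, then split the remaining 20 bits into two 10-bit halves. The function names are made up for illustration, and the annotations assume Python 3.9 or later.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a number above U+FFFF into two 16-bit UTF-16 values."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only numbers above U+FFFF need 4 bytes")
    offset = cp - 0x10000               # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Reverse the split above."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E is the musical G clef symbol.
high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E
```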

There are two varieties of UTF-16 - UTF-16LE (LE = Little Endian) and UTF-16BE (BE = Big Endian). UTF-16LE is the more widespread of the two in practice (it is what Windows uses, for example) and is often loosely called just UTF-16, although strictly speaking the plain "UTF-16" label does not fix a byte order and relies on a BOM (see below) or on convention. The difference between these encodings is that in UTF-16LE, the least significant byte comes first, while in UTF-16BE, the most significant byte comes first.
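The byte-order difference is just the order in which the two bytes of each 16-bit value are written out. A minimal sketch using Python's int.to_bytes (the separator argument of bytes.hex needs Python 3.8 or later):

```python
code_unit = 0x20AC  # the euro sign, U+20AC, fits in one 16-bit value

big = code_unit.to_bytes(2, "big")        # most significant byte first
little = code_unit.to_bytes(2, "little")  # least significant byte first

print(big.hex(" "))     # 20 ac
print(little.hex(" "))  # ac 20

# Python's explicit-endianness codecs produce the same bytes.
assert big == "€".encode("utf-16-be")
assert little == "€".encode("utf-16-le")
```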

Byte Order Mark

Finally, let's talk about the BOM. A BOM (Byte Order Mark) is a special sequence of bytes at the beginning of a text file: the character U+FEFF encoded in the file's own encoding. It tells a program which Unicode encoding is used and, for UTF-16 and UTF-32, which byte order (endianness) to apply when interpreting the bytes in the file. For UTF-8, which has no byte-order variants, the BOM is optional and serves only as an encoding signature.

Example BOM sequences:

UTF-8: 0xEF 0xBB 0xBF
UTF-16 Big Endian: 0xFE 0xFF
UTF-16 Little Endian: 0xFF 0xFE
UTF-32 Big Endian: 0x00 0x00 0xFE 0xFF
UTF-32 Little Endian: 0xFF 0xFE 0x00 0x00
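A program can use these sequences to guess a file's encoding. The following is a rough sketch, not a complete detector: detect_bom is a hypothetical helper, while the BOM constants come from Python's standard codecs module. The UTF-32 entries are checked first because the UTF-32 Little Endian BOM begins with the same two bytes as the UTF-16 Little Endian one.

```python
import codecs

# Longer BOMs first, so that FF FE 00 00 is not mistaken for UTF-16LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfhello"))  # ('utf-8', 3)
print(detect_bom(b"\xff\xfeA\x00"))      # ('utf-16-le', 2)
print(detect_bom(b"hello"))              # (None, 0)
```

One caveat of this approach: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE, and a file without a BOM tells this function nothing at all.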

Examples:

Character: A (U+0041)

UTF-8: 0x41
UTF-16BE: 0x00 0x41
UTF-16LE: 0x41 0x00
UTF-32BE: 0x00 0x00 0x00 0x41
UTF-32LE: 0x41 0x00 0x00 0x00

Character: € (Euro Sign, U+20AC)

UTF-8: 0xE2 0x82 0xAC. Explanation: The Euro sign is encoded with three bytes because it is beyond the ASCII range and falls within U+0800 to U+FFFF.
UTF-16BE: 0x20 0xAC. Explanation: The Euro sign is encoded with two bytes since it falls within the Basic Multilingual Plane.
UTF-16LE: 0xAC 0x20. Explanation: The byte order is reversed compared to Big Endian.
UTF-32BE: 0x00 0x00 0x20 0xAC. Explanation: The Euro sign is directly represented as its Unicode code point in four bytes.
UTF-32LE: 0xAC 0x20 0x00 0x00. Explanation: The byte order is reversed compared to Big Endian.

Beginning of a file with the euro sign as the first character, encoded in UTF-16BE with BOM:

0xFE 0xFF 0x20 0xAC
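The byte sequences in the examples above can be reproduced with Python's built-in codecs. This is only a verification sketch; the variable names are arbitrary.

```python
import codecs

encodings = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

# Print the encoded bytes of both example characters in every encoding.
for ch in ("A", "€"):
    for enc in encodings:
        print(f"U+{ord(ch):04X}  {enc:<9}  {ch.encode(enc).hex(' ')}")

# A UTF-16BE file that starts with a BOM followed by the euro sign.
file_start = codecs.BOM_UTF16_BE + "€".encode("utf-16-be")
print(file_start.hex(" "))  # fe ff 20 ac

# The generic "utf-16" codec reads the BOM, picks the byte order, and strips it.
assert file_start.decode("utf-16") == "€"
```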

That's all for now. I hope this information has been useful to you and helped organize your knowledge about Unicode. Generally speaking, unless you are going to write your own converters from one encoding to another, this information should be sufficient in most cases when dealing with Unicode. Thank you for your attention!

Nightcoder's blog