About
Unicode is a global character set that allows multilingual text to be displayed in a single application.
Unicode is the counterpart of the Universal Coded Character Set (UCS) defined in ISO 10646; the two standards share the same code points.
Unicode makes it possible to develop a single multilingual application and deploy it worldwide with only one character set.
It is a multi-octet character set, meaning that a character can be stored on more than one octet (byte).
It can therefore represent a much larger variety of characters than the Roman alphabet alone: Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.
Unicode is the universal character set that supports:
- most of the currently spoken languages of the world.
- and many historical scripts (alphabets).
Range
Historically, the designers of Unicode underestimated the number of code points that would be needed and originally thought that Unicode would require no more than <math>2^{16}</math> code points. The original 16-bit encoding, UCS-2, was born from this assumption.
The standard later expanded to its current range of over <math>2^{20}</math> code points (1,114,112 in total). This larger range is organized into 17 subranges (planes) of <math>2^{16}</math> code points each.
- The first of these, known as the Basic Multilingual Plane (or BMP), consists of the original <math>2^{16}</math> code points.
- The additional 16 ranges are known as the supplementary planes.
<math>2^{20} = 1,048,576</math>
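The arithmetic behind the 17 planes can be checked with a short sketch. The names `PLANE_SIZE`, `PLANE_COUNT` and `plane_of` are illustrative and do not come from any standard API.

```python
PLANE_SIZE = 2 ** 16                             # code points per plane
PLANE_COUNT = 17                                 # the BMP plus 16 supplementary planes
MAX_CODE_POINT = PLANE_COUNT * PLANE_SIZE - 1    # 0x10FFFF


def plane_of(code_point):
    """Return the plane number of a code point (0 is the BMP)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    # Each plane holds 2^16 code points, so the plane is the value above the low 16 bits.
    return code_point >> 16


print(PLANE_COUNT * PLANE_SIZE)   # total size of the code space
print(hex(MAX_CODE_POINT))        # last code point
print(plane_of(ord("A")))         # a BMP character
print(plane_of(0x1F600))          # a supplementary-plane character
```

Plane 0 is the Basic Multilingual Plane; any other result designates one of the supplementary planes.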
Encoding | Code unit size (bits) | Addressable code points | Bytes per code point | First code point | Last code point |
---|---|---|---|---|---|
UTF-8 | 8 | <math>17 \times 2^{16}</math> | 1 to 4 | 0x0 | 0x10FFFF (1,114,111) |
UCS-2 | 16 | <math>2^{16}</math> | 2 | 0x0 | 0xFFFF (65,535) |
UTF-16 | 16 | <math>17 \times 2^{16}</math> | 2 or 4 | 0x0 | 0x10FFFF (1,114,111) |
UTF-32 | 32 | <math>17 \times 2^{16}</math> | 4 | 0x0 | 0x10FFFF (1,114,111) |
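One way to see how many bytes a given code point occupies in each encoding is to encode a few sample characters and count the bytes. This is a minimal sketch using Python's built-in codecs; the helper name `encoded_sizes` is hypothetical.

```python
def encoded_sizes(ch):
    """Return the number of bytes used by each encoding for one character."""
    # The explicit little-endian variants are used so that no byte order mark is counted.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# U+0041, U+00E9, U+20AC (all in the BMP) and U+1F600 (supplementary plane)
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-8 and UTF-16 are variable-width, so the byte count grows with the code point, while UTF-32 always uses four bytes.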
Structure
It specifies:
Form
Unicode defines three encoding forms:
- UTF-8,
- UTF-16,
- and UTF-32;
Scheme
Unicode allows several different binary encoding schemes for code points.
The most popular standard encodings of Unicode are:
- UTF-8,
- UTF-16,
- and UTF-32,
The rest being:
- UTF-16BE,
- UTF-16LE,
- UTF-32BE,
- and UTF-32LE.
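The difference between the big-endian (BE) and little-endian (LE) schemes is only the order of the bytes inside each code unit. The sketch below encodes the euro sign (U+20AC) with Python's built-in codecs; the helper name `encode_with_schemes` is hypothetical.

```python
def encode_with_schemes(text):
    """Encode text with several encoding schemes and return the bytes as hex strings."""
    schemes = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
    return {scheme: text.encode(scheme).hex() for scheme in schemes}


euro = "\u20ac"  # EURO SIGN, U+20AC
for scheme, hex_bytes in encode_with_schemes(euro).items():
    print("%-10s %s" % (scheme, hex_bytes))

# Without an explicit byte order, Python's "utf-16" codec prepends a
# byte order mark (BOM) so that a reader can detect the order used.
bom_prefixed = euro.encode("utf-16")
print(bom_prefixed.hex())
```

UTF-8 has no byte order variants because its code units are single bytes.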
Notation/Encoding
Hexadecimal
Unicode code points are denoted using the notation of the Unicode Standard: U+hhhh, where “hhhh” is a sequence of at least four and at most six hexadecimal digits, such as “U+1234” or “U+10FFFD”.
The U+ prefix is optional when the context makes it clear that a code point is meant.
In XML or HTML, these could be expressed as the numeric character references “&#x1234;” or “&#x10FFFD;”.
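Both notations can be produced from the integer value of a code point. The following is a small sketch; the function names `u_notation`, `xml_char_ref` and `parse_u_notation` are made up for illustration, while `html.unescape` is part of the Python standard library.

```python
import html


def u_notation(code_point):
    """Format a code point as U+hhhh with at least four hexadecimal digits."""
    return "U+%04X" % code_point


def xml_char_ref(code_point):
    """Format a code point as a hexadecimal XML/HTML numeric character reference."""
    return "&#x%X;" % code_point


def parse_u_notation(text):
    """Parse 'U+1234' (the U+ prefix is optional) into an integer code point."""
    if text.upper().startswith("U+"):
        text = text[2:]
    return int(text, 16)


print(u_notation(0x1234), xml_char_ref(0x1234))
print(u_notation(0x10FFFD), xml_char_ref(0x10FFFD))

# Resolve a numeric character reference back to the character it denotes.
print(html.unescape("&#x1234;") == chr(parse_u_notation("U+1234")))
```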
Binary
From the hexadecimal form, you can always derive the binary form: each hexadecimal digit corresponds to exactly four bits.
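As a sketch of that conversion, each hexadecimal digit can be expanded into its group of four bits. The helper name `hex_to_binary` is hypothetical.

```python
def hex_to_binary(hex_digits):
    """Expand each hexadecimal digit into four bits, separated by spaces."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_digits)


print(hex_to_binary("1234"))    # the digits of U+1234
print(hex_to_binary("10FFFD"))  # the digits of U+10FFFD
```

Note that this is the binary form of the code point number itself, not of its bytes in a given encoding such as UTF-8.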
Data
The UnicodeData file is part of the Unicode Character Database maintained by the Unicode Consortium. This file specifies various properties including name and general category for every defined Unicode code point or character range.
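Many languages ship a copy of these properties. As an illustration, Python's standard `unicodedata` module exposes the name and general category of a character; the values it returns reflect the Unicode Character Database version bundled with the interpreter, which may differ from the latest release. The helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return (code point notation, name, general category) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))


print(unicodedata.unidata_version)   # version of the bundled character database
for ch in ("A", "1", "\u20ac"):
    print(describe(ch))
```

The general category is a two-letter code, for example `Lu` for an uppercase letter or `Nd` for a decimal digit.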
The file and its description are available from the Unicode Consortium at: http://www.unicode.org
Specifically:
Example: scripts
Detection
Unicode and Computer Language
Unicode is the native encoding of many technologies, including:
- Java,
- XML,
- XHTML,
- ECMAScript (JavaScript),
- and LDAP.
To ASCII Punycode
Unicode characters may be translated into the ASCII character set via the Punycode encoding, which is used notably for internationalized domain names.
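A minimal sketch of the translation, using the `punycode` and `idna` codecs that ship with Python. The `idna` codec adds the `xn--` prefix used in domain names and follows an older IDNA revision, so it is shown here only as an illustration.

```python
word = "b\u00fccher"  # "bücher", with U+00FC LATIN SMALL LETTER U WITH DIAERESIS

# Raw Punycode: the ASCII letters come first, then a delimiter and the encoded rest.
ascii_form = word.encode("punycode")
print(ascii_form)

# The translation is reversible.
print(ascii_form.decode("punycode") == word)

# Domain-name form: each label is Punycode-encoded and prefixed with "xn--".
domain_form = (word + ".example").encode("idna")
print(domain_form)
```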
Invalid Sequence to Replacement Character
When printing UTF-8 data, if an invalid byte sequence is found, it is generally replaced with the � character (i.e. U+FFFD REPLACEMENT CHARACTER).
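This behavior can be reproduced with a decoder that is asked to replace errors instead of failing. The sketch below uses Python's `bytes.decode` with `errors="replace"`; the sample bytes are arbitrary.

```python
data = b"abc\xffdef"  # 0xFF can never appear in valid UTF-8

# Strict decoding (the default) rejects the invalid byte.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding substitutes U+FFFD for the invalid byte.
lenient = data.decode("utf-8", errors="replace")
print(strict_failed, lenient)
```

Replacing rather than failing keeps the rest of the text readable, at the cost of losing the original invalid bytes.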
Documentation / Reference
- UCS is specified in ISO 10646