Unicode is a global character set that allows multilingual text to be displayed in a single application.
Unicode is not an acronym. It is developed in step with the Universal Coded Character Set (UCS) defined by ISO/IEC 10646, and the two standards share the same code points.
Unicode makes it possible to develop a single multilingual application and deploy it worldwide with only one character set.
It is a multi-octet character set, meaning that a character can be stored in more than one octet.
It can therefore represent a much larger variety of characters than the Roman alphabet alone: Greek, Russian (Cyrillic), mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.
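To make "more than one octet" concrete, here is a minimal Python sketch that counts the octets each character occupies. It assumes UTF-8 as the encoding (the text above does not name one), and the helper name `utf8_length` and the sample characters are chosen only for illustration.

```python
# Sample characters: Latin, Greek, Cyrillic, a mathematical symbol, a kanji.
SAMPLES = ["A", "Ω", "Я", "∑", "漢"]


def utf8_length(ch: str) -> int:
    """Return the number of octets used to store one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in SAMPLES:
    # ord() gives the code point; the encoded length is the octet count.
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} octet(s)")
```

Only the plain Latin letter fits in a single octet here; the other samples need two or three.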
Unicode is the universal character set that supports:
Historically, the designers of Unicode underestimated the number of code points required and originally thought that Unicode would need no more than <math>2^{16}</math> code points. This assumption produced the original 16-bit encoding, UCS-2.
The standard later expanded to its current range of over <math>2^{20}</math> code points. The enlarged range is organized into 17 subranges (called planes) of <math>2^{16}</math> code points each.
<math>2^{20} = 1,048,576</math>
<math>17 \times 2^{16} = 1,114,112</math>
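The arithmetic behind these figures can be checked in a few lines. This is only a sketch; the constant names are illustrative and not taken from the standard.

```python
import sys

PLANES = 17            # number of subranges (planes)
PLANE_SIZE = 2 ** 16   # code points per plane

total_code_points = PLANES * PLANE_SIZE
last_code_point = total_code_points - 1

print(f"{total_code_points:,}")    # size of the whole code space
print(f"0x{last_code_point:X}")    # highest code point
print(total_code_points > 2 ** 20) # the range exceeds 2^20

# Python exposes the highest code point it supports as sys.maxunicode.
print(sys.maxunicode == last_code_point)
```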
Encoding | Code unit size (bits) | Code space size | Bytes per code point | First code point | Last code point |
---|---|---|---|---|---|
UTF-8 | 8 | <math>17 \times 2^{16}</math> | 1 to 4 | 0x0 | 0x10FFFF (1,114,111 decimal) |
UCS-2 | 16 | <math>2^{16}</math> | 2 | 0x0 | 0xFFFF (65,535) |
UTF-16 | 16 | <math>17 \times 2^{16}</math> | 2 or 4 | 0x0 | 0x10FFFF (1,114,111 decimal) |
UTF-32 | 32 | <math>17 \times 2^{16}</math> | 4 | 0x0 | 0x10FFFF (1,114,111 decimal) |
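As a rough illustration of how the encodings differ in size, the sketch below encodes the same characters with Python's built-in codecs. The big-endian variants are used so that no byte order mark is added to the output; the function name `encoded_sizes` is hypothetical.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(ch: str) -> dict:
    """Map each encoding name to the number of bytes it uses for one character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# One character from the first plane, one beyond 0xFFFF.
for ch in ("A", "漢", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

A code point above 0xFFFF takes four bytes in all three encodings. In UTF-16 that is a pair of 16-bit code units (a surrogate pair), which UCS-2 cannot express.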
It specifies:
Unicode allows several different binary encoding schemes for code points.
The most popular standard encodings of Unicode are:
The rest being:
Unicode code points are denoted as U+hhhh, where “hhhh” is a sequence of at least four, and at most six hexadecimal digits.
Characters are denoted using the notation used in the Unicode Standard, that is, an optional U+ followed by their hexadecimal number, using at least 4 digits, such as “U+1234” or “U+10FFFD”.
In XML or HTML this could be expressed as “&#x1234;” or “&#x10FFFD;”.
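Both notations can be produced from a character with simple string formatting. The sketch below is one way to do it; the function names `u_notation` and `char_ref` are made up for this example, while `html.unescape` is the standard library function that turns a character reference back into a character.

```python
import html


def u_notation(ch: str) -> str:
    """Format a character as U+hhhh, zero-padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def char_ref(ch: str) -> str:
    """Format a character as a hexadecimal XML/HTML numeric character reference."""
    return f"&#x{ord(ch):X};"


for ch in ("A", "\u1234", "\U0010FFFD"):
    print(u_notation(ch), char_ref(ch))

# Round trip: a numeric character reference decodes back to the character.
print(html.unescape("&#x1234;") == "\u1234")
```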
From the hexadecimal form, you can always derive the binary form, because each hexadecimal digit corresponds to exactly four bits.
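A minimal sketch of that conversion follows. It pads the result to four bits per hexadecimal digit so the digit boundaries stay visible; the function name `code_point_to_binary` is illustrative only.

```python
def code_point_to_binary(notation: str) -> str:
    """Convert 'U+hhhh' (or bare hex digits) to a binary string.

    The result keeps four bits for every hexadecimal digit.
    """
    digits = notation.upper()
    if digits.startswith("U+"):
        digits = digits[2:]          # drop the optional U+ prefix
    value = int(digits, 16)          # parse the hexadecimal digits
    return format(value, f"0{4 * len(digits)}b")


print(code_point_to_binary("U+1234"))    # 0001001000110100
print(code_point_to_binary("U+10FFFD"))  # 000100001111111111111101
```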
The UnicodeData file is part of the Unicode Character Database maintained by the Unicode Consortium. This file specifies various properties including name and general category for every defined Unicode code point or character range.
The file and its description are available from the Unicode Consortium at: http://www.unicode.org
Specifically:
Example: scripts
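The name and general category stored in the UnicodeData file can be read either from the file itself or through a library that embeds it. The sketch below parses one semicolon-separated record (the entry for U+0041, copied as a sample) and compares it with Python's `unicodedata` module, which ships its own copy of the database. The helper `parse_unicode_data_line` is hypothetical and only looks at the first three fields.

```python
import unicodedata

# Sample record for U+0041 in the UnicodeData.txt format.
SAMPLE_LINE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"


def parse_unicode_data_line(line: str):
    """Return (code point, name, general category) from one record."""
    fields = line.rstrip("\n").split(";")
    return int(fields[0], 16), fields[1], fields[2]


code_point, name, category = parse_unicode_data_line(SAMPLE_LINE)
ch = chr(code_point)

print(f"U+{code_point:04X}", name, category)
# The same properties as reported by the standard library.
print(unicodedata.name(ch), unicodedata.category(ch))
```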
Unicode is the native encoding of many technologies, including:
Unicode characters may be translated into the ASCII character set via the Punycode encoding, which is used for internationalized domain names.
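As a small illustration, Python's standard library includes a `punycode` codec, and an `idna` codec that adds the `xn--` prefix used in domain names. The label "bücher" and the domain `bücher.example` are sample inputs chosen for this sketch.

```python
label = "bücher"

# Raw Punycode: the ASCII letters come first, then an encoded suffix.
encoded = label.encode("punycode")
print(encoded)                       # b'bcher-kva'

# Decoding restores the original Unicode string.
print(encoded.decode("punycode") == label)

# The idna codec applies Punycode per domain label and adds the xn-- prefix.
print("bücher.example".encode("idna"))
```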
When decoding UTF-8 bytes for display, an invalid sequence is generally replaced with the � character (i.e. U+FFFD REPLACEMENT CHARACTER).
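This behaviour can be reproduced with Python's decoder, shown here as one example of a decoder that works this way. The sketch truncates a valid UTF-8 sequence to create an invalid one, then decodes it with `errors="replace"`; the variable names are illustrative.

```python
valid = "café".encode("utf-8")   # "é" is stored as the two bytes C3 A9
truncated = valid[:-1]           # drop the last byte: the sequence is now invalid

# Strict decoding (the default) refuses the invalid sequence.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the invalid sequence becomes U+FFFD.
decoded = truncated.decode("utf-8", errors="replace")

print(strict_failed)
print(decoded)
print(decoded[-1] == "\ufffd")
```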