Text - Unicode (UTF) Encoding

1 - About

Unicode is a global character set that allows multilingual text to be displayed in a single application. This makes it possible to develop a single multilingual application and deploy it worldwide. It can represent a much larger variety of characters than the Roman alphabet alone (Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.).

Unicode is the universal character set that supports most of the currently spoken languages of the world. It also supports many historical scripts (alphabets). Unicode is the native encoding of many technologies, including Java, XML, XHTML, ECMAScript, and LDAP.

3 - Range

Historically, the designers of Unicode underestimated the number of code points that would be needed and originally thought that <math>2^{16}</math> would be enough. This gave rise to the original fixed-width 16-bit encoding, UCS-2.

The standard was later expanded to its current range of over <math>2^{20}</math> code points (U+0000 to U+10FFFF). This larger range is organized into 17 subranges (planes) of <math>2^{16}</math> code points each.

  • The first of these, known as the Basic Multilingual Plane (or BMP), consists of the original <math>2^{16}</math> code points.
  • The additional 16 ranges are known as the supplementary planes.

<math>17 \times 2^{16} = 1,114,112</math> code points in total (for comparison, <math>2^{20} = 1,048,576</math>)
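
The plane arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the names `PLANE_SIZE`, `PLANE_COUNT` and `plane_of` are made up for this example and are not part of any standard API.

```python
PLANE_SIZE = 2 ** 16   # code points per plane
PLANE_COUNT = 17       # the BMP plus 16 supplementary planes

def plane_of(code_point: int) -> int:
    """Return the plane index (0 = BMP) of a code point (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    # Each plane spans 2^16 code points, so the plane is the value above bit 16.
    return code_point >> 16

total = PLANE_SIZE * PLANE_COUNT
print(total)               # 1114112
print(hex(total - 1))      # 0x10ffff, the last code point
print(plane_of(ord("A")))  # 0 (Basic Multilingual Plane)
print(plane_of(0x1F600))   # 1 (a supplementary plane)
```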

Encoding | Code unit size (bits) | Bytes per code point | First code point | Last code point
UTF-8 | 8 | 1 to 4 | 0x0 | 0x10FFFF (1,114,111 decimal)
UCS-2 | 16 | 2 | 0x0 | 0xFFFF (65,535 decimal)
UTF-16 | 16 | 2 or 4 | 0x0 | 0x10FFFF (1,114,111 decimal)
UTF-32 | 32 | 4 | 0x0 | 0x10FFFF (1,114,111 decimal)
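
To see how many bytes each encoding actually spends on a code point, one can encode a few sample characters with Python's built-in codecs. The helper name `encoded_sizes` and the choice of sample characters are assumptions made for this sketch. Note that UTF-8 and UTF-16 are variable-width, while UTF-32 always uses 4 bytes.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the byte length of one character in three encodings (hypothetical helper)."""
    # The "-be" variants are used so that no byte order mark is prepended.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# U+0041, U+00E9, U+20AC (all in the BMP) and U+1F600 (supplementary plane)
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Characters outside the BMP, such as U+1F600, need 4 bytes in all three encodings.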

4 - Scheme

Unicode allows several different binary encoding schemes for code points. The most popular standard encodings of Unicode are:

  • UTF-8,
  • UTF-16BE,
  • UTF-16LE,
  • UTF-32BE,
  • and UTF-32LE.
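
The BE (big-endian) and LE (little-endian) variants differ only in the order of the bytes inside each code unit. A small sketch with Python's built-in codecs, using the euro sign (U+20AC) as an arbitrary sample; the helper name `hex_bytes` is made up for this example:

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Return the encoded bytes of a character as a hex string (hypothetical helper)."""
    return ch.encode(encoding).hex()

euro = "\u20ac"  # U+20AC EURO SIGN

print(hex_bytes(euro, "utf-16-be"))  # 20ac
print(hex_bytes(euro, "utf-16-le"))  # ac20
print(hex_bytes(euro, "utf-32-be"))  # 000020ac
print(hex_bytes(euro, "utf-32-le"))  # ac200000
```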

5 - Notation/Encoding

5.1 - Hexadecimal

Unicode code points are denoted as U+hhhh, where “hhhh” is a sequence of at least four, and at most six hexadecimal digits.

Characters are denoted using the notation used in the Unicode Standard, that is, an optional U+ followed by their hexadecimal number, using at least 4 digits, such as “U+1234” or “U+10FFFD”.

In XML or HTML this could be expressed as “&#x1234;” or “&#x10FFFD;”.
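
Both notations can be produced from a character with plain string formatting. The following is a minimal sketch; the function names `u_notation` and `xml_char_ref` are invented for this example.

```python
def u_notation(ch: str) -> str:
    """Format a character as U+hhhh, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"

def xml_char_ref(ch: str) -> str:
    """Format a character as an XML/HTML hexadecimal character reference."""
    return f"&#x{ord(ch):X};"

print(u_notation("A"))             # U+0041
print(u_notation("\u1234"))        # U+1234
print(u_notation("\U0010FFFD"))    # U+10FFFD
print(xml_char_ref("\u1234"))      # &#x1234;
print(xml_char_ref("\U0010FFFD"))  # &#x10FFFD;

# Going the other way: from the hexadecimal number back to the character.
print(chr(int("1234", 16)) == "\u1234")  # True
```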

5.2 - Binary

From the hexadecimal form, you can always derive the binary form: each hexadecimal digit corresponds to exactly four bits.
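
As a sketch of that conversion, each hexadecimal digit can be expanded to its four bits. The helper name `hex_to_binary` is an assumption of this example, and the output shows the bits of the code point number itself, not the bytes of any particular encoding.

```python
def hex_to_binary(hex_digits: str) -> str:
    """Expand each hexadecimal digit into its 4-bit binary form (hypothetical helper)."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_digits)

print(hex_to_binary("1234"))  # 0001 0010 0011 0100
print(hex_to_binary("20AC"))  # 0010 0000 1010 1100
```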

6 - Data

The UnicodeData file is part of the Unicode Character Database maintained by the Unicode Consortium. This file specifies various properties including name and general category for every defined Unicode code point or character range.

The file and its description are available from the Unicode Consortium at: http://www.unicode.org
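
Many languages ship these properties with their standard library, so the file rarely has to be parsed by hand. As an illustration, Python's standard `unicodedata` module exposes the name and general category of a character; the sample characters and the helper name `describe` are arbitrary choices for this sketch.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (name, general category) for a character (hypothetical helper)."""
    return unicodedata.name(ch), unicodedata.category(ch)

print(describe("A"))       # ('LATIN CAPITAL LETTER A', 'Lu')
print(describe("\u20ac"))  # ('EURO SIGN', 'Sc')

# Version of the Unicode Character Database bundled with this Python build
print(unicodedata.unidata_version)
```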


Example: scripts

7 - Detection

8 - Documentation / Reference
