What is Unicode / Universal Coded Character Set (UCS)?

About

Unicode is a global character set that allows multilingual text to be displayed in a single application.

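Unicode is aligned with the Universal Coded Character Set (UCS), the character set defined by ISO/IEC 10646: the two standards share the same code points.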

Unicode enables the development of a single multilingual application and its worldwide deployment with only one character set.

It's a multi-octet character set, meaning that a character can be stored in more than one octet.

Therefore, it can represent a much larger variety of characters beyond the Latin alphabet (Greek, Cyrillic, mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.)
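As a small illustration of the multi-octet idea, the Python sketch below counts how many octets a few characters occupy once encoded. It assumes UTF-8 as the storage encoding, and the sample characters and the name utf8_lengths are chosen here for illustration only.

```python
# Sample characters from different scripts, written as escapes so the
# source file itself stays pure ASCII.
samples = {
    "A": "LATIN CAPITAL LETTER A",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE",
    "\u03a9": "GREEK CAPITAL LETTER OMEGA",
    "\u042f": "CYRILLIC CAPITAL LETTER YA",
    "\u6f22": "CJK ideograph (kanji)",
    "\U0001F600": "GRINNING FACE emoji",
}

# Number of octets each character needs when stored as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_lengths[ch]} octet(s)")
```

Only the ASCII letter fits in one octet here; the other characters need two, three, or four.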

Unicode is the universal character set that supports:

  • most of the currently spoken languages of the world.
  • and many historical scripts (alphabets).

Range

Historically, the designers of Unicode underestimated the number of code points required: they originally thought that Unicode would need no more than <math>2^{16}</math> code points. This is how the original 16-bit encoding, UCS-2, was born.

The standard has since expanded to its current range of just over <math>2^{20}</math> code points (1,114,112 in total). This larger range is organized into 17 subranges (planes) of <math>2^{16}</math> code points each.

  • The first of these, known as the Basic Multilingual Plane (or BMP), consists of the original <math>2^{16}</math> code points.
  • The additional 16 ranges are known as the supplementary planes.

<math>2^{20} = 1,048,576</math>
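The arithmetic behind the 17 subranges can be checked with a few lines of Python. This is a minimal sketch: the constant names and the helper plane_of are hypothetical names introduced for this example.

```python
PLANE_SIZE = 2 ** 16   # code points per plane
PLANE_COUNT = 17       # the BMP plus 16 supplementary planes

total_code_points = PLANE_COUNT * PLANE_SIZE   # 1,114,112
last_code_point = total_code_points - 1        # 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the plane index: 0 is the BMP, 1 to 16 are supplementary."""
    if not 0 <= code_point <= last_code_point:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 2**16 code points, so the plane is the value of the
    # bits above the low 16 bits.
    return code_point >> 16
```

For example, plane_of(0x41) is 0 (the BMP) and plane_of(0x1F600) is 1 (the first supplementary plane).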

Encoding | Code unit (bits) | Bytes per code point | First code point | Last code point
UTF-8 | 8 | 1 to 4 | 0x0 | 0x10FFFF (1,114,111)
UCS-2 | 16 | 2 | 0x0 | 0xFFFF (65,535)
UTF-16 | 16 | 2 or 4 | 0x0 | 0x10FFFF (1,114,111)
UTF-32 | 32 | 4 | 0x0 | 0x10FFFF (1,114,111)
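To make the sizes concrete, the sketch below measures how many bytes one code point takes in UTF-8, UTF-16 and UTF-32, and shows how UTF-16 reaches code points above 0xFFFF with a pair of 16-bit surrogates. The helper names encoded_sizes and surrogate_pair are hypothetical; the big-endian codec variants are used only so that no byte order mark is added to the output.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the encoded length in bytes of one character per encoding."""
    return {
        name: len(ch.encode(name))
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }


def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


bmp_sizes = encoded_sizes("A")              # a BMP character
emoji_sizes = encoded_sizes("\U0001F600")   # a supplementary character
```

With these definitions, surrogate_pair(0x1F600) yields (0xD83D, 0xDE00), which matches the bytes that Python's own utf-16-be codec produces for that character.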

Structure

It specifies:

Form

Scheme

Unicode allows several different binary encoding schemes for code points.

The most popular standard encodings of Unicode are UTF-8, UTF-16, and UTF-32.

The remaining encoding schemes state the byte order explicitly:

  • UTF-16BE,
  • UTF-16LE,
  • UTF-32BE,
  • and UTF-32LE.
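The BE (big-endian) and LE (little-endian) suffixes only change the order of the bytes inside each code unit. A minimal Python sketch, using the codec names that Python's standard library gives these schemes (the variable names are illustrative):

```python
text = "A"  # U+0041

schemes = ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

# Hexadecimal dump of the encoded bytes for each scheme.
encoded = {name: text.encode(name).hex() for name in schemes}

for name, hex_bytes in encoded.items():
    print(f"{name}: {hex_bytes}")
```

The same code point U+0041 is stored as 00 41 in UTF-16BE and as 41 00 in UTF-16LE; the UTF-32 variants pad it to four bytes in the corresponding order.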

Notation/Encoding

Hexadecimal

Unicode code points are denoted as U+hhhh, where “hhhh” is a sequence of at least four, and at most six hexadecimal digits.

Characters are denoted using the notation used in the Unicode Standard, that is, an optional U+ followed by their hexadecimal number, using at least 4 digits, such as “U+1234” or “U+10FFFD”.

In XML or HTML this could be expressed as “&#x1234;” or “&#x10FFFD;”.
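These notations are easy to produce programmatically. The following Python sketch formats a character in the U+hhhh notation and as a hexadecimal character reference, and parses the notation back. The function names u_plus, html_reference and from_u_plus are hypothetical helpers written for this example.

```python
def u_plus(ch: str) -> str:
    """Format one character as U+hhhh, using at least four hex digits."""
    return f"U+{ord(ch):04X}"


def html_reference(ch: str) -> str:
    """Format one character as an XML/HTML hexadecimal character reference."""
    return f"&#x{ord(ch):X};"


def from_u_plus(notation: str) -> str:
    """Parse 'U+1234' (the 'U+' prefix is optional) back into a character."""
    digits = notation[2:] if notation.upper().startswith("U+") else notation
    return chr(int(digits, 16))
```

For instance, u_plus("A") gives "U+0041", and html_reference(from_u_plus("U+10FFFD")) gives "&#x10FFFD;".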

Binary

From the hexadecimal form, you can always derive the binary form: each hexadecimal digit corresponds to four bits.
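A short Python sketch of that conversion, where hex_to_binary is a hypothetical helper name and the input is assumed to be the bare hexadecimal digits without the U+ prefix:

```python
def hex_to_binary(hex_digits: str) -> str:
    """Convert hexadecimal digits to a bit string, four bits per digit."""
    width = 4 * len(hex_digits)
    # Zero-pad so that leading zero bits of the first digit are kept.
    return format(int(hex_digits, 16), f"0{width}b")


print(hex_to_binary("1234"))  # 0001001000110100
```

Reading the result in groups of four bits (0001 0010 0011 0100) gives back the digits 1, 2, 3 and 4.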

Data

The UnicodeData file is part of the Unicode Character Database maintained by the Unicode Consortium. This file specifies various properties including name and general category for every defined Unicode code point or character range.

The file and its description are available from the Unicode Consortium at: http://www.unicode.org
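Many languages ship these properties in their standard library, so the file rarely has to be parsed by hand. As one example, Python's unicodedata module exposes the name and general category of a character; which version of the Unicode Character Database it reflects depends on the Python release. The helper describe below is a hypothetical name used for this sketch.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (name, general category) for one character."""
    # The second argument of name() is a fallback for code points that have
    # no name in the database (for example, most control characters).
    return unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch)


for ch in ("A", "5", "\u20ac"):
    name, category = describe(ch)
    print(f"U+{ord(ch):04X} {name} [{category}]")
```

Here "Lu" stands for an uppercase letter, "Nd" for a decimal digit and "Sc" for a currency symbol.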

Specifically:

Example: scripts

Detection

See BOM (byte order mark)
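As a rough sketch of BOM-based detection, the Python function below compares the first bytes of some data with the byte order marks defined in the standard codecs module. The name detect_bom is hypothetical, and the approach only works when a BOM is actually present, which many UTF-8 files do not have.

```python
import codecs

# UTF-32 marks are tested before UTF-16 ones because the UTF-32LE mark
# (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def detect_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

One known limitation of this sketch: UTF-16LE text that starts with a BOM followed by a NUL character is indistinguishable from UTF-32LE at the byte level and is reported as UTF-32LE.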

Unicode and Computer Language

Unicode is the native character set of many technologies, including:

  • Java,
  • XML,
  • XHTML,
  • ECMAScript (JavaScript),
  • and LDAP.

To ASCII Punycode

Unicode characters may be translated into the ASCII character set via the Punycode encoding (used, for example, in internationalized domain names).
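Python ships a punycode codec, which makes the translation easy to try. The sketch below is illustrative: the helper names to_punycode and from_punycode are hypothetical wrappers around that codec.

```python
def to_punycode(text: str) -> str:
    """Encode Unicode text as an ASCII-only Punycode string."""
    return text.encode("punycode").decode("ascii")


def from_punycode(ascii_text: str) -> str:
    """Decode a Punycode string back to the original Unicode text."""
    return ascii_text.encode("ascii").decode("punycode")


word = "b\u00fccher"           # the German word for "books", with u-umlaut
encoded = to_punycode(word)    # "bcher-kva"
decoded = from_punycode(encoded)
```

In domain names, the encoded label is additionally prefixed with "xn--", so this word appears as xn--bcher-kva.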

Invalid Sequence to Replacement Character

When UTF-8 bytes are decoded for display and an invalid sequence is found, the invalid sequence is generally replaced with the � character (i.e. U+FFFD REPLACEMENT CHARACTER).
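This behaviour can be reproduced in Python, where the decoder only substitutes U+FFFD when asked to with errors="replace" (the default, strict mode raises an exception instead). The helper name decode_lossy and the sample bytes are made up for this sketch.

```python
def decode_lossy(data: bytes) -> str:
    """Decode UTF-8, replacing invalid sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# Valid UTF-8 for "cafe" with an acute accent (C3 A9), followed by the
# byte FF, which can never appear in well-formed UTF-8.
raw = b"caf\xc3\xa9 \xff ok"
text = decode_lossy(raw)

try:
    raw.decode("utf-8")  # strict mode
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

After running this, text contains a single U+FFFD in place of the FF byte, while the valid two-byte sequence is decoded normally.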
