Unicode - Surrogate pair (UTF-16)

About

A surrogate pair is a sequence of two 16-bit code units used in UTF-16 to represent a character whose code point is above the maximum value that fits in 16 bits (0xFFFF in hexadecimal, 65535 in decimal).

Why? Because Unicode defines far more than 65536 characters, a code point above 0xFFFF (from 0x10000 to 0x10FFFF) cannot fit in a single 16-bit code unit. It is represented instead by a pair of code units known as surrogates.

Unfortunately, this makes UTF-16 a variable-length encoding: every character above 0xFFFF takes two code units (four bytes) instead of one.
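
As a small illustration, the sketch below relies on the fact that JavaScript strings expose their UTF-16 code units. The variable names are illustrative only.

```javascript
// U+1F600 (😀) is above 0xFFFF, so UTF-16 needs two code units for it.
const smiley = '\u{1F600}';
// U+00E9 (é) is below 0xFFFF, so a single code unit is enough.
const eAcute = '\u00E9';

// String.length counts 16-bit code units, not characters.
const smileyUnits = smiley.length; // 2
const eAcuteUnits = eAcute.length; // 1

// charCodeAt returns the individual code units (the two surrogates).
const highHex = smiley.charCodeAt(0).toString(16); // 'd83d'
const lowHex = smiley.charCodeAt(1).toString(16); // 'de00'

console.log(smileyUnits, eAcuteUnits, highHex, lowHex);
```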

Syntax

A surrogate pair:

  • is composed of two code units (H and L), each with a value in a reserved range
  • represents a Unicode code point, known as S, with a value above 0xFFFF

Symbol | Name | Hexadecimal range
H | high or leading surrogate | 0xD800–0xDBFF
L | low or trailing surrogate | 0xDC00–0xDFFF
S | code point of the character | 0x10000–0x10FFFF
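
The ranges can be checked with simple comparisons. The following is a minimal sketch, assuming that S runs from 0x10000 to 0x10FFFF (the upper limit of Unicode). The function names are hypothetical and chosen for illustration.

```javascript
// Hypothetical helpers: test which range a 16-bit code unit falls in.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// A code point needs a surrogate pair only when it is above 0xFFFF.
function needsSurrogatePair(codepoint) {
  return codepoint >= 0x10000 && codepoint <= 0x10FFFF;
}

console.log(isHighSurrogate(0xD83D)); // true
console.log(isLowSurrogate(0xDE00)); // true
console.log(needsSurrogatePair(0x1F600)); // true
console.log(needsSurrogatePair(0x00E9)); // false
```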

Formula

There is more than one formula. The one shown here is the simplest.

From character code to surrogate pair

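// constant
const LEAD_OFFSET = 0xD800 - (0x10000 >> 10);

// example input: a code point above 0xFFFF (here 0x1F600, 😀)
const codepoint = 0x1F600;

// computations
const lead = LEAD_OFFSET + (codepoint >> 10); // 0xD83D
const trail = 0xDC00 + (codepoint & 0x3FF); // 0xDE00

where:

  • codepoint is the character code (S)
  • lead is the high surrogate (H)
  • trail is the low surrogate (L)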
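
The same computation can be wrapped in a function. This is a sketch rather than a definitive implementation: the name toSurrogatePair and the range check are additions made for illustration, and the result is cross-checked against the built-in String.fromCodePoint.

```javascript
const LEAD_OFFSET = 0xD800 - (0x10000 >> 10);

// Hypothetical helper: returns [lead, trail] for a code point above 0xFFFF.
function toSurrogatePair(codepoint) {
  if (!Number.isInteger(codepoint) || codepoint < 0x10000 || codepoint > 0x10FFFF) {
    throw new RangeError('code point must be between 0x10000 and 0x10FFFF');
  }
  const lead = LEAD_OFFSET + (codepoint >> 10); // top 10 bits, shifted into 0xD800-0xDBFF
  const trail = 0xDC00 + (codepoint & 0x3FF); // low 10 bits, shifted into 0xDC00-0xDFFF
  return [lead, trail];
}

const [smileyLead, smileyTrail] = toSurrogatePair(0x1F600);
console.log(smileyLead.toString(16), smileyTrail.toString(16)); // d83d de00

// Cross-check: the two code units form the same string as the built-in conversion.
console.log(String.fromCharCode(smileyLead, smileyTrail) === String.fromCodePoint(0x1F600)); // true
```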

From surrogate to character code

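const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// example input: the pair 0xD83D 0xDE00 (😀)
const lead = 0xD83D, trail = 0xDE00;
const codepoint = (lead << 10) + trail + SURROGATE_OFFSET; // 0x1F600

where:

  • lead is the high surrogate (H)
  • trail is the low surrogate (L)
  • codepoint is the resulting character code (S)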
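
As a sketch of the reverse direction, the function below applies the formula above and compares it with an equivalent form that subtracts the start of each range explicitly. Both function names and the range checks are assumptions made for illustration.

```javascript
const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// Hypothetical helper: combines a high and a low surrogate into a code point.
function fromSurrogatePair(lead, trail) {
  if (lead < 0xD800 || lead > 0xDBFF) {
    throw new RangeError('lead must be between 0xD800 and 0xDBFF');
  }
  if (trail < 0xDC00 || trail > 0xDFFF) {
    throw new RangeError('trail must be between 0xDC00 and 0xDFFF');
  }
  return (lead << 10) + trail + SURROGATE_OFFSET;
}

// Equivalent form: remove each range start, then add back 0x10000.
function fromSurrogatePairExplicit(lead, trail) {
  return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
}

const decoded = fromSurrogatePair(0xD83D, 0xDE00);
console.log(decoded.toString(16)); // 1f600
console.log(decoded === fromSurrogatePairExplicit(0xD83D, 0xDE00)); // true
console.log(decoded === '\uD83D\uDE00'.codePointAt(0)); // true
```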

Usage

In a string, a surrogate pair is written as the high surrogate immediately followed by the low surrogate:

<MATH> \text{Surrogate Pair} = \text{(High Surrogate)}\text{(Low Surrogate)} \\ S = HL </MATH>

Example in Javascript

console.log('\uD83D\uDC69');
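
JavaScript also provides built-in functions that handle the pair as a whole. The sketch below uses the standard String.prototype.codePointAt and String.fromCodePoint on the same string. The variable names are illustrative.

```javascript
const woman = '\uD83D\uDC69'; // high surrogate followed by low surrogate

// codePointAt(0) reads the whole pair and returns the code point S.
const womanCodePoint = woman.codePointAt(0);

// String.fromCodePoint builds the pair back from S.
const rebuilt = String.fromCodePoint(0x1F469);

// Spreading a string iterates by code point, so the pair stays together,
// whereas length counts the two code units separately.
const codePoints = [...woman];

console.log(womanCodePoint.toString(16)); // 1f469
console.log(rebuilt === woman); // true
console.log(codePoints.length, woman.length); // 1 2
```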

Management

Concatenate

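Some surrogate pairs can be concatenated with other code points (for instance the zero-width joiner 0x200D) so that the sequence is displayed as a single character. The result is not a new code point: the code points stay separate in the string, and the combination happens at rendering.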

console.log("Emoji concatenation");
console.log('\uD83D\uDC69 + \u200D\u2764\uFE0F\u200D = \uD83D\uDC69\u200D\u2764\uFE0F\u200D');
console.log("");
console.log("Does not work for all combinations");
console.log('\uD83D\uDE00 + \u200D\u2764\uFE0F\u200D = \uD83D\uDE00\u200D\u2764\uFE0F\u200D');
console.log("");
console.log("Also works with a base character and a combining mark");
console.log('\u0065 + \u0301 = \u0065\u0301');
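
To see what concatenation does at the string level, the sketch below counts code units and code points in the first sequence above, then uses the standard String.prototype.normalize on the base character example. The variable names are illustrative.

```javascript
// Woman (surrogate pair) + zero-width joiner + heart + variation selector + zero-width joiner
const sequence = '\uD83D\uDC69\u200D\u2764\uFE0F\u200D';

const codeUnitCount = sequence.length; // 6: the pair counts for two
const hexCodePoints = [...sequence].map((c) => c.codePointAt(0).toString(16));
// 5 code points: 1f469, 200d, 2764, fe0f, 200d

// A base character followed by a combining mark stays as two code points...
const decomposed = '\u0065\u0301';
// ...unless it is normalized to its precomposed form (NFC), here U+00E9.
const composed = decomposed.normalize('NFC');

console.log(codeUnitCount, hexCodePoints.join(' '));
console.log(decomposed.length, composed.length, composed === '\u00E9'); // 2 1 true
```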
