Unicode - Surrogate pair (UTF-16)

About

A surrogate pair is a pair of 16-bit (two-byte) code units used in UTF-16 to represent a character whose code point is above the maximum value that fits in 16 bits (0xFFFF in hexadecimal, 65535 in decimal).

Why? Because Unicode defines far more code points than a single 16-bit unit can address. To represent a code point (character) above 0xFFFF (that is, from 0x10000 to 0x10FFFF), UTF-16 uses a pair of code units known as surrogates.

As a consequence, UTF-16 needs two code units (four bytes) to encode a single character above 0xFFFF.
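JavaScript strings are sequences of UTF-16 code units, so the effect can be observed directly. The following is a minimal sketch; the variable names are illustrative only.

```javascript
// U+1F600 is above 0xFFFF, so it occupies two UTF-16 code units.
const smiley = '\u{1F600}';
// U+00E9 is below 0xFFFF, so it occupies a single code unit.
const eAcute = '\u00E9';

// length counts code units, not characters
const smileyUnits = smiley.length;           // 2
const eAcuteUnits = eAcute.length;           // 1
// string iteration walks code points, so the pair stays together
const smileyCodePoints = [...smiley].length; // 1

console.log(smileyUnits, eAcuteUnits, smileyCodePoints);
```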

Calculator

This calculator is implemented in JavaScript with the formula described below.

The default character for the calculator is the Unicode character 0x1F600 - 😀

From Character Code to Surrogate Pair
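As a cross-check for this direction, the conversion can also be sketched with JavaScript built-ins alone. This is not the calculator's own code, and `codePointToUnits` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the UTF-16 code units of a code point as hex strings.
function codePointToUnits(codePoint) {
  // String.fromCodePoint encodes the code point as UTF-16 code units
  const text = String.fromCodePoint(codePoint);
  const units = [];
  for (let i = 0; i < text.length; i++) {
    // charCodeAt reads one 16-bit code unit at a time
    units.push('0x' + text.charCodeAt(i).toString(16).toUpperCase());
  }
  return units;
}

console.log(codePointToUnits(0x1F600)); // [ '0xD83D', '0xDE00' ]
```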

From Surrogate Pair to Character Code
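The reverse direction can be sketched with built-ins in the same way. Again, this is not the calculator's own code, and `unitsToCodePoint` is a hypothetical helper name.

```javascript
// Hypothetical helper: rebuilds the code point from a high and a low surrogate.
function unitsToCodePoint(high, low) {
  // String.fromCharCode places the two code units side by side in a string
  const text = String.fromCharCode(high, low);
  // codePointAt decodes a valid surrogate pair starting at index 0
  return text.codePointAt(0);
}

const decoded = unitsToCodePoint(0xD83D, 0xDE00);
console.log('0x' + decoded.toString(16).toUpperCase()); // 0x1F600
```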

Syntax

A surrogate pair:

is composed of two code units (H and L) ¹⁾, each with a value in a reserved range
represents a single Unicode code point, known as S, with a value above 0xFFFF

Symbol	Name	Hexa Range
H	high or leading	0xD800–0xDBFF
L	low or trailing	0xDC00–0xDFFF
S	surrogate character code	0x10000–0x10FFFF
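The ranges in the table are enough to classify any code unit of a string. A minimal sketch, assuming hypothetical helper names `isHighSurrogate` and `isLowSurrogate`:

```javascript
// Range checks taken from the table above (helper names are illustrative).
const isHighSurrogate = (unit) => unit >= 0xD800 && unit <= 0xDBFF;
const isLowSurrogate = (unit) => unit >= 0xDC00 && unit <= 0xDFFF;

// 'a' followed by U+1F600 written as its surrogate pair
const text = 'a\uD83D\uDE00';
const kinds = [];
for (let i = 0; i < text.length; i++) {
  const unit = text.charCodeAt(i);
  // H = high surrogate, L = low surrogate, - = ordinary code unit
  kinds.push(isHighSurrogate(unit) ? 'H' : isLowSurrogate(unit) ? 'L' : '-');
}

console.log(kinds.join(' ')); // - H L
```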

Formula

There is more than one formula ²⁾. The one shown here is the simplest ³⁾.

From character code to surrogate pair

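// constants
const LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
// SURROGATE_OFFSET is only needed for the reverse conversion
const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// computations (codepoint is a number between 0x10000 and 0x10FFFF)
const lead = LEAD_OFFSET + (codepoint >> 10);
const trail = 0xDC00 + (codepoint & 0x3FF);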

where:

the values are in hexadecimal
>> and << are the bitwise shift operators
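The formula can be wrapped in a function and checked against a known value. This is a sketch only: the function name `toSurrogatePair` and the range check are additions that are not part of the formula itself.

```javascript
const LEAD_OFFSET = 0xD800 - (0x10000 >> 10);

// Hypothetical wrapper around the formula above.
function toSurrogatePair(codepoint) {
  // The formula is only meaningful for code points above 0xFFFF
  if (!Number.isInteger(codepoint) || codepoint < 0x10000 || codepoint > 0x10FFFF) {
    throw new RangeError('expected a code point between 0x10000 and 0x10FFFF');
  }
  return {
    lead: LEAD_OFFSET + (codepoint >> 10),  // upper bits select the high surrogate
    trail: 0xDC00 + (codepoint & 0x3FF),    // lower 10 bits select the low surrogate
  };
}

const pair = toSurrogatePair(0x1F600);
console.log(pair.lead.toString(16), pair.trail.toString(16)); // d83d de00
```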

From surrogate pair to character code

const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

const codepoint = (lead << 10) + trail + SURROGATE_OFFSET;

where:

the values are in hexadecimal
<< is the bitwise left shift operator
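The reverse formula can be wrapped and checked the same way. Again a sketch: the function name `fromSurrogatePair` and the range checks are additions for illustration.

```javascript
const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// Hypothetical wrapper around the formula above.
function fromSurrogatePair(lead, trail) {
  if (lead < 0xD800 || lead > 0xDBFF) {
    throw new RangeError('lead must be between 0xD800 and 0xDBFF');
  }
  if (trail < 0xDC00 || trail > 0xDFFF) {
    throw new RangeError('trail must be between 0xDC00 and 0xDFFF');
  }
  return (lead << 10) + trail + SURROGATE_OFFSET;
}

const codepoint = fromSurrogatePair(0xD83D, 0xDE00);
console.log('0x' + codepoint.toString(16).toUpperCase()); // 0x1F600
```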

Usage

When used in a string, the high surrogate comes first and is immediately followed by the low surrogate:

Surrogate Pair = (High Surrogate)(Low Surrogate)
S = HL

Example in Javascript

console.log('\uD83D\uDC69');
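Since ES2015, JavaScript also accepts a code point escape, which produces the same string without writing the pair by hand. A minimal sketch with illustrative variable names:

```javascript
// The same character written as a surrogate pair and as a code point escape
const withSurrogates = '\uD83D\uDC69';
const withCodePoint = '\u{1F469}';

// Both literals produce identical strings of two code units
console.log(withSurrogates === withCodePoint);            // true
console.log(withSurrogates.codePointAt(0).toString(16));  // 1f469
```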

Management

Concatenate

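Some pairs can be concatenated with other code points (for example with the zero-width joiner U+200D or a combining mark) to obtain a single displayed character. The result is a sequence of code points rendered as one glyph, not a new code point.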

console.log("Emoji concatenation");
console.log('\uD83D\uDC69 + \u200D\u2764\uFE0F\u200D = \uD83D\uDC69\u200D\u2764\uFE0F\u200D');
console.log("");
console.log("Does not work for all combinations");
console.log('\uD83D\uDE00 + \u200D\u2764\uFE0F\u200D = \uD83D\uDE00\u200D\u2764\uFE0F\u200D');
console.log("");
console.log("Also works with a base character and a combining mark");
console.log('\u0065 + \u0301 = \u0065\u0301');
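Counting code units and code points shows what concatenation actually produces. The sketch below uses illustrative variable names and relies on the standard `String.prototype.normalize` method for the combining mark case.

```javascript
// Woman + ZWJ + heart + variation selector + ZWJ, as in the example above
const sequence = '\uD83D\uDC69\u200D\u2764\uFE0F\u200D';
const sequenceUnits = sequence.length;            // 6 code units
const sequenceCodePoints = [...sequence].length;  // 5 code points, still separate

// Base character + combining acute accent stay two code points...
const decomposed = '\u0065\u0301';
const decomposedCodePoints = [...decomposed].length; // 2
// ...unless the string is normalized to its composed form (NFC)
const composed = decomposed.normalize('NFC');

console.log(sequenceUnits, sequenceCodePoints, decomposedCodePoints);
console.log(composed === '\u00E9'); // true
```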

Documentation / Reference

Calculator Reference

¹⁾ What are surrogates?

²⁾ What’s the algorithm to convert from UTF-16 to character codes?

³⁾ Isn’t there a simpler way to do this?