Unicode - Surrogate pair (UTF-16)

A surrogate pair is two 16-bit code units used in UTF-16 to represent a single character whose code point is above the maximum value that fits in 16 bits (0xFFFF in hexadecimal, 65535 in decimal).

Why? Because the Unicode set contains far more than 65536 characters, a code point above 0xFFFF (that is, from 0x10000 to 0x10FFFF) cannot be stored in a single 16-bit code unit. It is represented instead by a pair of code units known as surrogates.

As a consequence, a string encoded in UTF-16 needs two code units for every character above 0xFFFF, so the number of code units is not always the number of characters.
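The effect is easy to observe in JavaScript, whose strings are sequences of UTF-16 code units. The following is a minimal sketch; the variable names are only illustrative.

```javascript
// U+1F600 (😀) lies above 0xFFFF, so UTF-16 stores it as two code units.
const smiley = String.fromCodePoint(0x1F600);

const unitCount = smiley.length;                  // counts UTF-16 code units
const codePointCount = Array.from(smiley).length; // iterates by code point
const firstUnit = smiley.charCodeAt(0);           // high (leading) surrogate
const secondUnit = smiley.charCodeAt(1);          // low (trailing) surrogate

console.log(unitCount, codePointCount);                       // 2 1
console.log(firstUnit.toString(16), secondUnit.toString(16)); // d83d de00
```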

Calculator

This calculator was implemented:

• with JavaScript
• and the formula described below.

The default character for the calculator is the Unicode character 0x1F600 - 😀

Syntax

A surrogate pair:

• is composed of two code units, H and L, each with a value in a reserved range
• represents a Unicode code point, known as S, with a value above 0xFFFF

Symbol  Name                      Hexa Range
H       high or leading           0xD800–0xDBFF
L       low or trailing           0xDC00–0xDFFF
S       surrogate character code  0x10000–0x10FFFF
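The ranges can be turned into simple range checks. This is a sketch that assumes the ranges defined by the Unicode standard (0xD800–0xDBFF for the high surrogate, 0xDC00–0xDFFF for the low surrogate); the function names are hypothetical helpers, not a built-in API.

```javascript
// Hypothetical helpers: classify a 16-bit code unit or a code point.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// True when a code point cannot fit in one UTF-16 code unit.
function needsSurrogatePair(codepoint) {
  return codepoint >= 0x10000 && codepoint <= 0x10FFFF;
}

const text = '\uD83D\uDE00'; // 😀
console.log(isHighSurrogate(text.charCodeAt(0))); // true
console.log(isLowSurrogate(text.charCodeAt(1)));  // true
console.log(needsSurrogatePair(0x1F600));         // true
console.log(needsSurrogatePair(0x00E9));          // false
```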

Formula

There is more than one formula. The one shown here is the simplest.

From character code to surrogate pair

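// constants
const LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00; // only needed for the reverse conversion

// example input: any code point from 0x10000 to 0x10FFFF
const codepoint = 0x1F600;

// computations
const lead = LEAD_OFFSET + (codepoint >> 10);
const trail = 0xDC00 + (codepoint & 0x3FF);


where:

• codepoint is the character code (S), from 0x10000 to 0x10FFFF
• lead is the high surrogate (H)
• trail is the low surrogate (L)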
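As an illustration, the computation can be wrapped in a function and checked against the built-in string methods. This is a minimal sketch: toSurrogatePair is a hypothetical name, and the range check is an added assumption about how invalid input should be handled.

```javascript
const LEAD_OFFSET = 0xD800 - (0x10000 >> 10);

// Hypothetical helper: split a code point above 0xFFFF into [lead, trail].
function toSurrogatePair(codepoint) {
  if (!Number.isInteger(codepoint) || codepoint < 0x10000 || codepoint > 0x10FFFF) {
    throw new RangeError('code point must be in 0x10000..0x10FFFF');
  }
  const lead = LEAD_OFFSET + (codepoint >> 10); // high surrogate
  const trail = 0xDC00 + (codepoint & 0x3FF);   // low surrogate
  return [lead, trail];
}

const [lead, trail] = toSurrogatePair(0x1F600);
console.log(lead.toString(16), trail.toString(16)); // d83d de00

// The pair builds the same string as the code point itself.
console.log(String.fromCharCode(lead, trail) === String.fromCodePoint(0x1F600)); // true
```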

From surrogate to character code

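// constant
const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// example input: the surrogate pair of 😀
const lead = 0xD83D;
const trail = 0xDE00;

// computation
const codepoint = (lead << 10) + trail + SURROGATE_OFFSET;


where:

• lead is the high surrogate (H), from 0xD800 to 0xDBFF
• trail is the low surrogate (L), from 0xDC00 to 0xDFFF
• codepoint is the resulting character code (S)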
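The reverse computation can be sketched the same way and compared with codePointAt. The name fromSurrogatePair and the range checks are assumptions made for the example.

```javascript
const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// Hypothetical helper: combine a high and a low surrogate into a code point.
function fromSurrogatePair(lead, trail) {
  if (lead < 0xD800 || lead > 0xDBFF) {
    throw new RangeError('lead must be in 0xD800..0xDBFF');
  }
  if (trail < 0xDC00 || trail > 0xDFFF) {
    throw new RangeError('trail must be in 0xDC00..0xDFFF');
  }
  return (lead << 10) + trail + SURROGATE_OFFSET;
}

const decoded = fromSurrogatePair(0xD83D, 0xDE00);
console.log(decoded.toString(16));                          // 1f600
console.log(decoded === '\uD83D\uDE00'.codePointAt(0));     // true
```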

Usage

In a string, the high surrogate is written first, immediately followed by the low surrogate: $$\text{Surrogate Pair} = \text{(High Surrogate)}\text{(Low Surrogate)} \\ S = HL$$

Example in JavaScript

console.log('\uD83D\uDC69');
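The same character can be written in several equivalent ways, and splitting a string by code unit breaks the pair apart. A short sketch, with illustrative variable names:

```javascript
const woman = '\uD83D\uDC69';                              // 👩 as a surrogate pair
const sameFromUnits = String.fromCharCode(0xD83D, 0xDC69); // built from the two code units
const sameFromEscape = '\u{1F469}';                        // code point escape (ES2015)

console.log(woman === sameFromUnits);  // true
console.log(woman === sameFromEscape); // true

// split('') cuts between code units and leaves two lone surrogates,
// while the spread operator iterates by code point and keeps the pair together.
const units = woman.split('');
const points = [...woman];
console.log(units.length, points.length); // 2 1
```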


Management

Concatenate

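Some characters can be concatenated with joiner or combining characters so that the whole sequence is displayed as a single symbol. The result is not a new code point: it remains a sequence of several code points that the renderer may draw as one glyph.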

console.log("Emoji concatenation");
console.log('\uD83D\uDC69 + \u200D\u2764\uFE0F\u200D = \uD83D\uDC69\u200D\u2764\uFE0F\u200D');
console.log("");
console.log("Does not work for all combinations");
console.log('\uD83D\uDE00 + \u200D\u2764\uFE0F\u200D = \uD83D\uDE00\u200D\u2764\uFE0F\u200D');
console.log("");
console.log("Also works with a base character and a combining mark");
console.log('\u0065 + \u0301 = \u0065\u0301');
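To see what a concatenated sequence actually contains, its code points can be listed. The sketch below reuses the strings from the example above; the variable names are illustrative. It also shows that the base character plus combining accent has a precomposed equivalent, obtained with the standard normalize method.

```javascript
// Woman + (zero width joiner, heavy black heart, variation selector-16, zero width joiner)
const womanHeart = '\uD83D\uDC69' + '\u200D\u2764\uFE0F\u200D';

// List each code point in hexadecimal: the sequence stays made of several code points.
const codePoints = Array.from(womanHeart, (ch) => ch.codePointAt(0).toString(16));
console.log(codePoints);        // [ '1f469', '200d', '2764', 'fe0f', '200d' ]
console.log(womanHeart.length); // 6 (UTF-16 code units)

// "e" followed by a combining acute accent has a precomposed form, U+00E9.
const decomposed = '\u0065\u0301';
const composed = decomposed.normalize('NFC');
console.log(decomposed.length, composed.length); // 2 1
console.log(composed === '\u00E9');              // true
```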