About
A surrogate pair is two 16-bit code units used in UTF-16 (an encoding with 16-bit, two-byte code units) that together represent a character above the maximum value that fits in 16 bits (0xFFFF in hexadecimal, 65535 in decimal).
Why? Because Unicode defines far more code points than the 65536 values that 16 bits can address. To represent a code point (character) above 0xFFFF (that is, from 0x10000 to 0x10FFFF), a pair of code units known as surrogates is used.
As a consequence, UTF-16 needs two code units (four bytes) to encode a single character above 0xFFFF.
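JavaScript strings are sequences of UTF-16 code units, so the effect can be observed directly. The following is a minimal sketch; the variable name `grinning` is chosen only for illustration.

```javascript
// U+1F600 (😀) lies above 0xFFFF, so it is stored as two code units.
const grinning = '\u{1F600}';

console.log(grinning.length);                      // 2 (code units)
console.log(grinning.codePointAt(0).toString(16)); // 1f600
console.log(grinning.charCodeAt(0).toString(16));  // d83d (high surrogate)
console.log(grinning.charCodeAt(1).toString(16));  // de00 (low surrogate)

// U+00E9 (é) fits in 16 bits, so it needs a single code unit.
console.log('\u00E9'.length);                      // 1
```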
Syntax
A surrogate pair:
- is composed of two code units (H and L), each with a value in a reserved range
- represents a Unicode code point, known as S, with a value above 0xFFFF
Symbol | Name | Hexadecimal range |
---|---|---|
H | high or leading surrogate | 0xD800–0xDBFF |
L | low or trailing surrogate | 0xDC00–0xDFFF |
S | code point represented by the pair | 0x10000–0x10FFFF |
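The ranges in the table translate into simple comparisons. The helpers below are a hypothetical sketch (the function names are not part of any standard API) that classifies a 16-bit code unit and checks whether two consecutive units form a valid pair.

```javascript
// Hypothetical helpers: classify a code unit using the H and L ranges.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// A valid pair is a high surrogate immediately followed by a low surrogate.
function isSurrogatePair(first, second) {
  return isHighSurrogate(first) && isLowSurrogate(second);
}

console.log(isSurrogatePair(0xD83D, 0xDE00)); // true
console.log(isSurrogatePair(0xDE00, 0xD83D)); // false (wrong order)
console.log(isHighSurrogate(0x0041));         // false ('A' is not a surrogate)
```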
Formula
There is more than one formula. The one shown here is the simple one.
From character code to surrogate pair
// constant
const LEAD_OFFSET = 0xD800 - (0x10000 >> 10); // 0xD7C0
// example input: any code point from 0x10000 to 0x10FFFF
const codepoint = 0x1F600;
// computations
const lead = LEAD_OFFSET + (codepoint >> 10); // high (leading) surrogate
const trail = 0xDC00 + (codepoint & 0x3FF); // low (trailing) surrogate
where:
- the constants are written in hexadecimal
- >> is the bitwise right shift operator
- & is the bitwise AND operator
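The formula can be wrapped in a function and checked against the engine's own encoding. This is a sketch under assumptions: `toSurrogatePair` is a hypothetical name, and the range check is an addition that the formula itself does not include.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its surrogate pair.
function toSurrogatePair(codepoint) {
  // Assumption: reject anything outside 0x10000..0x10FFFF instead of returning garbage.
  if (!Number.isInteger(codepoint) || codepoint < 0x10000 || codepoint > 0x10FFFF) {
    throw new RangeError('expected a code point from 0x10000 to 0x10FFFF');
  }
  const LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
  return {
    lead: LEAD_OFFSET + (codepoint >> 10),
    trail: 0xDC00 + (codepoint & 0x3FF),
  };
}

const pair = toSurrogatePair(0x1F600);
console.log(pair.lead.toString(16), pair.trail.toString(16)); // d83d de00

// Cross-check: the built-in String.fromCodePoint produces the same two code units.
console.log(String.fromCharCode(pair.lead, pair.trail) === String.fromCodePoint(0x1F600)); // true
```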
From surrogate to character code
const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;
const lead = 0xD83D, trail = 0xDE00; // example input: the pair for 0x1F600
const codepoint = (lead << 10) + trail + SURROGATE_OFFSET;
where:
- the constants are written in hexadecimal
- << is the bitwise left shift operator
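The reverse direction can be sketched the same way. `fromSurrogatePair` is a hypothetical name, and the range checks are an added assumption so that invalid input fails loudly.

```javascript
// Hypothetical helper: combine a high and a low surrogate into one code point.
function fromSurrogatePair(lead, trail) {
  if (lead < 0xD800 || lead > 0xDBFF) {
    throw new RangeError('lead must be in 0xD800..0xDBFF');
  }
  if (trail < 0xDC00 || trail > 0xDFFF) {
    throw new RangeError('trail must be in 0xDC00..0xDFFF');
  }
  const SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;
  return (lead << 10) + trail + SURROGATE_OFFSET;
}

console.log(fromSurrogatePair(0xD83D, 0xDE00).toString(16)); // 1f600

// Cross-check against the built-in decoding of the same two code units.
console.log(fromSurrogatePair(0xD83D, 0xDE00) === '\uD83D\uDE00'.codePointAt(0)); // true
```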
Usage
In a string, the high surrogate is written first, immediately followed by the low surrogate: <MATH> \text{Surrogate Pair} = \text{(High Surrogate)}\text{(Low Surrogate)} \\ S = HL </MATH>
Example in Javascript
console.log('\uD83D\uDC69');
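Once the pair is in a string, some string methods see the two code units and others see the single code point. The sketch below uses only standard String methods; the variable name `woman` is illustrative.

```javascript
const woman = '\uD83D\uDC69'; // the surrogate pair for U+1F469

// charCodeAt works on code units: each call returns one half of the pair.
console.log(woman.charCodeAt(0).toString(16)); // d83d
console.log(woman.charCodeAt(1).toString(16)); // dc69

// codePointAt, the string iterator and String.fromCodePoint work on code points.
console.log(woman.codePointAt(0).toString(16));       // 1f469
console.log([...woman].length);                       // 1
console.log(String.fromCodePoint(0x1F469) === woman); // true
```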
Management
Concatenate
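Some pairs can be concatenated with other code points, such as the zero-width joiner (0x200D) or a combining mark, so that the sequence may be displayed as a single character. The result is still a sequence of several code points, not a new code point, and whether it renders as one glyph depends on the combination and on the font.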
console.log("Emoji concatenation");
console.log('\uD83D\uDC69 + \u200D\u2764\uFE0F\u200D = \uD83D\uDC69\u200D\u2764\uFE0F\u200D');
console.log("");
console.log("Does not work for all combinations");
console.log('\uD83D\uDE00 + \u200D\u2764\uFE0F\u200D = \uD83D\uDE00\u200D\u2764\uFE0F\u200D');
console.log("");
console.log("Also works with a base character followed by a combining mark");
console.log('\u0065 + \u0301 = \u0065\u0301');
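To see what concatenation does at the code unit and code point level, the sequences can be measured. This is a sketch using only standard String methods; the variable names are illustrative.

```javascript
// Woman + zero-width joiner + heart + variation selector + zero-width joiner
const sequence = '\uD83D\uDC69\u200D\u2764\uFE0F\u200D';
console.log(sequence.length);      // 6 (code units: the pair counts for two)
console.log([...sequence].length); // 5 (code points: the pair counts for one)

// Base character 'e' followed by the combining acute accent
const eAcute = '\u0065\u0301';
console.log(eAcute.length);                        // 2
// NFC normalization composes this particular sequence into the single code point U+00E9.
console.log(eAcute.normalize('NFC') === '\u00E9'); // true
```

Normalization composes only sequences for which Unicode defines a precomposed character; emoji joined with the zero-width joiner stay as multiple code points.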