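Bytes 0xC2 to 0xF4 are header bytes in a multi-byte sequence under the current definition of UTF-8. The original design allowed header bytes up to 0xFD, and 0xC0 and 0xC1 fit the header bit pattern but could only start overlong encodings, so they never occur in valid UTF-8. A header byte MUST be the first byte of its sequence, and the number of “one” bits above the topmost “zero” bit gives the number of bytes (including this one) in the whole sequence.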
Bytes 0x80 to 0xBF are trailer bytes in a multi-byte sequence. They can be any byte in the sequence except the first.
In the bytes of a multi-byte sequence, all bits after the topmost “zero” bit in each byte constitute the payload: they are data bits, and in UTF-8 the most significant bits always come first.
Bytes 0xFE and 0xFF are always invalid anywhere in UTF-8 text.
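The byte classes described above can be sketched in a few lines of Python. This is only an illustration: the function names `classify_byte` and `sequence_length` are made up for this example, and the sketch assumes the current definition of UTF-8, in which 0xC0, 0xC1 and everything from 0xF5 upward are treated as invalid.

```python
def classify_byte(b: int) -> str:
    """Classify one byte value (0..255) by its role in UTF-8 text."""
    if b < 0x80:
        return "ascii"      # 0xxxxxxx: a complete one-byte sequence
    if b < 0xC0:
        return "trailer"    # 10xxxxxx: any byte of a sequence except the first
    if b in (0xC0, 0xC1) or b > 0xF4:
        return "invalid"    # assumption: current UTF-8 rules, not the original design
    return "header"         # 110xxxxx, 1110xxxx or 11110xxx


def sequence_length(header: int) -> int:
    """Count the 'one' bits above the topmost 'zero' bit of a header byte."""
    length = 0
    mask = 0x80
    while header & mask:
        length += 1
        mask >>= 1
    return length


print(classify_byte(0xE9), sequence_length(0xE9))  # header 3
print(classify_byte(0xA6))                         # trailer
print(classify_byte(0xFF))                         # invalid
```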
The Unicode code space was originally foreseen as ranging from U+0000 to U+7FFFFFFF, but the current standards limit it to U+0000 through U+10FFFF. Within that range, code points whose hex representation is xxFFFE or xxFFFF (where xx is anything from 00 to 10) have been expressly designated as noncharacters, never to be assigned to a character, so the highest code point that can ever be a character is U+10FFFD.
Note that some hanzi are above U+20000; their UTF-8 encoding consists of four bytes, not three: e.g. 𠄣 = U+20123 = UTF-8 0xF0 0xA0 0x84 0xA3 = %F0%A0%84%A3 in “percent-escaped” URL coding.
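The four-byte example can be checked with Python's standard library. This is a small sketch; the variable names `ch` and `encoded` are chosen just for this illustration.

```python
from urllib.parse import quote

ch = "\U00020123"              # the hanzi U+20123
encoded = ch.encode("utf-8")   # its UTF-8 byte sequence

print(len(encoded))            # 4
print(encoded.hex())           # f0a084a3
print(quote(ch))               # %F0%A0%84%A3
```

`quote` encodes its argument as UTF-8 by default and then percent-escapes each byte, which is why the four bytes show up as four `%XX` groups.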
Problem: translate the 3-byte sequence E9 A6 AC into a Unicode code point.
- 0xE9 = 1110.1001 binary is a header byte
- payload: 1001
- 0xA6 = 1010.0110 binary is a trailer byte
- payload: 10.0110
- 0xAC = 1010.1100 binary is a trailer byte
- payload: 10.1100
Result: the concatenated payload bits, regrouped in fours, are 1001.1001.1010.1100 binary, or U+99AC (馬)
Problem: translate the 3-byte sequence E2 80 93 into a Unicode code point.
- 0xE2 = 1110.0010 binary is a header byte
- payload: 0010
- 0x80 = 1000.0000 binary is a trailer byte
- payload: 00.0000
- 0x93 = 1001.0011 binary is a trailer byte
- payload: 01.0011
Result: the concatenated payload bits, regrouped in fours, are 0010.0000.0001.0011 binary, or U+2013 (en dash)
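The two worked examples follow the same steps, which can be written out as a short Python function. This is a minimal sketch, not a full validator: the name `decode_sequence` is hypothetical, it handles exactly one multi-byte sequence of 2 to 4 bytes, and it does not reject overlong encodings or surrogate code points.

```python
def decode_sequence(seq: bytes) -> int:
    """Decode one multi-byte UTF-8 sequence into a code point (minimal checks)."""
    header = seq[0]
    # Count the 'one' bits above the topmost 'zero' bit: the sequence length.
    length = 0
    mask = 0x80
    while header & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 4 or length != len(seq):
        raise ValueError("not a single well-formed multi-byte sequence")
    # mask now sits on the topmost zero bit; everything below it is payload.
    code_point = header & (mask - 1)
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a trailer byte of the form 10xxxxxx")
        # Each trailer byte contributes its low six bits, most significant first.
        code_point = (code_point << 6) | (b & 0x3F)
    return code_point


print(f"U+{decode_sequence(bytes.fromhex('e9 a6 ac')):04X}")  # U+99AC
print(f"U+{decode_sequence(bytes.fromhex('e2 80 93')):04X}")  # U+2013
```

In real code the built-in decoder does the same work with full validation, e.g. `bytes.fromhex('e9 a6 ac').decode('utf-8')`.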