Text - UTF-16 Character Set

About

UTF-16 is a variable-length encoding of Unicode: each code point is encoded with either one or two 16-bit code units. The size in memory of a string of n code points therefore depends on which code points the string contains.
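As a small illustration of the difference between code units and code points, the sketch below counts both for two sample strings. The helper names countCodeUnits and countCodePoints are hypothetical and chosen only for this example.

```javascript
// Hypothetical helpers: count UTF-16 code units versus Unicode code points.
function countCodeUnits(str) {
  // String.prototype.length counts 16-bit code units.
  return str.length;
}

function countCodePoints(str) {
  // The string iterator (used by the spread syntax) walks code points.
  return [...str].length;
}

const bmpChar = '\u00F8';       // ø, fits in one code unit
const astralChar = '\u{1F600}'; // an emoji above U+FFFF, needs two code units

console.log(countCodeUnits(bmpChar), countCodePoints(bmpChar));       // 1 1
console.log(countCodeUnits(astralChar), countCodePoints(astralChar)); // 2 1
```

Both strings hold a single code point, yet the second one occupies twice as many code units.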

Finding the nth code point of a string is no longer a constant-time operation as in UCS-2: it generally requires scanning from the beginning of the string.
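A minimal sketch of such a linear scan follows. The function name nthCodePoint is an assumption made for this example and is not a built-in; it relies on the string iterator, which yields one code point per step.

```javascript
// Hypothetical helper: return the code point at position n (0-based,
// counted in code points), or undefined when n is out of range.
function nthCodePoint(str, n) {
  let index = 0; // position counted in code points, not code units
  for (const ch of str) {
    // Each iteration yields one full code point (one or two code units).
    if (index === n) return ch.codePointAt(0);
    index++;
  }
  return undefined;
}

const text = 'a\u{1F600}b'; // 3 code points, 4 code units

console.log(text.length);                        // 4
console.log(nthCodePoint(text, 2).toString(16)); // 62 (the letter b)
console.log(text.charCodeAt(2).toString(16));    // de00 (second half of the emoji)
```

Indexing by code unit with charCodeAt(2) lands in the middle of the emoji, while the scan reaches the third code point.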

Management

How to represent characters above 16 bits

Unicode defines code points beyond the 16-bit range (up to U+10FFFF). To represent these additional characters, UTF-16 combines two 16-bit code units into what is called a surrogate pair.

For more, see Unicode - Surrogate pair (UTF-16)
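The arithmetic behind a surrogate pair can be sketched as follows. The function name toSurrogatePair is hypothetical; the constants 0x10000, 0xD800 and 0xDC00 are the ones defined by UTF-16.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its
// high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('code point must be in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);

console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === '\u{1F600}'); // true
```

Code points at or below U+FFFF need no pair, which is why the sketch rejects them.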

Example

JavaScript

The charCodeAt() method returns the UTF-16 code unit (an integer between 0 and 65535) at the given index.

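// charCodeAt(0) returns the first UTF-16 code unit as a decimal integer
const codePointDecimal = 'ø'.charCodeAt(0);
console.log(codePointDecimal); // 248
// Convert to hexadecimal, the notation used in Unicode charts
const codePointHexa = codePointDecimal.toString(16);
console.log(codePointHexa); // f8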

To check the result for ø, open the Windows Character Map, search for the hexadecimal value with leading zeros (00F8), and confirm that it shows the expected character.
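For a character above U+FFFF, charCodeAt() returns only one half of the surrogate pair. The sketch below compares it with codePointAt(), a standard String method that returns the full code point; the sample emoji and the variable names are chosen only for this example.

```javascript
// A character above U+FFFF is stored as two code units.
const face = '\u{1F600}';

const firstUnit = face.charCodeAt(0);  // high surrogate
const secondUnit = face.charCodeAt(1); // low surrogate
const codePoint = face.codePointAt(0); // full code point

console.log(firstUnit.toString(16));   // d83d
console.log(secondUnit.toString(16));  // de00
console.log(codePoint.toString(16));   // 1f600
```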
