Text - Character
1 - About
A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646] and is categorized as a primitive data type
A character is the smallest component of written language that has semantic value; refers to the abstract meaning and/or shape …
Characters will not appear on your screen as intended unless you have the appropriate font (that contains the appropriate glyph)
Characters are the basic unit of organization of encoded text.
A character is usually represented as a Unicode code point: an integer from 0 to 0x10FFFF (1,114,111). Values from 0 to 65535 (0xFFFF) cover the Basic Multilingual Plane; values above 65535 are supplementary code points.
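As a minimal sketch of the code point ranges described above, the following Java snippet classifies an int value using the standard `java.lang.Character` helpers. The class name `CodePointRange` and the method `classify` are hypothetical names chosen for this illustration.

```java
final class CodePointRange {
    // Classify an int as a BMP code point, a supplementary code point,
    // or a value outside the Unicode range (0 to 0x10FFFF).
    static String classify(int value) {
        if (!Character.isValidCodePoint(value)) {
            return "invalid";
        }
        return Character.isSupplementaryCodePoint(value) ? "supplementary" : "BMP";
    }

    public static void main(String[] args) {
        System.out.printf("max code point = U+%X%n", Character.MAX_CODE_POINT);
        System.out.println(classify(0x00F8));    // Latin small letter o with stroke
        System.out.println(classify(0x2F81A));   // CJK ideograph above 0xFFFF
        System.out.println(classify(0x110000));  // one past the last code point
    }
}
```

A value that fits in 16 bits is always in the BMP, which is why a single 16-bit unit cannot hold every code point.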
2 - Articles Related
3 - Example
A character can be any of the following:
- letters,
- numbers,
- symbols (mathematical),
- ideograms,
- logograms (from non-phonetic writing systems such as kanji),
- etc…
For example, the following character set appears in several code pages:
- 26 non-accented letters A through Z (A, B, C, … X, Y, Z)
- 26 non-accented letters a through z (a, b, c, … x, y, z)
- digits 0 through 9
- special characters: . , : ; ? ( ) ' “ / - _ & + % * = < >
4 - Type/Category
- Text - Non-printing Character (Tabulation, New Line, ...)
- Text - Control Characters
- …
5 - Management
5.1 - Typing
You can type a character directly from a keyboard, where each key represents a character according to the keyboard layout.
5.2 - Encoding, File Storage
5.3 - Show
5.3.1 - Bash
Problem: Which character is – ?
Steps:
- The character set is UTF-8, so the hexadecimal dump shows the UTF-8 encoding of the character.
echo $LANG
- The UTF-8 encoding of this character in hexadecimal is e2 80 93. It corresponds to the Unicode character U+2013 EN DASH. See Translation of a UTF-8 Multibyte sequence to Unicode - Example 2. The trailing 0a is the line feed (newline) that echo appends, not an end-of-file marker.
echo – | hexdump -C
00000000 e2 80 93 0a |....|
00000004
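The byte-to-code-point translation above can be cross-checked in Java. This is a minimal sketch, assuming the three bytes reported by hexdump (without the trailing newline); the class name `Utf8Decode` and the method `firstCodePoint` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Decode {
    // Decode a UTF-8 byte sequence and format its first code point as "U+XXXX".
    static String firstCodePoint(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return String.format("U+%04X", text.codePointAt(0));
    }

    public static void main(String[] args) {
        // The bytes e2 80 93 taken from the hexdump output.
        byte[] bytes = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};
        System.out.println(firstCodePoint(bytes)); // U+2013
    }
}
```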
5.3.2 - Javascript
The charCodeAt() method returns the UTF-16 code unit (an integer between 0 and 65535) at the given index.
console.log('ø'.charCodeAt(0)); // 248, the code unit for U+00F8
5.3.3 - Windows
6 - Java
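Character.toChars(int) converts a code point to its UTF-16 representation as a char array. Character.toChars(int)[0] is the first code unit, which is the whole character when the code point lies in the Basic Multilingual Plane.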
For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
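A short sketch of how these two calls behave for a supplementary code point follows. The class name `ToCharsDemo` and the helper `fromCodePoint` are hypothetical; the `Character` and `String` methods are standard Java.

```java
final class ToCharsDemo {
    // Build a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x2F81A);
        // A supplementary code point occupies two chars (a surrogate pair)...
        System.out.println(s.length());                      // 2
        // ...but still counts as a single code point.
        System.out.println(s.codePointCount(0, s.length())); // 1
        // The first char alone is only a high surrogate, not a full character.
        System.out.println(Character.isHighSurrogate(s.charAt(0)));
        // The int overload works on the whole code point.
        System.out.println(Character.isLetter(0x2F81A));
    }
}
```

This is why indexing with `[0]` only yields the complete character for code points up to 0xFFFF.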
6.1 - Diff
Two characters that look alike can be compared with a hex tool such as `hexdump` on Unix, which outputs their bytes as hexadecimal digits.
Problem:
- Are these two characters the same?
–
-
Steps:
- The character set is UTF-8, so the hexadecimal dump shows the UTF-8 encoding of each character.
echo $LANG
en_US.UTF-8
- The UTF-8 encoding of the first character in hexadecimal is e2 80 93. It corresponds to the Unicode character U+2013 EN DASH. See Translation of a UTF-8 Multibyte sequence to Unicode - Example 2. The trailing 0a is the line feed (newline) that echo appends, not an end-of-file marker.
echo – | hexdump -C
00000000 e2 80 93 0a |....|
00000004
- The UTF-8 encoding of the second character in hexadecimal is 2d. This is the Unicode character U+002D HYPHEN-MINUS.
echo - | hexdump -C
00000000 2d 0a |-.|
00000002
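The same comparison can be sketched in Java by printing the code point and Unicode name of each character. The class name `CharDiff` and the method `describe` are hypothetical; `Character.getName` is a standard method (Java 7 and later).

```java
final class CharDiff {
    // Describe the first code point of a string as "U+XXXX NAME".
    static String describe(String s) {
        int cp = s.codePointAt(0);
        return String.format("U+%04X %s", cp, Character.getName(cp));
    }

    public static void main(String[] args) {
        String enDash = "\u2013"; // written as an escape to avoid source-encoding issues
        String hyphen = "-";
        System.out.println(describe(enDash)); // U+2013 EN DASH
        System.out.println(describe(hyphen)); // U+002D HYPHEN-MINUS
        System.out.println(enDash.equals(hyphen)); // false: different code points
    }
}
```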
6.2 - Storage
Each character requires:
- ASCII: one byte of memory
- Unicode - UTF-8: one to four bytes, depending on the code point
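The storage sizes can be checked with a small Java sketch that encodes a single code point to UTF-8 and counts the bytes. The class name `Utf8Size` and the method `utf8Bytes` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Size {
    // Number of bytes the given code point occupies once encoded as UTF-8.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes(0x41));    // 'A', an ASCII letter: 1
        System.out.println(utf8Bytes(0xF8));    // o with stroke: 2
        System.out.println(utf8Bytes(0x2013));  // EN DASH: 3
        System.out.println(utf8Bytes(0x2F81A)); // supplementary CJK ideograph: 4
    }
}
```

ASCII characters keep their one-byte form in UTF-8, so plain ASCII text has the same size in both encodings.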