About
A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646].
Every unit of text (character) is assigned a unique integer known as a code point.
All the characters within a string share a common coding representation (i.e. a character set) that maps each code point to a sequence of bytes. A font then maps the character to a glyph (its visual representation).
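As a minimal sketch of the relationship between characters, code points and encodings, the snippet below uses Python's built-in `ord` and `str.encode`. The sample string is an assumption chosen for illustration; it is written with escapes so the source file encoding does not matter.

```python
# "A", "é" (U+00E9) and "€" (U+20AC), written with escapes on purpose.
text = "A\u00e9\u20ac"

# Each character is assigned one integer: its code point.
code_points = [ord(ch) for ch in text]

# An encoding (character set) maps code points to bytes.
utf8_bytes = text.encode("utf-8")               # 1 to 4 bytes per character
latin1_bytes = "A\u00e9".encode("latin-1")      # exactly 1 byte per character

print([f"U+{cp:04X}" for cp in code_points])
print(utf8_bytes)
print(latin1_bytes)
```

The same character "é" becomes two bytes in UTF-8 and one byte in Latin-1, which is why a string must be read with the encoding it was written in.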
In a programming language, text is represented as a character or a string.
Without an associated data schema (such as JavaScript, XML, …), text is said to be unstructured.
Text is the basis of any language:
- of natural language
Text editors also often use a text tree such as a rope (wiki/Rope_(data_structure)) to speed up text transformations.
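To illustrate why a tree helps, here is a minimal, unbalanced rope sketch. The names `Leaf`, `Concat`, `split`, `insert` and `to_str` are hypothetical and chosen for this example; a real editor would also rebalance the tree and cap the leaf size.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    text: str


@dataclass(frozen=True)
class Concat:
    left: "Rope"
    right: "Rope"
    length: int  # cached total length of both subtrees


Rope = Union[Leaf, Concat]


def length(rope: Rope) -> int:
    return len(rope.text) if isinstance(rope, Leaf) else rope.length


def concat(left: Rope, right: Rope) -> Rope:
    # Joining two ropes creates one node; no characters are copied.
    return Concat(left, right, length(left) + length(right))


def split(rope: Rope, index: int) -> Tuple[Rope, Rope]:
    # Return (first `index` characters, remaining characters).
    if isinstance(rope, Leaf):
        return Leaf(rope.text[:index]), Leaf(rope.text[index:])
    left_len = length(rope.left)
    if index <= left_len:
        head, tail = split(rope.left, index)
        return head, concat(tail, rope.right)
    head, tail = split(rope.right, index - left_len)
    return concat(rope.left, head), tail


def insert(rope: Rope, index: int, text: str) -> Rope:
    # Only the leaf containing `index` is cut; other leaves are reused.
    head, tail = split(rope, index)
    return concat(concat(head, Leaf(text)), tail)


def to_str(rope: Rope) -> str:
    if isinstance(rope, Leaf):
        return rope.text
    return to_str(rope.left) + to_str(rope.right)


doc = concat(Leaf("Hello, "), Leaf("world"))
doc = insert(doc, 7, "big ")
print(to_str(doc))
```

An insertion in a flat string copies every character after the insertion point, whereas here only one leaf is sliced and the rest of the tree is shared.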
Structure
Regular expressions define the structure of text.
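As a small sketch of this idea, the pattern below describes the structure of a date written as YYYY-MM-DD and extracts its parts with Python's `re` module. The pattern name `DATE` and the sample sentence are assumptions made for this example.

```python
import re

# Hypothetical pattern: four digits, a dash, two digits, a dash, two digits.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

match = DATE.search("Released on 2024-03-15 in Berlin")

# Named groups turn the matched text into structured fields.
parts = match.groupdict() if match else {}
print(parts)
```

The pattern only checks the shape of the text; it does not validate that the month or day is in range.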
Attack
Many different characters look alike, and this resemblance can be exploited in an attack. See Characters - Homograph
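The following sketch shows how two strings can look the same while containing different characters. The example word and the `scripts` helper are hypothetical; taking the first word of the Unicode character name is only a rough heuristic for the script, not a complete confusable check.

```python
import unicodedata

latin = "paypal"
# U+0430 CYRILLIC SMALL LETTER A replaces both Latin "a" characters.
spoofed = "p\u0430yp\u0430l"


def scripts(text):
    # Rough heuristic: the first word of a character name, e.g. "LATIN".
    return {unicodedata.name(ch).split()[0] for ch in text}


print(latin == spoofed)          # the strings differ despite looking alike
print(sorted(scripts(latin)))
print(sorted(scripts(spoofed)))  # mixed scripts are a warning sign
```

A string that mixes scripts inside a single word is not always malicious, but it is a reasonable signal to flag for review.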
Operation
Text seems easy at first glance, but it is not.
Below are a few common text operations:
- Code Page/Character Set Conversion: Convert text data to or from a code page.
- Collation: Compare strings according to the conventions and standards of a particular language, region, or country.
- Formatting: Format numbers, dates, times, and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc.
- Bidi (Bidirectionality): Support for handling text containing a mixture of left-to-right (English) and right-to-left (Arabic or Hebrew) data.
- Text Boundaries: Locate the positions of words, sentences, and paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.
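A few of the operations above can be sketched with the Python standard library alone. This is an illustration under simplifying assumptions, not a full implementation: the sample strings are invented, the sort uses plain code point order to show why collation is needed, and the `\w+` pattern only approximates word boundaries.

```python
import re
import unicodedata

# Code page conversion: bytes in Windows-1252 re-encoded as UTF-8.
cp1252_bytes = b"caf\xe9"
text = cp1252_bytes.decode("cp1252")
utf8_bytes = text.encode("utf-8")

# Collation: plain code point order puts "Ä" (U+00C4) after "z",
# which is not what a German or English reader expects.
naive_order = sorted(["zebra", "\u00c4pfel", "apple"])

# Bidi: Unicode assigns each character a bidirectional class.
# Latin "a", Hebrew alef (U+05D0), Arabic alef (U+0627).
bidi_classes = [unicodedata.bidirectional(ch) for ch in "a\u05d0\u0627"]

# Text boundaries (rough): start and end offsets of each word.
words = [(m.start(), m.end()) for m in re.finditer(r"\w+", "Text is hard.")]

print(utf8_bytes)
print(naive_order)
print(bidi_classes)
print(words)
```

Locale-aware collation, formatting and boundary analysis generally need locale data, which is why these operations are usually delegated to a dedicated internationalization library.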