Text - Double Byte Character Set

About

A Double Byte Character Set (DBCS) is a character set in which a character can be represented by a pair of code points (two bytes) instead of a single one.

With a DBCS, code must treat each such pair of code points as one character.

A DBCS supports national languages that contain a large number of unique characters or symbols, more than the 256 characters that a single byte can represent.

When programming, be aware that a set of code points is reserved to represent the first byte of a pair. These code points are not valid on their own: they only have a meaning when they are immediately followed by a defined second byte.
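The first-byte rule can be sketched in a few lines of Python. This is a minimal illustration, not a complete decoder: the function names are hypothetical, and the reserved first-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are an assumption taken from Shift_JIS / code page 932; other double-byte code pages reserve different ranges.

```python
# Hypothetical helpers for illustration only.
# Assumption: first-byte ranges of Shift_JIS / code page 932.

def is_lead_byte(b: int) -> bool:
    """Return True if b is reserved as the first byte of a two-byte character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_characters(data: bytes) -> list:
    """Split a byte string into per-character byte sequences (1 or 2 bytes each)."""
    chars = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            # A first byte is only meaningful together with the byte that follows it.
            if i + 1 >= len(data):
                raise ValueError("first byte at end of data without a second byte")
            chars.append(data[i:i + 2])
            i += 2
        else:
            # Any other byte is a complete single-byte character.
            chars.append(data[i:i + 1])
            i += 1
    return chars


data = "A日本".encode("shift_jis")
chars = split_characters(data)
print(len(data), len(chars))  # 5 bytes, 3 characters
```

The sketch does not check that the second byte is itself a defined value; a real decoder would also validate that.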

Example

Examples of such languages include:

  • Japanese,
  • Korean,
  • and Chinese.

Each character of these languages is represented by a pair of code points (thus double-byte). Programs written for single-byte code pages therefore won't work for these languages. A set of code points used for Japanese is called a double-byte code page, and a Japanese font character set is called a double-byte character set (DBCS).
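To illustrate why single-byte assumptions break, the following sketch compares character and byte counts and cuts a byte string at an arbitrary byte offset. It assumes Python's built-in shift_jis codec as the double-byte code page; the helper name truncates_cleanly is hypothetical.

```python
text = "日本語"
encoded = text.encode("shift_jis")

char_count = len(text)     # number of characters
byte_count = len(encoded)  # number of bytes: two per character here


def truncates_cleanly(data: bytes, limit: int) -> bool:
    """Return True if cutting data after `limit` bytes still decodes as Shift_JIS."""
    try:
        data[:limit].decode("shift_jis")
        return True
    except UnicodeDecodeError:
        # The cut fell between the two bytes of one character.
        return False


print(char_count, byte_count)          # 3 6
print(truncates_cleanly(encoded, 3))   # False: the second character is cut in half
print(truncates_cleanly(encoded, 4))   # True: the cut falls on a character boundary
```

A program that equates one byte with one character would report the wrong length and could split a character when truncating or wrapping text.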

History

Windows code page 1253 provides the character codes required by the Greek writing system, and code page 1252 provides the characters for Western European Latin writing systems, including English, German and French.

In these code pages, it is the upper 128 code points (128 to 255) that contain either:

  • the accent characters
  • or the Greek characters.

Thus you cannot store Greek and German in the same byte stream unless you add some type of identifier that indicates which code page each part refers to.
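As a small illustration, the same byte can be decoded with Python's built-in cp1252 (Western European) and cp1253 (Greek) codecs. The tagged list at the end is only a hypothetical way of attaching a code page identifier to each run of bytes, not an established format.

```python
raw = bytes([0xE4])  # one byte from the upper 128 code points

as_western = raw.decode("cp1252")  # 'ä'
as_greek = raw.decode("cp1253")    # 'δ'
print(as_western, as_greek)

# The lower 128 code points decode identically in both code pages.
assert b"A".decode("cp1252") == b"A".decode("cp1253")

# Hypothetical tagging scheme: each chunk carries the name of its code page.
tagged = [
    ("cp1252", "Grüße ".encode("cp1252")),
    ("cp1253", "Ελλάδα".encode("cp1253")),
]
text = "".join(chunk.decode(codepage) for codepage, chunk in tagged)
print(text)  # Grüße Ελλάδα
```

Without the tag, a reader of the stream has no way to know whether 0xE4 was meant as a German or a Greek letter.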

Asian languages far exceed the 256-character limit imposed by a single byte. Japanese, for example, uses about 2000 kanji for everyday purposes, more kanji for special vocabularies, two phonetic syllabaries, Latin alphabetic characters, Arabic numerals, and both Japanese and Western punctuation marks.

A different scheme was needed, but it had to build on the concept of 256-character code pages. Thus DBCS (Double Byte Character Sets) were born.




