What is XML / HTML Character Entity Encoding?

About

An entity in HTML is a string that represents a Unicode character.

In other words, an entity is an explicit, ASCII-only notation that can represent any Unicode character.

Encoding text in HTML means transforming characters into their entity notation.

Example

Complex Character: Phone

Example with the phone. This character has the Unicode value U+260E (decimal 9742).

Example:

  • To show a phone in an HTML document, you can write any of the following entity notations:
<ul>
  <li>&#x0260E; (hexadecimal)</li>
  <li>&#9742; (decimal)</li>
  <li>&phone; (name)</li>
</ul>
  • Each item will output the same phone character: ☎

Simple Character: letter A

This example shows that any simple character (i.e. a letter of the alphabet) can also be written as an entity.

Example with the letter A. This character has the Unicode value U+0041 (decimal 65).

Example:

  • To show the letter A in an HTML document, you can write any of the following notations:
<ul>
  <li>&#x41; (hexadecimal)</li>
  <li>&#65; (decimal)</li>
  <li>A (the letter A)</li>
</ul>
  • Each item will output the letter A.
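The numeric notations in the two examples above can be derived from the character itself. The following is a minimal sketch; entityNotations is a hypothetical helper name introduced here for illustration, not part of any library.

```javascript
// Hypothetical helper: returns the Unicode value and the two numeric
// entity notations for the first character (code point) of a string.
function entityNotations(char) {
  const codePoint = char.codePointAt(0);
  const hex = codePoint.toString(16).toUpperCase();
  return {
    unicode: 'U+' + hex.padStart(4, '0'),
    hexadecimal: `&#x${hex};`,
    decimal: `&#${codePoint};`,
  };
}

console.log(entityNotations('☎'));
// { unicode: 'U+260E', hexadecimal: '&#x260E;', decimal: '&#9742;' }
console.log(entityNotations('A'));
// { unicode: 'U+0041', hexadecimal: '&#x41;', decimal: '&#65;' }
```

The named notation (such as &phone;) cannot be computed this way: it requires a lookup table of names.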

Usage

Reserved Word Encoding

Entities are used to encode reserved XML/HTML characters, for instance when they appear in the value of an attribute.

For instance, the start character < and the end character > of an element tag should not be written directly in content, because the parser may read them as markup. They need to be replaced (i.e. encoded) by their entity notation.

For instance, the character > would be replaced by the entity &gt;
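As an illustration of this replacement, here is a minimal sketch that encodes four reserved characters with their named entities. The names RESERVED and escapeReserved are hypothetical, and the set of characters handled is an assumption of this sketch rather than a complete escaping routine.

```javascript
// Assumed minimal set of reserved characters and their named entities.
const RESERVED = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };

// Hypothetical helper: replaces each reserved character by its entity.
// The ampersand is handled in the same single pass, so the ampersands
// produced by the replacement are never encoded a second time.
function escapeReserved(text) {
  return text.replace(/[&<>"]/g, (char) => RESERVED[char]);
}

console.log(escapeReserved('a < b && "c" > d'));
// a &lt; b &amp;&amp; &quot;c&quot; &gt; d
```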

Complex Characters

Entities are also used to show complex / special characters that are not easily accessible from the keyboard.

Format

A character can be written in entity notation in three ways:

&name; <!-- name notation -->
&#dddd;  <!-- decimal notation -->
&#xhhhh; <!-- hexadecimal notation -->

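where:

  • name is the name of the entity (for instance gt), as listed in the named character references
  • dddd is the Unicode code point of the character in decimal
  • hhhh is the Unicode code point of the character in hexadecimal

Old browsers may not support every entity, but support in recent browsers is good.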
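To make the three notations concrete, the sketch below recognizes which notation a single reference uses. parseReference is a hypothetical helper; it only checks the syntax and does not verify that a name actually exists in the named character references.

```javascript
// Hypothetical helper: classifies one reference such as '&gt;' or '&#x3E;'.
// Returns null when the string is not a single, complete reference.
function parseReference(reference) {
  const pattern = /^&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));$/;
  const match = pattern.exec(reference);
  if (!match) return null;
  const [, hex, dec, name] = match;
  if (hex !== undefined) return { notation: 'hexadecimal', codePoint: parseInt(hex, 16) };
  if (dec !== undefined) return { notation: 'decimal', codePoint: parseInt(dec, 10) };
  return { notation: 'name', name };
}

console.log(parseReference('&gt;'));   // { notation: 'name', name: 'gt' }
console.log(parseReference('&#62;'));  // { notation: 'decimal', codePoint: 62 }
console.log(parseReference('&#x3E;')); // { notation: 'hexadecimal', codePoint: 62 }
console.log(parseReference('gt'));     // null
```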

List

This list is non-exhaustive; see the named character references for all names.

| Character Description | Entity (Name) | Unicode Decimal | Unicode Hex | Rendering |
|---|---|---|---|---|
| quotation mark = APL quote | &quot; | &#34; | &#x22; | |
| ampersand | &amp; | &#38; | &#x26; | & |
| less-than sign | &lt; | &#60; | &#x3C; | < |
| greater-than sign | &gt; | &#62; | &#x3E; | > |
| Latin capital ligature OE | &OElig; | &#338; | &#x152; | Œ |
| Latin small ligature oe | &oelig; | &#339; | &#x153; | œ |
| Latin capital letter S with caron | &Scaron; | &#352; | &#x160; | Š |
| Latin small letter s with caron | &scaron; | &#353; | &#x161; | š |
| Latin capital letter Y with diaeresis | &Yuml; | &#376; | &#x178; | Ÿ |
| modifier letter circumflex accent | &circ; | &#710; | &#x2C6; | ˆ |
| small tilde | &tilde; | &#732; | &#x2DC; | ˜ |
| en space | &ensp; | &#8194; | &#x2002; | |
| em space | &emsp; | &#8195; | &#x2003; | |
| thin space | &thinsp; | &#8201; | &#x2009; | |
| zero width non-joiner | &zwnj; | &#8204; | &#x200C; | |
| zero width joiner | &zwj; | &#8205; | &#x200D; | |
| left-to-right mark | &lrm; | &#8206; | &#x200E; | |
| right-to-left mark | &rlm; | &#8207; | &#x200F; | |
| en dash | &ndash; | &#8211; | &#x2013; | |
| em dash | &mdash; | &#8212; | &#x2014; | |
| left single quotation mark | &lsquo; | &#8216; | &#x2018; | |
| right single quotation mark | &rsquo; | &#8217; | &#x2019; | |
| single low-9 quotation mark | &sbquo; | &#8218; | &#x201A; | |
| left double quotation mark | &ldquo; | &#8220; | &#x201C; | |
| right double quotation mark | &rdquo; | &#8221; | &#x201D; | |
| double low-9 quotation mark | &bdquo; | &#8222; | &#x201E; | |
| dagger | &dagger; | &#8224; | &#x2020; | |
| double dagger | &Dagger; | &#8225; | &#x2021; | |
| per mille sign | &permil; | &#8240; | &#x2030; | |
| single left-pointing angle quotation mark | &lsaquo; | &#8249; | &#x2039; | |
| single right-pointing angle quotation mark | &rsaquo; | &#8250; | &#x203A; | |
| euro sign | &euro; | &#8364; | &#x20AC; | |

Glyphs for these characters are published by the Unicode Consortium and should already be available in every browser.

Function / Library

Function

Encode

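  • A function in JavaScript to encode text into decimal entities
function toEntities(text) {
    const entities = [];
    // for...of iterates by code point, so a character outside the
    // Basic Multilingual Plane (an emoji, for example) is encoded as
    // one entity and not as two invalid surrogate halves
    for (const char of text) {
        entities.push(`&#${char.codePointAt(0)};`);
    }
    return entities.join('');
}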
  • Function example
let reservedCharacters = `"><`;
let entities = toEntities(reservedCharacters);
console.log(`The reserved characters (${reservedCharacters}) in entity format are (${entities})`);
  • You can then also use them in an HTML attribute value, for instance in the title of an anchor
let anchorHTML = `<a href="#" title="${entities}">Anchor with entities</a> Keep your mouse on the link to see the title tooltip.`;
document.body.insertAdjacentHTML('afterbegin', anchorHTML);
  • Output: the console shows the entities, and the title tooltip of the anchor shows the reserved characters

Decode

When decoding, your function should take into account the three formats (name, decimal and hexadecimal).

The JavaScript function below shows an example for the decimal form only. It uses a basic regular expression replacement.

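function decodeDecimalEntity(text) {
  return text.replace(/&#(\d+);/g, function(match, dec) {
     // dec is the captured string of digits: convert it to a number first
     const codePoint = parseInt(dec, 10);
     // fromCodePoint handles code points above 0xFFFF (unlike fromCharCode)
     // but throws above 0x10FFFF, so out-of-range references are left as-is
     return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}
console.log(decodeDecimalEntity('&#62;')); // >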
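A decoder that covers the three formats could be sketched as follows. The NAMED table is deliberately tiny and is an assumption of this sketch (a real decoder needs the full list of named character references), and decodeEntities is a hypothetical function name.

```javascript
// Assumed, non-exhaustive name table used only for this illustration.
const NAMED = new Map([
  ['quot', '"'], ['amp', '&'], ['lt', '<'], ['gt', '>'], ['euro', '\u20AC'],
]);

// Hypothetical helper: decodes hexadecimal, decimal and (known) named
// references in a single pass. Unknown or out-of-range references are
// left untouched.
function decodeEntities(text) {
  const pattern = /&(?:#[xX]([0-9a-fA-F]+)|#(\d+)|(\w+));/g;
  return text.replace(pattern, (match, hex, dec, name) => {
    if (name !== undefined) {
      return NAMED.has(name) ? NAMED.get(name) : match;
    }
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeEntities('&lt;p&gt; &#9742; &#x260E; &euro; &unknown;'));
// <p> ☎ ☎ € &unknown;
```

Because the text is decoded in one pass, an input such as &amp;lt; becomes &lt; and is not decoded a second time.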

Pure Library (Encode/Decode)

Libraries already provide the encode/decode functions and may add extra functionality.

Library from Ascii to Entities

A library may also implement a mapping from an ASCII sequence of characters to an entity.

In a font, this kind of mapping is called a ligature.

For instance:

  • -- into en-dash entity
  • --- into em-dash entity
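The two dash mappings above can be sketched with plain string replacement. dashesToEntities is a hypothetical helper and not the API of any particular library.

```javascript
// Hypothetical helper: maps ASCII dash sequences to dash entities.
// The longest sequence is replaced first, so that --- is not read
// as -- followed by a single hyphen.
function dashesToEntities(text) {
  return text.replace(/---/g, '&mdash;').replace(/--/g, '&ndash;');
}

console.log(dashesToEntities('pages 10--20 --- see the index'));
// pages 10&ndash;20 &mdash; see the index
```

A single hyphen is left untouched, so ordinary hyphenated words are not affected.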

List:




