HTML - (Document) Parser
Parsing
HTML documents consist of a tree of elements and text. The specification defines the set of elements that can be used in HTML, along with rules about the ways in which those elements can be nested.
If a document is transmitted with the text/html MIME type, Web browsers process it as an HTML document.
HTML user agents (e.g. Web browsers) parse the markup, turning it into a DOM (Document Object Model) tree.
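As a rough illustration of markup becoming a tree, here is a minimal sketch using Python's standard html.parser module. It is not the tree construction algorithm that browsers follow, and it does none of the specification's error recovery. The names Node, TreeBuilder, outline and parse_to_outline are hypothetical helpers invented for this example, and the VOID_ELEMENTS set is a deliberately incomplete assumption.

```python
from html.parser import HTMLParser

# Assumption: a small subset of void elements, enough for this sketch.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A very small stand-in for a DOM node (hypothetical, not the real DOM API)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from start tag, end tag and text events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        # Void elements have no content, so they never become the open element.
        if tag not in VOID_ELEMENTS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text: " + text, self.current))


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


def parse_to_outline(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return outline(builder.root)


if __name__ == "__main__":
    print("\n".join(parse_to_outline("<html><body><p>Hello <b>world</b></p></body></html>")))
```

Each start tag opens a child of the currently open element, and each end tag closes the nearest matching open element, which is the basic idea behind turning a flat stream of markup into a tree.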
HTML documents represent a media-independent description of interactive content: they might be rendered to a screen, through a speech synthesizer, or on a braille display. By using a styling language such as CSS, authors can influence that rendering.
Parsing of HTML files happens asynchronously and incrementally, meaning that the parser can pause at any point to let scripts run.
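The incremental aspect can be sketched with Python's standard html.parser, whose feed() method accepts markup in pieces and buffers an incomplete tag until more input arrives. This only illustrates chunk-by-chunk parsing; it does not model a browser pausing for script execution. EventLogger and parse_in_chunks are hypothetical names for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records parser events in the order they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_in_chunks(chunks):
    """Feed the chunks one by one and note how many events exist after each."""
    parser = EventLogger()
    events_seen_after_chunk = []
    for chunk in chunks:
        parser.feed(chunk)  # parses what it can, keeps any unfinished tag buffered
        events_seen_after_chunk.append(len(parser.events))
    parser.close()  # signals end of input so buffered data is flushed
    return parser.events, events_seen_after_chunk


if __name__ == "__main__":
    # The chunk boundaries deliberately fall inside a word and inside a tag.
    events, seen = parse_in_chunks(["<ul><li>on", "e</li><li", ">two</li></ul>"])
    for event in events:
        print(event)
    print(seen)
```

Events are reported as soon as enough input is available, so work can happen between chunks. Note that a text run split across chunks may be reported as several data events.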
Regexp
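You can't reliably parse [X]HTML with regex: elements nest to arbitrary depth, and a regular expression cannot keep track of which closing tag belongs to which opening tag. Use an HTML parser instead.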
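A small sketch of the nesting problem, assuming the task is to extract the text of the outermost div. The non-greedy pattern stops at the first closing tag it meets, while a parser that counts open elements handles the nesting. MARKUP, regex_first_div, OuterDivText and parser_outer_div_text are hypothetical names for this example.

```python
import re
from html.parser import HTMLParser

MARKUP = "<div>outer <div>inner</div> tail</div>"


def regex_first_div(markup):
    """Naive attempt: capture everything between <div> and the next </div>."""
    match = re.search(r"<div>(.*?)</div>", markup, re.DOTALL)
    return match.group(1) if match else None


class OuterDivText(HTMLParser):
    """Collects all text found inside any div by tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def parser_outer_div_text(markup):
    parser = OuterDivText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(repr(regex_first_div(MARKUP)))        # 'outer <div>inner'
    print(repr(parser_outer_div_text(MARKUP)))  # 'outer inner tail'
```

A greedy pattern fails the other way round: with two sibling div elements it would swallow everything from the first opening tag to the last closing tag.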
Library
- Java: https://www.attoparser.org/, used by Thymeleaf
Documentation / Reference
- DOM Parsing and Serialization, T. Leithead. Work in Progress. W3C.
- Rules for parsing numbers (integers, floats), percentages and lengths, lists, dates, times, time zones, durations, colours, tokens…