HTML documents consist of a tree of elements and text. The HTML specification defines the set of elements that can be used, along with rules about the ways in which those elements can be nested.
If a document is transmitted with the text/html MIME type, Web browsers process it as an HTML document.
HTML user agents (e.g. Web browsers) parse the markup, turning it into a DOM (Document Object Model) tree.
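As a rough illustration of turning markup into a tree, here is a minimal sketch built on Python's standard `html.parser` module. It is not a conforming HTML parser and does not produce a real DOM: the `Node`, `TreeBuilder`, `parse_to_tree` and `dump` names are hypothetical helpers invented for this example, and error recovery is reduced to ignoring unmatched end tags.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they must not stay on the stack.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class Node:
    """Hypothetical, simplified stand-in for a DOM node."""

    def __init__(self, name, attrs=None, text=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree by keeping a stack of currently open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data))


def parse_to_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Returns the tree as a list of indented lines, one per node."""
    label = repr(node.text) if node.name == "#text" else node.name
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(dump(parse_to_tree("<p>Hello <b>world</b><br>!</p>"))))
```

The stack of open elements is the key idea: each start tag becomes a child of the element on top of the stack, and each end tag pops back to its matching element.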
HTML documents represent a media-independent description of interactive content: they may be rendered to a screen, through a speech synthesizer, or on a braille display. Authors can influence exactly how that rendering takes place with a styling language such as CSS.
Parsing of HTML files happens asynchronously and incrementally, meaning that the parser can pause at any point to let scripts run.
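Incremental parsing can be sketched with Python's standard `html.parser`, whose `feed()` method accepts markup in arbitrary chunks and holds back an incomplete tag until more input arrives. This only illustrates the incremental part; it does not run scripts and is not how a browser parser is implemented. The `TagLogger` class and `parse_in_chunks` function are hypothetical names used for the example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start and end tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_in_chunks(chunks):
    """Feeds chunks one at a time and records how many events exist after each."""
    parser = TagLogger()
    snapshots = []
    for chunk in chunks:
        parser.feed(chunk)  # other work could happen between two feed() calls
        snapshots.append(len(parser.events))
    parser.close()
    return parser.events, snapshots


if __name__ == "__main__":
    # The second chunk ends in the middle of a tag ("<l" + "i>").
    events, snapshots = parse_in_chunks(["<ul><li>on", "e</li><l", "i>two</li></ul>"])
    print(events)
    print(snapshots)
```

Between two `feed()` calls the caller is free to do other work, which is the same property that lets a browser pause parsing while a script runs.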
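You can't reliably parse [X]HTML with regular expressions: elements nest to arbitrary depth, which a regular expression cannot track in general, and real-world markup contains omitted tags, comments, and quoted attribute values that defeat simple pattern matching. Use a dedicated parser library instead:
- Java: https://www.attoparser.org/, used by Thymeleaf
- Node.js: https://parse5.js.org/, used by jsdom, rehype and many more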
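To make the regex problem concrete, the sketch below compares a naive pattern with a small parser-based extractor on nested `<div>` elements. It uses Python's standard `html.parser` rather than the libraries listed above, and `NAIVE_DIV`, `OuterDivText` and `outer_div_text` are hypothetical names for this example only.

```python
import re
from html.parser import HTMLParser

# A typical naive attempt: grab whatever sits between <div> and </div>.
NAIVE_DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)


class OuterDivText(HTMLParser):
    """Collects all text found inside <div> elements by tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def outer_div_text(markup):
    parser = OuterDivText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    markup = "<div>a<div>b</div>c</div>"
    # The regex stops at the first </div> and returns a broken fragment.
    print(NAIVE_DIV.search(markup).group(1))  # a<div>b
    # The parser keeps count of open elements and sees the whole content.
    print(outer_div_text(markup))  # abc
```

The regex has no notion of which `</div>` closes which `<div>`, whereas the parser's depth counter does; that bookkeeping is exactly what a regular expression cannot express for arbitrary nesting.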
Documentation / Reference
- DOM Parsing and Serialization, T. Leithead. Work in Progress. W3C.
- Rules for parsing integers, floating-point numbers, percentages and lengths, lists, dates, times, time-zone offsets, durations, colours, tokens…
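As an example of what such micro-syntax rules look like, here is a sketch of the signed-integer case as described in the HTML Standard: skip leading ASCII whitespace, accept an optional sign, read a run of ASCII digits, and ignore anything after them. The function name `parse_html_integer` and the use of `None` to signal an error are choices made for this example, not part of the specification.

```python
ASCII_WHITESPACE = "\t\n\f\r "
ASCII_DIGITS = "0123456789"


def parse_html_integer(value):
    """Returns the parsed integer, or None where the rules report an error."""
    pos = 0
    end = len(value)

    # Skip leading ASCII whitespace.
    while pos < end and value[pos] in ASCII_WHITESPACE:
        pos += 1
    if pos >= end:
        return None

    # Optional sign: "-" negates, "+" is accepted and ignored.
    sign = 1
    if value[pos] == "-":
        sign = -1
        pos += 1
    elif value[pos] == "+":
        pos += 1

    # At least one ASCII digit is required.
    if pos >= end or value[pos] not in ASCII_DIGITS:
        return None

    # Collect the run of digits; trailing characters are ignored.
    start = pos
    while pos < end and value[pos] in ASCII_DIGITS:
        pos += 1
    return sign * int(value[start:pos])


if __name__ == "__main__":
    print(parse_html_integer("  42px"))  # 42
    print(parse_html_integer("-7"))      # -7
    print(parse_html_integer("abc"))     # None
```

Note how lenient the rules are: trailing garbage such as `px` is silently dropped, which is why attribute values like `width="42px"` still yield a number in browsers.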