HTML - (Document) Parser

Parsing

HTML documents consist of a tree of elements and text. The specification defines the set of elements that can be used in HTML, along with rules about the ways in which those elements can be nested.

If a document is transmitted with the text/html MIME type, Web browsers process it as an HTML document.

HTML user agents (e.g. Web browsers) parse the markup, turning it into a DOM (Document Object Model) tree.
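
As a rough illustration of turning markup into a tree, here is a minimal sketch built on Python's standard html.parser module. It is not a spec-conforming HTML5 tree builder and does not produce a real DOM. The names Node, TreeBuilder, outline and parse_to_outline are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical, very small stand-in for a DOM node."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element/text tree from parser events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name.
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:  # stray end tags are ignored
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text: " + text, self.current))


def outline(node, depth=0):
    """Return the tree as a list of indented lines."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


def parse_to_outline(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return outline(builder.root)


if __name__ == "__main__":
    print("\n".join(parse_to_outline("<p>Hello <b>world</b><br>bye</p>")))
```

A browser's parser does far more (error recovery, implied elements such as html and body, script handling), so this only shows the general idea of elements and text forming a tree.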

HTML documents represent a media-independent description of interactive content. They might be rendered to a screen, through a speech synthesizer, or on a braille display. Authors can influence how that rendering takes place by using a styling language such as CSS.

Parsing of HTML files happens asynchronously and incrementally, meaning that the parser can pause at any point to let scripts run.
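
Incremental parsing can be sketched with Python's html.parser, which accepts markup in pieces through repeated feed() calls. This is only an analogy for how a browser consumes a document as it arrives over the network, not a model of script execution. EventLogger and feed_in_chunks are hypothetical names for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records parser events in the order they are emitted."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def feed_in_chunks(chunks):
    """Feed markup piece by piece; return events per chunk and all events."""
    parser = EventLogger()
    per_chunk = []
    for chunk in chunks:
        before = len(parser.events)
        parser.feed(chunk)  # the parser keeps any incomplete tag buffered
        per_chunk.append(parser.events[before:])
        # Between feed() calls the caller is free to do other work,
        # loosely comparable to a browser pausing to run a script.
    parser.close()  # flush whatever is still buffered
    return per_chunk, parser.events


if __name__ == "__main__":
    per_chunk, events = feed_in_chunks(["<p>He", "llo</", "p>"])
    for index, chunk_events in enumerate(per_chunk):
        print(index, chunk_events)
```

The start tag is reported as soon as the first chunk is fed, even though the rest of the document has not arrived yet.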

Regexp

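You can't reliably parse [X]HTML with a regular expression. Elements nest to arbitrary depth, and matching balanced opening and closing tags is beyond what regular expressions are designed for. Use an HTML parser instead.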
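
A small sketch of the nesting problem, assuming Python's re and html.parser modules. The markup, the pattern, and the names regex_first_div, DepthTracker and outer_div_text are made up for illustration.

```python
import re
from html.parser import HTMLParser

MARKUP = "<div>outer <div>inner</div> tail</div>"


def regex_first_div(markup):
    """Naive attempt: grab the content of the first <div> with a regex."""
    match = re.search(r"<div>(.*?)</div>", markup, re.DOTALL)
    return match.group(1) if match else None


class DepthTracker(HTMLParser):
    """Tracks <div> nesting depth and collects text directly in the outer div."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outer_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 1:  # text that belongs to the outermost div only
            self.outer_text.append(data)


def outer_div_text(markup):
    tracker = DepthTracker()
    tracker.feed(markup)
    tracker.close()
    return "".join(tracker.outer_text)


if __name__ == "__main__":
    # The regex stops at the first </div>, which closes the inner element.
    print(repr(regex_first_div(MARKUP)))
    # The parser knows which end tag closes which element.
    print(repr(outer_div_text(MARKUP)))
```

The non-greedy pattern ends at the inner closing tag, and a greedy one would overshoot as soon as two sibling div elements appear, so neither variant tracks nesting the way a parser does.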

Library

Java: https://www.attoparser.org/ used by Thymeleaf
Node.js: https://parse5.js.org/ used by jsdom, rehype and many more

Documentation / Reference

HTML 5 - Parsing HTML documents, Parsing Model
Parse error
DOM Parsing and Serialization, T. Leithead. Work in Progress. W3C.
Rules for parsing numbers (integers, floats, percentages and lengths), lists, dates, times, time zones, durations, colours, tokens…