In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts.
See also: What is a bag of words model? (also known as a bag of tokens in NLP)
The XML schema for each dump is defined at the top of the file and is also described in the MediaWiki export help page.
https://en.wikipedia.org/w/api.php?action=query
&titles=SQL # The titles of the pages to query, separated by |
&format=xml # The exported format
&prop=description|categories # The properties exported
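
The annotated URL above can also be assembled programmatically. The following is a minimal sketch, assuming Python and only its standard library; the helper name `build_query_url` and the constant `API_ENDPOINT` are hypothetical names chosen for this illustration, and the parameters are limited to the ones listed above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(titles, props, fmt="xml"):
    """Build a query URL for the given page titles and properties.

    Hypothetical helper for illustration: multiple titles and
    properties are joined with "|", as in the annotated URL.
    """
    params = {
        "action": "query",
        "titles": "|".join(titles),  # page titles, separated by |
        "format": fmt,               # exported format
        "prop": "|".join(props),     # exported properties
    }
    # safe="|" keeps the separator readable instead of encoding it as %7C.
    return API_ENDPOINT + "?" + urlencode(params, safe="|")


url = build_query_url(["SQL"], ["description", "categories"])
print(url)
```

The sketch only builds the URL and sends no request. To fetch the result, the URL could be passed to an HTTP client such as `urllib.request.urlopen`.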