RelaxNG XML schema for MaCoCu monolingual corpora.
Root element of a MaCoCu corpus.
Obligatory ID of the corpus.
One web page.
e.g. macocu.mk.1
e.g. Агенција за разузнавање - АР
e.g. 2021-07-17
e.g. [('mk', 0.66), ('ru', 0.34)]
\[\('.{2,3}', \d\.\d+\)(, \('.{2,3}', \d\.\d+\))*\]
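The pattern above can be checked outside a schema validator. The following is a minimal Python sketch, assuming the pattern must match the whole attribute value (as schema patterns are normally anchored); the function name `parse_lang_distribution` is hypothetical and not part of the MaCoCu schema.

```python
import ast
import re

# Pattern copied verbatim from the schema text above.
LANG_DISTR_RE = re.compile(
    r"\[\('.{2,3}', \d\.\d+\)(, \('.{2,3}', \d\.\d+\))*\]"
)


def parse_lang_distribution(value):
    """Return a list of (language, share) tuples, or None if the value
    does not match the schema pattern. Hypothetical helper."""
    # fullmatch mirrors the implicit anchoring assumed for schema patterns.
    if LANG_DISTR_RE.fullmatch(value) is None:
        return None
    # The value is a Python-style literal, so literal_eval can read it safely.
    return ast.literal_eval(value)


pairs = parse_lang_distribution("[('mk', 0.66), ('ru', 0.34)]")
```

With the example value from the schema, `pairs` holds two tuples, one per language; a value such as `[]` or one with an integer share like `('mk', 1)` does not match the pattern and yields `None`.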
e.g. https://www.ia.mk
e.g. ia.mk
Fluency score, based on the language model, e.g. 0.947
One paragraph-like element.
e.g. macocu.mk.1
Heading
yes
The estimated language of the paragraph, e.g. mk. Note that
this attribute is missing for @quality="short", as the language here can't be
reliably estimated.
2
3
e.g. 0.947
Estimated quality of the paragraph based on a heuristic taking
into account multiple parameters (length, number of stopwords, link density etc.).
Paragraphs with "bad" quality are filtered out, i.e. they do not appear in the
corpus.
short paragraph
short
nearly good paragraph, but without punctuation
neargood_wo_punct
nearly good paragraph
neargood
good paragraph, but without punctuation
good_wo_punct
good paragraph
good
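The quality values listed above, together with the note that the language attribute is missing for short paragraphs, can be expressed as a small consistency check. This is a sketch only: the attribute name `lang` and the function name `check_paragraph_attrs` are assumptions, as only `quality` is named in the schema text.

```python
# Quality values listed in the schema; "bad" paragraphs are filtered out
# before release, so "bad" is not an expected value in the corpus.
ALLOWED_QUALITY = {
    "short",
    "neargood_wo_punct",
    "neargood",
    "good_wo_punct",
    "good",
}


def check_paragraph_attrs(attrs):
    """Return a list of problems found in one paragraph's attribute dict.
    Hypothetical helper; "lang" is an assumed attribute name."""
    problems = []
    quality = attrs.get("quality")
    if quality not in ALLOWED_QUALITY:
        problems.append(f"unexpected quality value: {quality!r}")
    # The schema notes that the language is not estimated for short paragraphs.
    if quality == "short" and "lang" in attrs:
        problems.append("language attribute present on a short paragraph")
    return problems


issues = check_paragraph_attrs({"quality": "good", "lang": "mk"})
```

An empty list means the attributes are consistent with the description above; the check does not replace validation against the schema itself.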
If present, indicates that the paragraph contains sensitive
information, e.g. e-mail addresses, phone numbers, IP addresses. This information
was identified with the Biroamer tool (https://github.com/bitextor/biroamer).
yes