On 5/16/22 20:13, Taco Hoekwater via ntg-context wrote:
On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context wrote:
[...] If I want to typeset the whole book (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will have to download and sanitize over 20 HTML files.
Which can be done with a couple of command lines. Xmllint usually does a good job of cleaning up dodgy html input:
xmllint --html --xmlout
Many thanks for your reply, Taco. Since I have to recursively download the site (with "wget -r"), I hope I can find a way to pipe and get all in a single invocation.
It is really a pity that ConTeXt cannot totally ignore any given XML elements.
This statement is a little unfair: the problem is exactly that your input is NOT proper XML.
My apologies. I really think ConTeXt rocks. I wanted to write an introduction on how to typeset XML sources with ConTeXt (at least, in Spanish). One of the main issues I face is finding examples. It seemed natural to me to use texts edited in HTML. But it turned out to be way trickier than I first thought. Texts edited in HTML could be eye candy for some potentially interested people. But if one has to add a web crawler plus an XML sanitizer to the dependencies, this makes it way harder (even for myself).
If it was proper XML, ConTeXt would not have problems with it. ConTeXt explicitly has the capability to handle XML files, which your input simply is not. In fact, it is sloppy HTML-esque data that modern web browsers happen to be able to handle more or less correctly. It is not valid HTML either, because valid HTML has to be valid SGML, which your input clearly is not.
I agree my input isn’t proper XML, but it is valid SGML. One of the main differences between the two is that SGML allows unclosed tags. This is why cases such as this one are corner cases: https://validator.w3.org/nu/?doc=https%3A%2F%2Fseumasjeltzz.github.io%2FLing.... Since I considered this a corner case, I thought that a command such as \xmlignore{#1}{head/(meta|link)} would make sense.
That said, tools like xmllint exist for this stuff. Just write a small batch driver file in some scripting language ((power)shell, lua, python, perl, etc.) to preprocess the HTML stuff into clean XML, and you should be fine.
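The batch driver suggested above might look like the following minimal Python sketch. It is only an illustration under stated assumptions, not something posted in the thread: the function names are made up, the HTML files are assumed to already sit in a local directory tree (for example one filled by "wget -r"), and xmllint is assumed to be on the PATH. The only xmllint options used are --html and --xmlout from the message above, plus --output to name the result file.

```python
import subprocess
from pathlib import Path


def xmllint_command(src: Path, dst: Path) -> list[str]:
    """Build the xmllint call that parses `src` as HTML and writes XML to `dst`."""
    # --html: parse with the lenient HTML parser; --xmlout: serialize as XML.
    return ["xmllint", "--html", "--xmlout", "--output", str(dst), str(src)]


def target_path(src: Path, in_root: Path, out_root: Path) -> Path:
    """Mirror the location of `src` under `out_root`, with an .xml suffix."""
    return (out_root / src.relative_to(in_root)).with_suffix(".xml")


def sanitize_tree(in_root: Path, out_root: Path) -> list[Path]:
    """Convert every .html file below `in_root`; return the files written."""
    written = []
    for src in sorted(in_root.rglob("*.html")):
        dst = target_path(src, in_root, out_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Sloppy HTML can make xmllint print parser messages. This sketch does
        # not treat a non-zero exit status as fatal (an assumption made for
        # simplicity) and only checks that an output file was produced.
        subprocess.run(xmllint_command(src, dst), check=False)
        if dst.exists():
            written.append(dst)
    return written
```

With hypothetical directory names, sanitize_tree(Path("mirror"), Path("clean")) would leave one .xml file per downloaded .html file under "clean", which ConTeXt could then read as regular XML input.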
Many thanks for your reply again. Maybe all XML handling is way more complex than I originally thought. Many thanks for your help, Pablo