[NTG-context] ignore not closed tags in XML input

Taco Hoekwater taco at bittext.nl
Mon May 16 20:13:34 CEST 2022



> On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context <ntg-context at ntg.nl> wrote:
> 
> On 5/16/22 17:30, Hans van der Meer via ntg-context wrote:
>> Can't you use an editor with grep, searching for something like the
>> pattern <meta.*^/>?
> 
> Many thanks for your reply, dr. van der Meer.
> 
> If I want to typeset the whole book
> (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will
> have to download and sanitize over 20 HTML files.

Which can be done with a couple of command lines. Xmllint usually does a good
job of cleaning up dodgy html input:

  xmllint --html --xmlout crappy.html > nice.xml

(As good as can be expected from a program, anyway).

> It is really a pity that ConTeXt cannot totally ignore any given XML elements.

This statement is a little unfair: the problem is exactly that your input is NOT proper XML.

If it were proper XML, ConTeXt would have no problems with it. ConTeXt explicitly has
the capability to handle XML files, which your input simply is not. In fact, it is
sloppy HTML-esque data that modern web browsers happen to be able to handle more or less
correctly. It is not valid HTML either, because valid HTML has to be valid SGML, which your
input clearly is not.

That said, tools like xmllint exist for exactly this. Just write a small batch driver file in
some scripting language ((power)shell, Lua, Python, Perl, etc.) to preprocess the HTML
into clean XML, and you should be fine.
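
A minimal sketch of such a driver as a POSIX shell script, assuming xmllint is on
the PATH and the input files end in .html. The function names (xml_name_for,
convert_all) and the script name used below are made up for illustration; only the
xmllint options come from the command line above.

```sh
#!/bin/sh
# Hypothetical batch driver: run each HTML file named on the command line
# through xmllint and write the result next to it with an .xml extension.

# Map an input path to its output path: chapter1.html -> chapter1.xml
xml_name_for() {
    printf '%s\n' "${1%.html}.xml"
}

# Convert every file in turn. A non-zero exit status from xmllint is only
# reported, not treated as fatal, so one bad file does not stop the batch.
convert_all() {
    for html in "$@"; do
        xml=$(xml_name_for "$html")
        # --html: parse with the lenient HTML parser; --xmlout: serialize as XML
        if xmllint --html --xmlout "$html" > "$xml"; then
            echo "converted: $html -> $xml"
        else
            echo "xmllint reported a problem: $html (inspect $xml)" >&2
        fi
    done
}

convert_all "$@"
```

Saved as, say, html2xml.sh, it would be run as "sh html2xml.sh *.html" in the
directory holding the downloaded pages. xmllint tends to print many parser
complaints for sloppy input; those go to the terminal and not into the .xml files.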

Taco

-- 
Taco Hoekwater              E: taco at bittext.nl
genderfluid (all pronouns)




