[NTG-context] ignore not closed tags in XML input

Pablo Rodriguez oinos at gmx.es
Tue May 17 18:36:32 CEST 2022


On 5/16/22 20:13, Taco Hoekwater via ntg-context wrote:
>> On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context <ntg-context at ntg.nl> wrote:
>> [...]
>> If I want to typeset the whole book
>> (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will
>> have to download and sanitize over 20 HTML files.
>
> Which can be done with a couple of command lines. Xmllint usually does a good
> job of cleaning up dodgy html input:
>
>   xmllint --html --xmlout <crappy.html> > <nice.xml>

Many thanks for your reply, Taco.

Since I have to download the site recursively (with "wget -r"), I hope I
can find a way to pipe the output and get everything in a single invocation.
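
Something along these lines might do it. This is only a minimal sketch in
Python, assuming the site has already been mirrored into a local directory
with "wget -r" and that xmllint is on the PATH. The function names and the
directory layout are made up for illustration; the xmllint options are the
ones from the command quoted above.

```python
import subprocess
from pathlib import Path


def xmllint_command(html_file):
    """Build the xmllint call from the quoted command for one HTML file."""
    return ["xmllint", "--html", "--xmlout", str(html_file)]


def xml_target(html_file, src_root, dst_root):
    """Map e.g. site/a/b.html to clean/a/b.xml (hypothetical layout)."""
    relative = Path(html_file).relative_to(src_root)
    return Path(dst_root) / relative.with_suffix(".xml")


def sanitize_tree(src_root, dst_root):
    """Run xmllint over every *.html file found below src_root."""
    written = []
    for html_file in sorted(Path(src_root).rglob("*.html")):
        target = xml_target(html_file, src_root, dst_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "wb") as out:
            # xmllint reports HTML parser complaints on stderr; the exit
            # status is deliberately not checked here, so that one bad
            # file does not stop the whole run.
            subprocess.run(xmllint_command(html_file), stdout=out, check=False)
        written.append(target)
    return written
```

Calling sanitize_tree("mirror", "clean-xml") would then write one .xml file
per downloaded page, where both directory names are placeholders.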

>> It is really a pity that ConTeXt cannot totally ignore any given XML elements.
>
> This statement is a little unfair: the problem is exactly that your input is NOT proper XML.

My apologies. I really think ConTeXt rocks.

I wanted to write an introduction on how to typeset XML sources with
ConTeXt (at least, in Spanish).

One of the main issues I face is finding examples.

It seemed natural to me to use texts published as HTML. But it turned out
to be way trickier than I first thought.

Texts published as HTML could be eye candy for some potentially interested
people. But if one has to add a web crawler plus an XML sanitizer to the
dependencies, this makes it way harder (even for myself).

> If it was proper XML, ConTeXt would not have problems with it. ConTeXt explicitly has
> the capability to handle XML files, which your input simply is not. In fact, it is
> sloppy HTML-esque data that modern webbrowsers happen to be able to handle more or less
> correctly. It is not valid HTML either, because valid HTML has to be valid SGML, which your
> input clearly is not.

I agree my input isn’t proper XML, but it is valid SGML. One of the main
differences between the two is that SGML allows unclosed tags.

This is why cases such as this one are corner-cases:
https://validator.w3.org/nu/?doc=https%3A%2F%2Fseumasjeltzz.github.io%2FLinguaeGraecaePerSeIllustrata%2F.

Since I considered this a corner-case, I thought that a command such as
\xmlignore{#1}{head/(meta|link)} would make sense.
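
Outside ConTeXt, the same corner case could be handled by closing the
unclosed elements before the file is parsed as XML. The following is a
minimal sketch using html.parser from the Python standard library. The class
name, the function name and the short list of empty elements are assumptions
for illustration, and comments and the doctype are simply dropped.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that HTML leaves unclosed (an illustrative, incomplete list).
VOID = {"meta", "link", "br", "img", "hr", "input"}


class VoidCloser(HTMLParser):
    """Re-serialize HTML so that unclosed empty elements become <tag/>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _open(self, tag, attrs, close):
        # Attributes without a value (e.g. "hidden") get an empty string.
        text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        self.parts.append(f"<{tag}{text}{'/' if close else ''}>")

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, tag in VOID)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, True)

    def handle_endtag(self, tag):
        # A stray </meta> or </br> is skipped; the element is already closed.
        if tag not in VOID:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data))


def close_void_tags(html):
    """Return html with the elements listed in VOID written as self-closed."""
    parser = VoidCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

This only deals with the unclosed head elements discussed here; it does not
repair any other problem a real HTML page may have.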

> That said, Tools like xmllint exist for this stuff. Just write a small batch driver file in
> some scripting language ((power)shell, lua, python, perl, etc.) to preprocess the HTML
> stuff into clean XML, and you should be fine.

Many thanks for your reply again.

Maybe all XML handling is way more complex than I originally thought.

Many thanks for your help,

Pablo

