ignore not closed tags in XML input

Pablo Rodriguez

16 May 2022

5:08 p.m.

Dear list,

I would like to feed https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/001.html as XML input for ConTeXt.

The problem is that (as in many other XML files that I haven’t generated myself) some <meta> and <link> tags aren’t closed, such as in:

<meta charset="utf-8">
<link href="https://fonts/css?greek" rel="stylesheet">
<link href="style.css" rel="stylesheet">

So all I get is the following message:

invalid xml file - parsed text

I have tried the following, without success:

\xmlsetsetup{#1}{html/head/(meta|link)}{-}

Is there no way to make ConTeXt more tolerant, so that it is able to ignore those tags?

Many thanks for your help,

Pablo

mf

16 May

5:22 p.m.

See HTML-tidy (https://www.html-tidy.org/); it could help you pre-process your HTML files.

Massi

Pablo Rodriguez

6:37 p.m.

On 5/16/22 17:22, mf via ntg-context wrote:
> See HTML-tidy,
> https://www.html-tidy.org/
> it could help you pre-processing your HTML files.

Hi Massi,

the problem is that they aren’t my HTML files and that this is a very common error.

I’m afraid that pre-processing could work for a few files, but it wouldn’t work as a general solution for any HTML file I might need.

Many thanks for your help,

Pablo

Hans van der Meer

5:30 p.m.

Can't you use an editor with grep, searching for something like the pattern (with appropriate escapes, of course)?

dr. Hans van der Meer
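
As a minimal sketch of this kind of search-and-replace (the regular expression and the helper name `close_void_tags` are assumptions for illustration, not the pattern from the message above), the unclosed `<meta>` and `<link>` tags could be self-closed with Python's `re` module:

```python
import re

# Matches a <meta ...> or <link ...> start tag, whether or not it is
# already self-closed. Assumption: attribute values contain no literal ">".
VOID_TAG = re.compile(r"<(meta|link)\b([^>]*?)\s*/?>", re.IGNORECASE)


def close_void_tags(html: str) -> str:
    """Rewrite <meta ...> and <link ...> as self-closing <meta ... />."""
    return VOID_TAG.sub(lambda m: f"<{m.group(1)}{m.group(2)} />", html)
```

Calling `close_void_tags('<meta charset="utf-8">')` gives `<meta charset="utf-8" />`. This only touches those two tags; other sloppy markup (unquoted attributes, an explicit `</link>`, undefined entities) would still need a real HTML parser.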


Pablo Rodriguez

6:50 p.m.

On 5/16/22 17:30, Hans van der Meer via ntg-context wrote:
> Can't you use an editor with grep, searching for something like the pattern ?

Many thanks for your reply, dr. van der Meer.

If I want to typeset the whole book (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will have to download and sanitize over 20 HTML files. And I’m afraid this is only for a single PDF output.

It is really a pity that ConTeXt cannot totally ignore any given XML elements.

Many thanks for your help,

Pablo

Taco Hoekwater

8:13 p.m.

On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context wrote:
> On 5/16/22 17:30, Hans van der Meer via ntg-context wrote:
>> Can't you use an editor with grep, searching for something like the pattern ?
>
> Many thanks for your reply, dr. van der Meer.
>
> If I want to typeset the whole book (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will have to download and sanitize over 20 HTML files.

Which can be done with a couple of command lines. Xmllint usually does a good job of cleaning up dodgy HTML input:

xmllint --html --xmlout >

(As good as can be expected from a program, anyway.)

> It is really a pity that ConTeXt cannot totally ignore any given XML elements.

This statement is a little unfair: the problem is exactly that your input is NOT proper XML. If it was proper XML, ConTeXt would not have problems with it.

ConTeXt explicitly has the capability to handle XML files, which your input simply is not. In fact, it is sloppy HTML-esque data that modern web browsers happen to be able to handle more or less correctly. It is not valid HTML either, because valid HTML has to be valid SGML, which your input clearly is not.

That said, tools like xmllint exist for this stuff. Just write a small batch driver file in some scripting language ((power)shell, Lua, Python, Perl, etc.) to preprocess the HTML stuff into clean XML, and you should be fine.

Taco

-- 
Taco Hoekwater
E: taco@bittext.nl
genderfluid (all pronouns)
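
A minimal sketch of such a batch driver in Python, assuming xmllint is installed and on the PATH. The function names and the directory names in the usage line are hypothetical; only the `--html` and `--xmlout` flags come from the message above, and xmllint's `--output` option is used here in place of shell redirection.

```python
import subprocess
from pathlib import Path


def xmllint_command(src: Path, dst: Path) -> list[str]:
    """Build the xmllint call that parses HTML and serializes it as XML."""
    return ["xmllint", "--html", "--xmlout", "--output", str(dst), str(src)]


def clean_tree(src_dir: Path, dst_dir: Path) -> list[Path]:
    """Run xmllint on every .html file below src_dir; return written paths."""
    written = []
    for src in sorted(src_dir.rglob("*.html")):
        dst = dst_dir / src.relative_to(src_dir).with_suffix(".xml")
        dst.parent.mkdir(parents=True, exist_ok=True)
        # check=False: do not abort the whole batch when xmllint
        # reports problems with a single file.
        subprocess.run(xmllint_command(src, dst), check=False)
        if dst.exists():
            written.append(dst)
    return written
```

With the downloaded pages in a directory called `html`, `clean_tree(Path("html"), Path("xml"))` would mirror them as `.xml` files under `xml`. The output still deserves a manual check, since an automatic cleanup can only guess at what broken markup was meant to say.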

Pablo Rodriguez

17 May

6:36 p.m.

On 5/16/22 20:13, Taco Hoekwater via ntg-context wrote:
> On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context wrote:
>> [...]
>> If I want to typeset the whole book (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will have to download and sanitize over 20 HTML files.
>
> Which can be done with a couple of command lines. Xmllint usually does a good job of cleaning up dodgy html input:
>
> xmllint --html --xmlout >

Many thanks for your reply, Taco.

Since I have to recursively download the site (with "wget -r"), I hope I can find a way to pipe and get all in a single invocation.

>> It is really a pity that ConTeXt cannot totally ignore any given XML elements.
>
> This statement is a little unfair: the problem is exactly that your input is NOT proper XML.

My apologies. I really think ConTeXt rocks.

I wanted to write an introduction on how to typeset XML sources with ConTeXt (at least, in Spanish). One of the main issues I face is finding examples.

It seemed natural to me to use texts edited in HTML. But it turned out to be way trickier than I first thought.

Texts edited in HTML could be eye candy for some potentially interested people. But if one has to add a web crawler plus an XML sanitizer to the dependencies, this makes it way harder (even for myself).

> If it was proper XML, ConTeXt would not have problems with it. ConTeXt explicitly has the capability to handle XML files, which your input simply is not. In fact, it is sloppy HTML-esque data that modern webbrowsers happen to be able to handle more or less correctly. It is not valid HTML either, because valid HTML has to be valid SGML, which your input clearly is not.

I agree my input isn’t proper XML, but it is valid SGML. One of the main differences between the two is that SGML allows unclosed tags.

This is why cases such as this one are corner cases: https://validator.w3.org/nu/?doc=https%3A%2F%2Fseumasjeltzz.github.io%2FLing....

Since I considered this a corner case, I thought that a command such as \xmlignore{#1}{head/(meta|link)} would make sense.

> That said, Tools like xmllint exist for this stuff. Just write a small batch driver file in some scripting language ((power)shell, lua, python, perl, etc.) to preprocess the HTML stuff into clean XML, and you should be fine.

Many thanks for your reply again. Maybe all XML handling is way more complex than I originally thought.

Many thanks for your help,

Pablo

Thangalin

18 May

3:23 a.m.

> I wanted to write an introduction on how to typeset XML sources with ConTeXt (at least, in Spanish).

See: https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-projects/

It's in English, but describes a fair amount of what you're probably looking to accomplish, and there are all sorts of free translation services now.

> One of the main issues I face is to find examples.

See:

https://wiki.contextgarden.net/XML
https://wiki.contextgarden.net/Getting_Started_with_XML_and_ConTeXt_using_TE...

And themes for my text editor, KeenWrite, in particular:

https://github.com/DaveJarvis/keenwrite-themes/tree/main/xhtml
https://github.com/DaveJarvis/keenwrite-themes/tree/main/tarmes
https://github.com/DaveJarvis/keenwrite-themes/tree/main/boschet

> Maybe all XML handling is way more complex than I originally thought.

It takes some elbow grease. Conceptually, it's essentially mapping XML elements to xmlsetups, which are used to apply typesetting instructions.

Cheers!
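
The idea of mapping elements to setups can be pictured outside ConTeXt as well. The following is only an analogy in Python, not ConTeXt's own mechanism: the `SETUPS` table and the `flush` function are hypothetical names, and the emitted strings are plain text that merely looks like ConTeXt commands.

```python
import xml.etree.ElementTree as ET

# Hypothetical table from element names to output templates; it stands in
# for the element-to-setup association described above.
SETUPS = {
    "h1": lambda body: f"\\chapter{{{body}}}\n",
    "p": lambda body: f"{body}\\par\n",
    "em": lambda body: f"\\emph{{{body}}}",
}


def flush(element: ET.Element) -> str:
    """Render the children first, then apply the element's own setup."""
    parts = [element.text or ""]
    for child in element:
        parts.append(flush(child))
        parts.append(child.tail or "")
    body = "".join(parts)
    setup = SETUPS.get(element.tag)
    return setup(body) if setup else body  # unmapped elements pass through
```

Feeding it `ET.fromstring("<body><p>Some <em>text</em> here.</p></body>")` walks the tree and applies one rule per element. The sketch does no escaping of TeX special characters, which real input would need.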

Pablo Rodriguez

6 p.m.

On 5/18/22 03:23, Thangalin via ntg-context wrote:
>> […] I wanted to write an introduction on how to typeset XML sources with ConTeXt (at least, in Spanish).
>
> See: https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-projects/
>
> It's English, but describes a fair amount of what you're probably looking to accomplish, and there are all sorts of free translation services now.

Hi Dave,

many thanks for your reply.

Your introduction clearly states (https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-projects/#xhtml-t...):

  Even though ConTeXt can typeset XML documents, we’ll use XSLT, the verbose language only gurus grok without gripes, to convert XHTML into a Markdown document that pandoc can read to produce a native ConTeXt file.

I’m afraid I’m interested in typesetting XML documents with ConTeXt. Actually, I have been typesetting XHTML documents (generated by pandoc from Markdown sources) for years now.

Sorry for having explained myself like crap. I wanted to write an introduction on how to typeset XML sources in ConTeXt. I cannot see how free translation services may be of help here.

>> One of the main issues I face is to find examples.
>
> See:
>
> https://wiki.contextgarden.net/XML
> https://wiki.contextgarden.net/Getting_Started_with_XML_and_ConTeXt_using_TE...
>
> And themes for my text editor, KeenWrite, in particular:
>
> https://github.com/DaveJarvis/keenwrite-themes/tree/main/xhtml
> https://github.com/DaveJarvis/keenwrite-themes/tree/main/tarmes
> https://github.com/DaveJarvis/keenwrite-themes/tree/main/boschet

Sorry for explaining myself so poorly. One of the not irrelevant tasks for me is finding examples of XML code.

>> Maybe all XML handling is way more complex than I originally thought.
>
> It takes some elbow grease. Conceptually, it's essentially mapping XML elements to xmlsetups, which are used to apply typesetting instructions.

I agree, this is basically the idea. But my worries came from having to sanitize HTML sources (which aren’t strictly XML-compliant).

Many thanks for your help,

Pablo

Thangalin

7:14 p.m.

Hey Pablo,

> One of the not irrelevant tasks for me is finding examples of XML code.

To clarify, XHTML documents *are* XML documents. XHTML happens to use a standardized set of XML element and attribute names. All XHTML examples are also XML examples.

> But my worries came from having to sanitize HTML sources (which aren’t strict XML-compliant).

That was discussed in the blog post: finding a source of well-formed XHTML documents. There are a number of tools to sanitize HTML, as mentioned in the thread. KeenWrite uses the Java-based JSoup library (https://jsoup.org/) to sanitize HTML and then create an XHTML version.

All the best!
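
For readers without Java at hand, the same sanitize-then-serialize idea can be sketched with Python's standard `html.parser` module. This is a swapped-in illustration, not JSoup and not what KeenWrite does; the names `XmlWriter` and `html_to_xml` are hypothetical.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class XmlWriter(HTMLParser):
    """Re-serialize HTML as XML, self-closing void elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") get their own name as value.
        text = "".join(
            f" {name}={quoteattr(name if value is None else value)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{text}/>")
        else:
            self.out.append(f"<{tag}{text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.open_tags:
            return  # stray or redundant closing tag: nothing to close
        while self.open_tags:
            name = self.open_tags.pop()
            self.out.append(f"</{name}>")  # also closes unclosed children
            if name == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def result(self) -> str:
        self.close()
        closing = "".join(f"</{name}>" for name in reversed(self.open_tags))
        return "".join(self.out) + closing


def html_to_xml(html: str) -> str:
    """Return a well-nested XML serialization of the given HTML text."""
    writer = XmlWriter()
    writer.feed(html)
    return writer.result()
```

The sketch drops comments and the doctype, and it does not apply HTML's implied-end-tag rules (a second `<p>` ends up nested inside the first) or repair duplicate or oddly named attributes, so a dedicated sanitizer remains the safer choice for real sources.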

Pablo Rodriguez

21 May

7:01 p.m.

On 5/18/22 19:14, Thangalin via ntg-context wrote:
> Hey Pablo,
>
>> One of the not irrelevant tasks for me is finding examples of XML code.
>
> To clarify, XHTML documents /are/ XML documents. XHTML happens to use a standardized set of XML element and attribute names. All XHTML examples are also XML examples.

Hi Dave,

many thanks for the explanation.

>> But my worries came from having to sanitize HTML sources (which aren’t strict XML-compliant).
>
> That was discussed in the blog post: finding a source of well-formed XHTML documents. There are a number of tools to sanitize HTML, as mentioned in the thread. KeenWrite uses the Java-based JSoup library https://jsoup.org/ to sanitize HTML and then create an XHTML version.

After dealing with other (X)HTML sources, I have found that quite a few of them contain sloppily encoded data (as Taco pointed out). There are even some mismatches that xmllint doesn’t fix automatically (as Taco also mentioned).

Now I understand that I will also have to curate tidy XML sources to typeset them with ConTeXt.

Many thanks for your help again,

Pablo

Bruce Horrocks

19 May

12:09 a.m.

On 18 May 2022, at 17:00, Pablo Rodriguez via ntg-context wrote:
> Sorry for explaining myself so poorly.
>
> One of the not irrelevant tasks for me is finding examples of XML code.

Perhaps you could start by typesetting a technical source rather than prose?

I suggest trying to typeset the UK Meteorological Office's Shipping Forecast :-)

- web page version https://www.metoffice.gov.uk/weather/specialist-forecasts/coast-and-sea/ship...
- XML source data https://www.metoffice.gov.uk/public/data/CoreProductCache/ShippingForecast/L...
- as broadcast on the radio https://www.radio-uk.co.uk/podcasts/random-shipping-forecast

It's a good (in my opinion) source because it is amenable to being printed in several different ways: one might be to simply copy the webpage's layout, while another could be to use columns to fit more onto a single page of text.

Alternatively, a much more demanding exercise would be to typeset the user manual for the XML editing software "Oxygen": https://www.oxygenxml.com

The XML source for the manual is here: https://github.com/oxygenxml/userguide/blob/master/DITA/UserManual.ditamap

-- 
Bruce Horrocks
Hampshire, UK

Pablo Rodriguez

21 May

7:28 p.m.

On 5/19/22 00:09, Bruce Horrocks via ntg-context wrote:
> On 18 May 2022, at 17:00, Pablo Rodriguez via ntg-context wrote:
>> Sorry for explaining myself so poorly.
>>
>> One of the not irrelevant tasks for me is finding examples of XML code.
>
> Perhaps you could start by typesetting a technical source rather than prose?
>
> I suggest trying to typeset the UK Meteorological Office's Shipping Forecast :-)
>
> [...]
> It's a good (in my opinion) source because it is amenable to being printed in several different ways: one might be to simply copy the webpage's layout, while another could be to use columns to fit more onto a single page of text.

Hi Bruce,

many thanks for your advice. This could be a good way to practice things that I’m not used to. After all, the things you can do with pandoc are rather limited compared with what XML allows.

> Alternatively, a much more demanding exercise would be to typeset the user manual for the XML editing software "Oxygen": https://www.oxygenxml.com
>
> The XML source for the manual is here: https://github.com/oxygenxml/userguide/blob/master/DITA/UserManual.ditamap

Many thanks for your tip, but I’m afraid this isn’t my cup of tea.

But this reminded me of the Guidelines from the Text Encoding Initiative (https://tei-c.org). The PDF version of these Guidelines runs to over 2000 pages. It could also be a good (and demanding) exercise.

Many thanks for your help,

Pablo

juh

19 May

5:33 p.m.

Dear Pablo,

sorry for answering late, as I am on holiday learning Spanish in Salamanca. :-)

On Wed, May 18, 2022 at 06:00:20PM +0200, Pablo Rodriguez via ntg-context wrote:
> Sorry for explaining myself so poorly.
>
> One of the not irrelevant tasks for me is finding examples of XML code.

As I know that you are fluent in German, I would recommend https://deutschestextarchiv.de/

It is a collection of many, many texts in German with expired copyright, in TEI XML and other formats.

I had a hard time converting even one text to ConTeXt, but I got it to work. I had the crazy idea of setting up a process where I can simply download the TEI XML source and make a nice book of the text.

Saludos!

juh

-- 
Author homepage: .......... http://literatur.hasecke.com
Satires & essays: ......... http://www.sudelbuch.de
Private blog: ............. http://www.hasecke.eu
Net literature project: ... http://www.generationenprojekt.de

Pablo Rodriguez

21 May

8:23 p.m.

On 5/19/22 17:33, juh via ntg-context wrote:
> Dear Pablo,
>
> sorry for answering late as I am on holidays learning Spanish in Salamanca. :-)

Many thanks for your reply, Jan-Ulrich. I hope you are enjoying your experience in Spain.

> On Wed, May 18, 2022 at 06:00:20PM +0200, Pablo Rodriguez via ntg-context wrote:
>> Sorry for explaining myself so poorly.
>>
>> One of the not irrelevant tasks for me is finding examples of XML code.
>
> As I know that you are fluent in German I would recommend
>
> https://deutschestextarchiv.de/

Good advice, since the DTA contains TEI XML sources.

> It is a collection of many, many texts in German with expired copyright in TEI XML and other formats.
>
> I had a hard time to convert even one text to ConTeXt, but I've got it to work. I had the crazy idea to get a process where I simply can download the TEI XML source and make a nice book of the text.

Just a comment. In my experience with computers, the first time doing anything is the hardest.

Many thanks for your help,

Pablo
