How to process simple HTML files with LuaTeX


Mojca Miklavec

13 Sep 2007

3:04 p.m.

Hello,

I was trying to figure out how to process simple HTML files with the new code, but I fail to understand the details. Here's a simple file I would like to process:

<html>
<head>
<title>My first HTML2ConTeXt</title>
</head>
<body>
<h1>Main Title</h1>
<p>Some text ...</p>
<h2>Subtitle</h2>
<p>Some text again ...</p>
<h1>Second title</h1>
<p>... and not much more text here either ...</p>
</body>
</html>

And the failed tries here:

% engine=luatex
\setupcolors[state=start]
\setuphead[subject][style=bfa,color=blue]
\setuphead[subsubject][style=tfa,color=blue]

\starttext
\xmlload{main}{test.html}{}
\xmlgrab{main}{h1}{h1}
\xmlgrab{main}{h2}{h2}

\startxmlsetups h1
\subject{H1: #1}
\stopxmlsetups

\startxmlsetups h2
\subsubject{H2: #1}
\stopxmlsetups

How to grab only the title out of here?

\xmlfilter{main}{html/head/title}

\xmlflush{main}
\stoptext

Any hints most welcome.

Thanks a lot,
Mojca


Hans Hagen

14 Sep

12:22 a.m.


keep in mind that this is still somewhat experimental

% best define mappings before loading the file

\startxmlsetups all:html
\xmlsetsetup{main}{head|h1|h2}{*}
\stopxmlsetups

\xmlregistersetup{all:html}

% register this so that it's done for each load

\startxmlsetups h1
\subject{\xmlflush{#1}}
\stopxmlsetups

\startxmlsetups h2
\subsubject{\xmlflush{#1}}
\stopxmlsetups

\startxmlsetups head
\startstandardmakeup
THIS IS ABOUT: \xmlfilter{main}{/head/title/text()}
\stopstandardmakeup
\stopxmlsetups

% that's it

\setupcolors[state=start]
\setuphead[subject][style=\bfd,color=blue]
\setuphead[subsubject][style=\bfc,color=blue]

\starttext
\xmlprocess{main}{test.html}{}
\stoptext

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------

Mojca Miklavec

3:46 p.m.

On 9/14/07, Hans Hagen wrote:


keep in mind that this is still somewhat experimental

Sure :) That's why I'm sending files for testing :) :) :)


Great! This works perfectly and seems much easier to write than the old code, though I still have no idea how to implement some parts of it:
- where to plug in the entities such as &nbsp;, &le;, ...
- how to catch classes: how to differentiate between <h1>title</h1> and <h1 class="...">title</h1>
- and some more - there are some simple examples in the attachment (too long to copy-paste)

Thanks again,
Mojca

Hans Hagen

4:19 p.m.

Mojca Miklavec wrote:

Great! This works perfectly and seems much easier to write than the old code, though I still have no idea how to implement some parts of it:
- where to plug in the entities such as &nbsp;, &le;, ...

\xmlutfize{main}

or just load the regular entity handlers (mkii still works and can be used mixed)

- how to catch classes: how to differentiate between <h1>title</h1> and <h1 class="...">title</h1>
- and some more - there are some simple examples in the attachment (too long to copy-paste)

\doifelse {\xmlatt{#1}{class}} {whatever} { dothis } { dothat }

Hans

Mojca Miklavec

16 Sep

12:29 p.m.

On 9/14/07, Hans Hagen wrote:

Mojca Miklavec wrote:

Great! This works perfectly and seems much easier to write than the old code, though I still have no idea how to implement some parts of it:
- where to plug in the entities such as &nbsp;, &le;, ...

\xmlutfize{main}

Thanks. I saw it, but had no idea how to use it. I need to test more extensively ... :)

- how to catch classes: how to differentiate between <h1>title</h1> and <h1 class="...">title</h1>
- and some more - there are some simple examples in the attachment (too long to copy-paste)

\doifelse {\xmlatt{#1}{class}} {whatever} { dothis } { dothat }

I have tried exactly that before, but this example fails to work for me, or I don't know how to apply it:

% test.html
<html>
<body>
<h1>Title 1</h1>
<h1 class="different">Title 2</h1>
</body>
</html>

% test.tex
\startxmlsetups all:html
\xmlsetsetup{main}{h1}{*}
\stopxmlsetups

\xmlregistersetup{all:html}

\startxmlsetups h1
This title belongs to class (\xmlatt{#1}{class}): \xmlflush{#1}.\par
\stopxmlsetups

\starttext
\xmlprocess{main}{test.html}{}
\stoptext

Class always comes out empty.

Thanks a lot,
Mojca

Hans Hagen

11:55 p.m.

Mojca Miklavec wrote:

I have tried exactly that before, but this example fails to work for me, or I don't know how to apply it:

i rewrote the parser (both xml and semi-xpath) so it may have been broken, i'll upload a new beta tomorrow

Hans

Mojca Miklavec

2 Oct

5:17 a.m.

On 9/16/07, Hans Hagen wrote:


i rewrote the parser (both xml and semi-xpath) so it may have been broken, i'll upload a new beta tomorrow

Hello Hans,

Thanks a lot for fixing the issue with non-working \xmlatt. Now, I'm still slightly lost regarding two issues:
- How to remove unneeded space? With \ignorespaces?
- How to use the new verbatim code? I have tried to use \xmlsetfunction{main}{pre}{lxml.verbatim} but it didn't really work.

% test.tex:
\startxmlsetups all:html
\xmlsetsetup{main}{h1|pre}{*}
\stopxmlsetups

\xmlregistersetup{all:html}

% is this the proper way?
\startxmlsetups h1
\subject{\ignorespaces\xmlflush{#1}}
\stopxmlsetups

\startxmlsetups pre
{\bgroup\tt\obeylines\xmlflush{#1}\egroup}
\stopxmlsetups

\starttext
\xmlprocess{main}{test.html}{}
\stoptext

% test.html
<?xml version="1.0" encoding="utf-8"?>
<html><body>
<h1> How to get rid of this spacing in some elegant way? </h1>
<p>Title followed by a paragraph ...</p>
<pre> and some source c@de </pre>
</body></html>

Also, this fails because of the empty line:

<h1> How to get rid of this spacing in some elegant way? </h1>

Thanks a lot,
Mojca

Hans Hagen

14 Sep

4:26 p.m.

Mojca Miklavec wrote:


Sure :) That's why I'm sending files for testing :) :) :)

- i'll make a table mapper (need it anyway), cals tables are already provided
- idem for preformatted and verbatim
- your code:

d[k] = dk:gsub("&nbsp;",' ')
dk = d[k]
d[k] = dk:gsub("&le;", '\\mathematics{\\le}')

local dk = d[k]
dk = dk:gsub("&nbsp;",' ')
dk = dk:gsub("&le;", '\\mathematics{\\le}')
d[k] = dk

or ....

mojcasentities = { nbsp = " ", le = "\\mathematics{\\le}" }

d[k] = d[k]:gsub("&(.-);",mojcasentities)

(there probably already is code for that)
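A minimal sketch of the table-driven replacement suggested above, runnable in a plain Lua interpreter outside ConTeXt. The names `entities` and `replace_entities` are hypothetical and chosen only for illustration; the thread does not show where such a table would be hooked into the XML loader.

```lua
-- Hypothetical entity table: keys are entity names without "&" and ";".
local entities = {
  nbsp = " ",
  le   = "\\mathematics{\\le}",
}

-- string.gsub accepts a table as replacement: the first capture is used
-- as the key, and a missing (nil) entry leaves the match untouched.
local function replace_entities(text)
  local result = text:gsub("&(.-);", entities)  -- keep the string, drop the match count
  return result
end

print(replace_entities("a&nbsp;&le;&nbsp;b and &unknown;"))
```

Entities that are missing from the table, such as `&unknown;` here, pass through unchanged, so the table can start small and grow as needed.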

Mojca Miklavec

16 Sep

12:31 p.m.

On 9/14/07, Hans Hagen wrote:


- i'll make a table mapper (need it anyway), cals tables are already provided

- idem for preformatted and verbatim

Thanks a lot. I'm waiting patiently :)

- your code:

d[k] = dk:gsub("&nbsp;",' ')
dk = d[k]
d[k] = dk:gsub("&le;", '\\mathematics{\\le}')

local dk = d[k]
dk = dk:gsub("&nbsp;",' ')
dk = dk:gsub("&le;", '\\mathematics{\\le}')
d[k] = dk

or ....

mojcasentities = { nbsp = " ", le = "\\mathematics{\\le}" }

d[k] = d[k]:gsub("&(.-);",mojcasentities)

Thanks a lot!

(there probably already is code for that)

Yes, I saw it, but didn't try to understand what the &(.-) is for. In any case, that was the wrong place to replace le with something.

Thanks again,
Mojca

Aditya Mahajan

8:07 p.m.

On Sun, 16 Sep 2007, Mojca Miklavec wrote:

On 9/14/07, Hans Hagen wrote:

mojcasentities = { nbsp = " ", le = "\\mathematics{\\le}" }

d[k] = d[k]:gsub("&(.-);",mojcasentities)

Yes, I saw it, but didn't try to understand what the &(.-) is for.

(Caveat: I do not really know lua regex, and have not tried out the code)

Assuming lua follows standard regex syntax, this means

&   # The letter &
(   # start a group
.   # any character
-   # As few as needed
)   # end group
;   # the letter ;

so this will match all entities. If it helps, the equivalent vim regex will be \&\(.\{-}\);

I guess that $1 (the first group, that is everything that matches .-) will be compared with mojcasentities table and replaced accordingly.

This looks like a really nice feature of lua. In Ruby and Vim, I often find myself writing a bunch of similar regex, and always wished there was something like what lua does.

Aditya
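To illustrate the pattern discussed above, here is a small sketch in plain Lua (no ConTeXt needed) that lists the captures of `&(.-);` and contrasts the lazy `-` with the greedy `*`. The helper name `entity_names` and the sample string are assumptions made for this example.

```lua
-- Collect every capture of "&(.-);": a literal "&", the shortest possible
-- run of arbitrary characters (the capture), then a literal ";".
local function entity_names(text)
  local names = {}
  for name in text:gmatch("&(.-);") do
    names[#names + 1] = name
  end
  return names
end

local sample = "x &le; y &nbsp; z"

print(table.concat(entity_names(sample), ", "))

-- With the greedy "*" the capture runs on to the last ";" in the string.
print((sample:match("&(.*);")))
```

With the lazy `-` each entity name is captured separately; the greedy variant captures everything between the first `&` and the last `;` in one piece.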

Hans Hagen

11:58 p.m.

Aditya Mahajan wrote:


(Caveat: I do not really know lua regex, and have not tried out the code)

they are not regexp but expressions -)

Assuming lua follows standard regex syntax, this means

&   # The letter &
(   # start a group
.   # any character
-   # As few as needed
)   # end group
;   # the letter ;

so this will match all entities.

just &(.-); with () being the capture

If it helps, the equivalent vim regex will be \&\(.\{-}\);

I guess that $1 (the first group, that is everything that matches .-) will be compared with mojcasentities table and replaced accordingly.

indeed

This looks like a really nice feature of lua. In Ruby and Vim, I often find myself writing a bunch of similar regex, and always wished there was something like what lua does.

the nice thing about many lua features is that less code (lua c code) behaves more powerful
