Hi,

The attached t.tex file produces the attached t.xhtml file. I have looked at the following documents:

- http://en.wikipedia.org/wiki/EPUB#Open_Publication_Structure_2.0.1
- http://en.wikipedia.org/wiki/DTBook
- http://www.idpf.org/epub/20/spec/OPS_2.0.1_draft.htm
- http://www.w3.org/TR/xhtml11/doctype.html
- http://www.w3.org/TR/html5/sections.html

It seems that the macros in t.tex are being written out as XML elements, verbatim. It is my understanding that these XML elements, however, do not conform to the minimal content models associated with XHTML 1.1.

What needs to happen to take a minimal ConTeXt file (such as the attached) to produce a minimum viable EPUB that:

- Generates XHTML headers (including <!DOCTYPE and <html...>)
- Produces images as img tags, rather than float tags.
- Uses typical XHTML tags for <body> elements (e.g., <ol> for ordered lists).

Ideally, I would like to do something such as:

- context t.tex
- mtxrun --script epub --make t.specification

to generate an EPUB that passes validation by epubcheck (http://code.google.com/p/epubcheck/wiki/Library), with an output XHTML file that more closely matches the XHTML specification.

How can I help?

Kind regards.
On 9/4/2013 3:19 AM, Thangalin wrote:
Hi,
The attached t.tex file produces the attached t.xhtml file. I have looked at the following documents:
* http://en.wikipedia.org/wiki/EPUB#Open_Publication_Structure_2.0.1
* http://en.wikipedia.org/wiki/DTBook
* http://www.idpf.org/epub/20/spec/OPS_2.0.1_draft.htm
* http://www.w3.org/TR/xhtml11/doctype.html
* http://www.w3.org/TR/html5/sections.html
It seems that the macros in t.tex are being written out as XML elements, verbatim. It is my understanding that these XML elements, however, do not conform to the minimal content models associated with XHTML 1.1.
You get a representation in XML indeed, but not verbatim; it is as close as possible to the generic (parent) structure elements in ConTeXt. Of course we could alternatively export everything as <div class="tag-subtag-...">, but I don't like that too much; HTML itself is not rich enough for our purpose.
What needs to happen to take a minimal ConTeXt file (such as the attached) to produce a minimum viable EPUB that:
* Generates XHTML headers (including <!DOCTYPE and <html...>)
not needed as we're 'standalone'
* Produces images as img tags, rather than float tags.
The CSS can deal with them (info is written to files for that). The only really problematic thing is hyperlinks, as CSS has no provision for those, so there's an option to inject <a>...
* Uses typical XHTML tags for <body> elements (e.g., <ol> for ordered lists).
XHTML has no typical tags... it's XML + CSS (or XSLT). Unfortunately, browsers have messed up HTML so much (extensions, too-tolerant support for unmatched tags, different rendering models) that XHTML never really took off. The export of ConTeXt is in fact just XML, and by tagging it as XHTML we can apply CSS to it. If someone has a workflow for producing EPUB, an option is to postprocess that XML file into whatever EPUB one wants (i.e., the export is generic and carries as much info as possible).
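Such a postprocessing step might look like the following sketch. This is purely illustrative: the element names in TAG_MAP are guesses at what the export emits and must be checked against an actual export file (e.g. t.xhtml) before relying on any of it.

```python
# Sketch: take the generic ConTeXt XML export and rewrite it as
# standalone XHTML. TAG_MAP entries are assumptions, not the actual
# export vocabulary.
import xml.etree.ElementTree as ET

TAG_MAP = {
    "itemgroup": "ol",   # itemized/ordered lists
    "item": "li",
    "paragraph": "p",
    "float": "img",      # would also need src= copied from the float data
}

def to_xhtml(source: str) -> str:
    """Rename known export tags; everything else becomes a classed div."""
    root = ET.fromstring(source)
    for el in root.iter():
        if el.tag in TAG_MAP:
            el.tag = TAG_MAP[el.tag]
        else:
            el.set("class", el.tag)  # keep the original name as a class
            el.tag = "div"
    body = ET.tostring(root, encoding="unicode")
    return (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
        '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        + body
        + "</body></html>"
    )
```

A script along these lines would fit between the `context` run and the EPUB packaging step, producing the <!DOCTYPE and <html> shell that the standalone export deliberately omits.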
Ideally, I would like to do something such as:
* context t.tex
* mtxrun --script epub --make t.specification
to generate an EPUB that passes validation by epubcheck (http://code.google.com/p/epubcheck/wiki/Library), with an output XHTML file that more closely matches the XHTML specification.
Every time we look into EPUB there's another issue... it's not a standard but a reverse-engineered application mess (happens often with XML: turn some application data structures into XML and call it a standard). I only tested (long ago already) with some Firefox plugin (I don't have a recent EPUB device, only an old first-generation one which is dead slow, never really used, probably broken by now), and I refuse to buy a new one till resolution is decent (and I only want generic devices, not something bound to some shop).
How can I help?
By testing. As I have no real use/demand for EPUB, it's not something I look into on a daily basis.

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
Hi.

of course we could alternatively export all as <div class="tag-subtag-...">
but i don't like that too much; html itself is not rich enough for our purpose
What about giving developers the ability to change the destination element? For example:

\setuplist[chapter][
  xml={\starttag[h1]#1\stoptag}
]

Would produce, upon export:

<h1>Chapter</h1>

Or (using "export" instead of "xml"; I don't care what it is named):

\setuplist[chapter][
  export={\starttag[div]\startattribute[class]{chapter}#1\stopattribute\stoptag}
]

Similarly, this would produce:

<div class="chapter">Chapter</div>

This would offer the flexibility of custom XML documents without affecting the default behaviour.

* Generates XHTML headers (including <!DOCTYPE and <html...>)
not needed as we're 'standalone'
Having the ability to produce the <!DOCTYPE...> and <html> elements could be as simple as:

\setupexport[
  standalone=no,
]
* Produces images as img tags, rather than float tags.
the css can deal with them (info is written to files for that)
Yes, but they aren't standard. There is an ecosystem of tools (e.g., Calibre, normalizing CSS templates, etc.), not to mention a widespread knowledge base, that groks the minimal XHTML specification. Plus, using XML tags that are not in the minimal XHTML spec means more testing on more devices to make sure that their XHTML parsers render correctly.
xhtml has no typical tags .. it's xml + css (or xslt) ... unfortunately browsers have
That is, a Strictly Conforming XHTML Document, as per:

http://www.w3.org/TR/2000/REC-xhtml1-20000126/#docconf

the export of context is in fact just xml, and by tagging it as xhtml we can apply css to it; but if someone has a workflow for producing epub an option is to postprocess that xml file into whatever epub one wants
I could transform the ConTeXt-generated XML into strictly conforming XHTML, but it was a step I was hoping to avoid. Right now my process is:

1. Convert XML data to a ConTeXt .tex file.
2. Convert ConTeXt to either PDF or EPUB.
3. Stylize the EPUB using CSS.

I want to use ConTeXt here (instead of going directly from XML data to EPUB) because ConTeXt provides functionality such as multiple indexes, table-of-contents, and bundling the .epub. Having an extra step to generate strictly conforming XHTML is architecturally painful, as it means transforming the document three times (XML -> ConTeXt, ConTeXt -> XML, then XML -> XHTML).
Every time we look into EPUB there's another issue... it's not a standard but a reverse-engineered application mess (happens often with XML: turn some application data structures into XML and call it a standard)
Some book vendors only accept validating EPUBs. ConTeXt is documented as being able to generate EPUBs. The documentation should state that the EPUBs do not validate and do not contain strictly conforming XHTML.

I have spent the last three weeks converting documents from LaTeX to ConTeXt because the documentation stated that ConTeXt can produce EPUBs. While true, the documentation did not mention its shortcomings. Had I known in advance, I probably would have gone straight to EPUB using Java or, with a little revulsion, PHP classes. ;-) That said, I probably should have tested this feature sooner. :-)

as I have no real use/demand for epub it's not something I look into on a daily basis
How can I help resolve these issues? Merely "testing" (which I am happy to do) isn't going to produce a strictly conforming XHTML document.

Kindest regards.
On 9/4/2013 7:55 PM, Thangalin wrote:
Hi.
of course we could alternatively export all as <div class="tag-subtag-..."> but i don't like that too much; html itself is not rich enough for our purpose
What about giving developers the ability to change the destination element? For example:
\setuplist[chapter][
  xml={\starttag[h1]#1\stoptag}
]
Would produce, upon export:
<h1>Chapter</h1>
Export doesn't happen at that level; something like that would add ugly overhead. It's way easier to make some XSLT script that converts the rather systematic export into something like that, and it only has to be written once, by someone (not me).
Or (using "export" instead of "xml"; I don't care what it is named):
\setuplist[chapter][
  export={\starttag[div]\startattribute[class]{chapter}#1\stopattribute\stoptag}
]
Similarly, this would produce:
<div class="chapter">Chapter</div>
You use some TeX syntax, but it all happens in Lua. Also, the only way to provide some kind of different tagging is to support plugins (read: Lua functions) that could override the default behaviour (but again, it's quite easy to do that as a postprocessing step).
This would offer the flexibility of custom XML documents without affecting the default behaviour.
* Generates XHTML headers (including <!DOCTYPE and <html...>)
not needed as we're 'standalone'
Having the ability to produce the <!DOCTYPE...> and <html> elements could be as simple as:

\setupexport[
  standalone=no,
]
* Produces images as img tags, rather than float tags.
the css can deal with them (info is written to files for that)
Yes, but they aren't standard. There is an ecosystem of tools (e.g., Calibre, normalizing CSS templates, etc.), not to mention a widespread knowledge base, that groks the minimal XHTML specification. Plus, using XML tags that are not in the minimal XHTML spec means more testing on more devices to make sure that their XHTML parsers render correctly.
Most of the XML we get here is a funny mix of whatever tags and HTML (often for tables), and normally there is way more structure than in the average HTML document. The export is meant to be close to the source, and turning it into some HTML/div mixture makes it messy. For instance, we have more levels than H1..H6, so how to do H7? If someone has to deal with that, he/she can as well transform all into H1 with some class, which is a local solution then.
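The H7 problem above is easy to illustrate: any postprocessor has to pick a convention for depths XHTML cannot express. A minimal sketch (the class names are made up for illustration):

```python
# Sketch: map an arbitrary sectioning depth onto XHTML, which only
# defines h1..h6. Deeper levels fall back to a classed div, the
# "local solution" mentioned above.
def heading_markup(depth: int, text: str) -> str:
    if 1 <= depth <= 6:
        return f'<h{depth} class="level-{depth}">{text}</h{depth}>'
    # no h7 exists, so encode the depth in a class attribute instead
    return f'<div class="heading level-{depth}">{text}</div>'
```

With a CSS rule per `level-N` class, the visual hierarchy survives even where the tag hierarchy runs out.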
xhtml has no typical tags .. it's xml + css (or xslt) ... unfortunately browsers have
That is, a Strictly Conforming XHTML Document, as per:
http://www.w3.org/TR/2000/REC-xhtml1-20000126/#docconf
the export of context is in fact just xml, and by tagging it as xhtml we can apply css to it; but if someone has a workflow for producing epub an option is to postprocess that xml file into whatever epub one wants
Indeed, that was the idea: export XML, tag it as XHTML (with the option to provide hyperlinks, an exception), provide some standard CSS as a starter, and then let users deal with matters the way they like. You can be pretty sure that what you want is not the same as what someone else wants; and if more people want it, they can together write a transformation script (or hire someone). Keep in mind that the export itself is already tricky enough, and for me it doesn't pay off to provide tons of additional functionality (well, it doesn't pay off to export anyway).
I could transform the ConTeXt-generated XML into strictly conforming XHTML, but it was a step I was hoping to avoid. Right now my process is:
1. Convert XML data to a ConTeXt .tex file.
2. Convert ConTeXt to either PDF or EPUB.
3. Stylize the EPUB using CSS.
But writing the transform that suits you is just one step (with you spending the time on it), while extending the export into a complete transformation and configuration thing would put the burden on me. :-)
I want to use ConTeXt here (instead of going directly from XML data to EPUB) because ConTeXt provides functionality such as multiple indexes, table-of-contents, and bundling the .epub. Having an extra step to generate strictly conforming XHTML is architecturally painful as it means transforming the document three times (XML -> ConTeXt, ConTeXt -> XML, then XML -> XHTML).
Why is it painful? The export is quite generic and will not change; it is also flexible, as it honors user-defined sectioning and styling.
Every time we look into EPUB there's another issue... it's not a standard but a reverse-engineered application mess (happens often with XML: turn some application data structures into XML and call it a standard)
Some book vendors only accept validating EPUBs. ConTeXt is documented as being able to generate EPUBs. The documentation should state that the EPUBs do not validate and do not contain strictly conforming XHTML.
Well, I, Luigi, and some others did tests. The thing is that EPUB is evolving, and we had quite some conflicting validations (and specs), and we try as well as possible to adapt. So you need to be more precise about "doesn't validate": it's proper XML and therefore proper XHTML (and nothing says that there should be HTML tags).
I have spent the last three weeks converting documents from LaTeX to ConTeXt because the documentation stated that ConTeXt can produce EPUBs. While true, the documentation did not mention its shortcomings. Had I known in advance, I probably would have gone straight to EPUB using Java or, with a little revulsion, PHP classes. ;-) That said, I probably should have tested this feature sooner. :-)
The export is a reconstruction of the input, and the more structure the better. If you really need multiple output formats, you should use XML as source and then use ConTeXt for PDF creation and XSLT for HTML creation. I really see no problem with a transformation from the generic export to some EPUB (whatever variant your whatever device supports)... Really: you cannot expect me to provide an extensively configurable export system (for only one user) that will never suit all users. Also, configuring it for some document is probably as much work as writing an XSLT transformation.
as i have no real use/demand for epub it's not something i look into on a daily basis
How can I help resolve these issues?
Merely "testing" (which I am happy to do) isn't going to produce a strictly conforming XHTML document.
Indeed, it isn't producing an HTML document (with properly matched tags), but I'm not convinced that it isn't XHTML.

Hans
On Thu, 5 Sep 2013 19:22:42, Aditya Mahajan wrote:
How easy is it to create a new export format? IIRC, context keeps track of the entire document tree, and flushes the XML output only at the end. Is it possible to make this pluggable so that users can write their own transformers (in lua) for how the document tree is written? This would enable more output formats (opendocument and (shudder) latex).
Or, (gasp!) MS Word .docx.

Alan
On 9/4/2013 11:20 AM, Hans Hagen wrote:
you get a representation in xml indeed, but not verbatim, but as close as possible to the generic (parent) structure elements in context
probably the most straightforward xhtml export is a file with only

<div> </div>

i.e. only divs and spans
Hi,
<div> </div>, i.e. only divs and spans
I think that would be a more robust output format, technically easier to adapt, and more readily conforming to the strict XHTML tag subset.

The other issue I encountered was this:

\startfrontmatter
  \startstandardmakeup
    Title page
  \stopstandardmakeup
  \startstandardmakeup
    Copyright
  \stopstandardmakeup
  \completecontent
\stopfrontmatter

This produced "*Title pageCopyright*" as text without any markup, which makes the EPUB output a bit difficult to parse. I thought the software should output something like:

<div class="frontmatter">
  <div id="standardmakeup1" class="standardmakeup">Title page</div>
  <div id="standardmakeup2" class="standardmakeup">Copyright</div>
  <div class="contents"><!-- etc... --></div>
</div>

This way the title and copyright pages can be styled independently.

Kindest regards.
On Thu, Sep 05, 2013 at 09:57:59AM -0700, Thangalin wrote:
Hi,
<div> </div>, i.e. only divs and spans
I think that would be a more robust output format, technically easier to adapt, and more readily conforming to the strict XHTML tag subset.
What about accessibility? I expect that visually impaired people would depend on document structure rather than its visualisation.

Regards,
Khaled
On 9/5/2013 7:57 PM, Khaled Hosny wrote:
On Thu, Sep 05, 2013 at 09:57:59AM -0700, Thangalin wrote:
Hi,
<div> </div>, i.e. only divs and spans
I think that would be a more robust output format, technically easier to adapt, and more readily conforming to the strict XHTML tag subset.
What about accessibility? I expect that visually impaired people would depend on document structure rather than its visualisation.
For that purpose I'd make a nice special doc. But the basic export has at least a similar structure to the original. (After all, that's one of the reasons why we *can do* an export.)

Hans
On Thu, 5 Sep 2013, Hans Hagen wrote:
On 9/4/2013 11:20 AM, Hans Hagen wrote:
you get a representation in xml indeed, but not verbatim, but as close as possible to the generic (parent) structure elements in context
probably the most straightforward xhtml export is a file with only

<div> </div>, i.e. only divs and spans
How easy is it to create a new export format? IIRC, ConTeXt keeps track of the entire document tree and flushes the XML output only at the end. Is it possible to make this pluggable so that users can write their own transformers (in Lua) for how the document tree is written? This would enable more output formats (OpenDocument and (shudder) LaTeX).

Aditya
On 9/5/2013 7:22 PM, Aditya Mahajan wrote:
On Thu, 5 Sep 2013, Hans Hagen wrote:
On 9/4/2013 11:20 AM, Hans Hagen wrote:
you get a representation in xml indeed, but not verbatim, but as close as possible to the generic (parent) structure elements in context
probably the most straightforward xhtml export is a file with only

<div> </div>, i.e. only divs and spans
How easy is it to create a new export format? IIRC, context keeps track of the entire document tree, and flushes the XML output only at the end. Is it possible to make this pluggable so that users can write their own transformers (in lua) for how the document tree is written? This would enable more output formats (opendocument and (shudder) latex).
Sure, but first I want to clean up some code (it's rather complex)... In principle there is a document tree, so one can plug into that; alternatively, one can load the XML tree and mess with that (probably easier if we provide some styles for it).

Hans
On 2013-09-04 Thangalin wrote:
What needs to happen to take a minimal ConTeXt file (such as the attached) to produce a minimum viable EPUB that:
It is always difficult to parse and further process poorly structured plain text without advanced semantics. Garbage in, garbage out.

If you need both EPUB and PDF, start with a semantically rich XML vocabulary, e.g. DocBook. In that case you can relatively easily transform (XSLT) the input data into almost any format. Basic outputs like EPUB or PDF (via XSL-FO) you can get out of the box. The ConTeXt output can be generated using dbcontext: http://dblatex.sourceforge.net/

In sum, use XML as your primary source and derive everything else from it.

Jan
participants (6)
- Aditya Mahajan
- Alan BRASLAU
- Hans Hagen
- honyk
- Khaled Hosny
- Thangalin