[NTG-context] XML, dealing with whitespace

denis.maier at unibe.ch
Sat Jan 15 13:04:32 CET 2022


Hi all,

I have sources that look like this:

%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <p>Bla Bla Bla</p>
   <p>
      <underline>
         <italic>Bla</italic>
      </underline>, Bla Bla.</p>
</article>
%%%%%%%%%%%%%%%%%%%%%

Typesetting this with ConTeXt gives me a spurious space after the underlined, italic "Bla" (i.e. before the comma). Complete MWE:

%%%%%%%%%%%%%%%%%%%%%
\startxmlsetups xml:test
    \xmlsetsetup{#1}{*}{-}
    \xmlsetsetup{#1}{article|p|italic|underline}{xml:*}
\stopxmlsetups

\xmlregistersetup{xml:test}

\startxmlsetups xml:article
\starttext
    \xmlflush{#1}
\stoptext
\stopxmlsetups

\startxmlsetups xml:p
    \xmlflush{#1}\par
\stopxmlsetups

\startxmlsetups xml:italic
    \emph{\xmlflush{#1}}
\stopxmlsetups

\startxmlsetups xml:underline
    \underbar{\xmlflush{#1}}
\stopxmlsetups

\startbuffer[test]
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <p>Bla Bla Bla</p>
   <p>
      <underline>
         <italic>Bla</italic>
      </underline>, Bla Bla.</p>
</article>
\stopbuffer

\xmlprocessbuffer{test}{test}{}
%%%%%%%%%%%%%%%%%%%%%

How can I get rid of spurious leading and trailing whitespace? I've found \xmlstrip and \xmlstripped, but I don't really understand how they work. I've also found out about
\ignorespaces\xmlflush{#1}\removeunwantedspaces
but this then has to be added to every setup, which would be a bit tedious...
There were a couple of similar questions by Hans van der Meer about a decade ago, but I couldn't find an answer to them.
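
To make the per-setup workaround concrete, this is roughly what it would look like for two of the setups from the MWE above. It is only a sketch: \MyFlushTrimmed is a hypothetical helper name I made up to avoid repeating the two commands, and I am assuming that \ignorespaces and \removeunwantedspaces behave here the same way as when they are written out in each setup.

```tex
% Hypothetical helper (not a ConTeXt built-in): flush the node content,
% skipping leading whitespace and removing trailing whitespace.
\def\MyFlushTrimmed#1{\ignorespaces\xmlflush{#1}\removeunwantedspaces}

\startxmlsetups xml:underline
    % same as the MWE setup, but with the trimmed flush
    \underbar{\MyFlushTrimmed{#1}}
\stopxmlsetups

\startxmlsetups xml:p
    \MyFlushTrimmed{#1}\par
\stopxmlsetups
```

Even with such a helper, every setup still has to call it explicitly, which is what I would like to avoid.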

\xmlstripanywhere is also mentioned in xml-mkiv.pdf, but it isn't explained there. I found one example in the sources (https://source.contextgarden.net/tex/context/modules/mkiv/x-html.mkiv?search=%5Cxmlstripanywhere#l50), but what does it do? Is it somehow needed for \xmlstrip and friends to work?

So, what would be the best way to deal with this situation? (More details below; perhaps there's an easier solution outside of ConTeXt, because the problem is actually caused by XSLT...)

Best,
Denis


P.S. Background:

I convert docx files to JATS XML with pandoc. Pandoc does quite a decent job, but I need to tweak a few things with XSLT. The actual transformation that I need works OK, but it also causes other problems.
This is the original Markdown file:

%%%%%%%%%%%%%%%%%%%%%%%
Bla Bla Bla

[*Bla*]{.underline}, Bla Bla.
%%%%%%%%%%%%%%%%%%%%%%%

Pandoc produces a JATS XML file that looks like this (simplified, empty nodes deleted):

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="utf-8" ?>
<article>
<body>
<p>Bla Bla Bla</p>
<p><underline><italic>Bla</italic></underline>, Bla Bla.</p>
</body>
</article>
%%%%%%%%%%%%%%%%%%%%%%%

I use this XSL stylesheet to tweak pandoc's output:

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">

<xsl:output method="xml" indent="yes" />

<!-- <xsl:strip-space elements="*"/> -->

<xsl:template match="*">
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<!-- <xsl:template match="node()|@*"> -->
     <!-- <xsl:copy> -->
       <!-- <xsl:apply-templates select="node()|@*"/> -->
     <!-- </xsl:copy> -->
<!-- </xsl:template> -->

</xsl:stylesheet>
%%%%%%%%%%%%%%%%%%%%%%%

Again, this is much simplified; I've omitted the templates that do the actual tweaking.
Anyway, both versions of the identity transformation produce this (using Saxon):

%%%%%%%%%%%%%%%%%%%%%%%
<?xml version="1.0" encoding="UTF-8"?>
<article>
   <body>
      <p>Bla Bla Bla</p>
      <p>
         <underline>
            <italic>Bla</italic>
         </underline>, Bla Bla.</p>
   </body>
</article>
%%%%%%%%%%%%%%%%%%%%%%%

I can get rid of all the whitespace with indent="no", but that produces a rather unreadable file.
xsl:strip-space had no effect.

Maybe someone knows how to improve that step? Is there a way to convince an XSLT processor not to introduce newlines after certain tags? Something like treating each paragraph as a single unit.
Am I missing something?