[NTG-context] String substitution using regular expressions and backreferences
Hans Hagen
j.hagen at freedom.nl
Fri Aug 26 09:34:08 CEST 2022
On 8/25/2022 9:44 PM, Thangalin via ntg-context wrote:
> I've attempted to apply Wolfgang's subtle suggestion of using Lua to parse
> the input document using a regular expression via lpeg.replacer. The
> replacement itself works fine; however, in doing so the XML document
> structure is converted to text, which means that it is no longer possible
> to "flush" the XML for further processing as XML. The result is that any
> unresolved XML tags are written verbatim to the PDF:
>
> https://i.stack.imgur.com/9ZFND.png
>
> There are two other issues with this approach. First is efficiency. Second
> is that the processing function would have to be called for every XML
> element to capture the replacement.
>
> My original post asked about applying regex word substitution in a ConTeXt
> way, such as:
>
> \definereplacement[SubstMac][ match={Mc([A-Z].*)}, replace={\Mac \\1} ]
> \definereplacement[SubstPostmeridian][ match={[Pp]\\.[Mm]\\.}, replace={\cap{pm}} ]
>
> That seems like the cleanest approach because it would work on top of XML
> or any other source document. Nevertheless, here is what I tried, which
> partially works:
>
> \startbuffer[main]
> <html>
> <p>“Mr. McAnulty, I presume?”</p>
> <p>Regular text. <em>Irregular text.</em></p>
> </html>
> \stopbuffer
> \startxmlsetups xml:xhtml
> \xmlsetsetup{\xmldocument}{*}{-}
> \xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
> \stopxmlsetups
> \startxmlsetups xml:html
> \startdocument
> \xmlflush{#1}
> \stopdocument
> \stopxmlsetups
> % Paragraphs are followed by a paragraph break, but only if not nested.
> \startxmlsetups xml:p
> \xmlfunction{#1}{p}
> \par
> \stopxmlsetups
> \startxmlsetups xml:em
> \dontleavehmode{\em\xmlflush{#1}}
> \stopxmlsetups
> \startluacode
> function xml.functions.p( t )
> rep = { [1] = { "McAnulty", "\\Mac Anulty" } }
> x = lpeg.replacer( rep ):match( tostring( xml.text( t ) ) )
>
> buffers.assign( "p", context( x ) )
> context.getbuffer{ "p" }
> end
> \stopluacode
> \xmlregistersetup{xml:xhtml}
> \def\Mac{%
> % Determine the sizes of 'M' and 'c'.
> \newbox\MacMBox%
> \setbox\MacMBox\hbox{M}%
> \newbox\MacCBox%
> \setbox\MacCBox\hbox{c}%
> %
> % Cheat to dynamically derive the kerning size by putting Mc in a box.
> %
> \newbox\MacKernBox%
> \setbox\MacKernBox\hbox{\inframed[offset=\zeropoint, width=fit]{Mc}}%
> \def\MacDelta{\dimexpr\wd\MacKernBox-\wd\MacMBox-\wd\MacCBox\relax}%
> \def\MacUWidth{\dimexpr\wd\MacCBox-.75\MacDelta\relax}%
> \def\MacRule{\vrule width \MacUWidth height .04em depth \zeropoint \relax}%
> \def\MacKern{\dimexpr\wd\MacKernBox-\wd\MacMBox-\wd\MacCBox\relax}%
> \def\MacHeight{\dimexpr\ht\MacMBox-\ht\MacCBox\relax}%
> %
> % Write Mc, where c has a macron, to the document.
> %
> M{%
> \dontleavehmode{\raisebox{\MacHeight}\hbox{c}}%
> \kern-1.04\MacUWidth
> \MacRule
> \kern.08\MacUWidth
> }%
> }%
> \xmlprocessbuffer{main}{main}{}
>
> As shown in the screen shot, this doesn't correctly handle nested XML
> elements.
>
> Any ideas on what approach to take to perform a string replacement in
> ConTeXt?
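Best stay at the xml end: do the replacement on the strings inside the tree, so that nested elements remain elements and can still be flushed: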
\startbuffer[main]
<html>
<p>“Mr. McAnulty, I presume?”</p>
<p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer
\startxmlsetups xml:xhtml
\xmlsetsetup{\xmldocument}{*}{-}
\xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups
\startxmlsetups xml:html
\xmlflush{#1}
\stopxmlsetups
\startxmlsetups xml:p
\xmlfunction{#1}{p}
\xmlcontext{#1}
\par
\stopxmlsetups
\startxmlsetups xml:em
\dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups
\startluacode
local rep = lpeg.replacer { [1] = { "McAnulty", "\\Mac Anulty" } }
function xml.functions.p(t)
    local dt = t.dt
    for i=1,#dt do
        local di = dt[i]
        if type(di) == "string" then
            dt[i] = lpeg.match(rep,di)
        end
    end
end
\stopluacode
\xmlregistersetup{xml:xhtml}
\startdocument
\xmlprocessbuffer{main}{main}{}
\stopdocument
But this is more fun and probably also more reliable:
\startbuffer[main]
<html>
<p>“Mr. McAnulty, I presume?”</p>
<p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer
\startxmlsetups xml:xhtml
\xmlsetsetup{\xmldocument}{*}{-}
\xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups
\startxmlsetups xml:html
\xmlflush{#1}
\stopxmlsetups
\startxmlsetups xml:p
\xmlcontext{#1}
\par
\stopxmlsetups
\startxmlsetups xml:em
\dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups
\xmlregistersetup{xml:xhtml}
\usemodule[gimmicks] % in latest uploads
\chardef\MacAnulty = \getprivateglyphslot{MacAnulty}
\startsetups [box:mcanulty:\number\MacAnulty]
\Mac Anulty
\stopsetups
\registerboxglyph category {mcanulty} unicode \MacAnulty \relax
\startluacode
fonts.handlers.otf.addfeature {
    name    = "mcanulty",
    type    = "ligature",
    nocheck = true,
    data    = {
        [fonts.constructors.privateslots.MacAnulty] = {
            "M", "c", "A", "n", "u", "l", "t", "y",
        },
    }
}
\stopluacode
\definefontfeature[default][default][box=mcanulty,mcanulty=yes]
\startdocument
\xmlprocessbuffer{main}{main}{}
\stopdocument
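For the Lua route in the first example, the backreference from the original question (Mc followed by a capitalized name) can be approximated with a Lua pattern capture instead of a fixed word list. What follows is only a minimal sketch in plain Lua, runnable outside ConTeXt: the mock node merely imitates a dt array that mixes strings and child elements, the tg field is there for illustration, and the names substitute and replaceintree are hypothetical. It uses string.gsub rather than lpeg.replacer, and it recurses into child elements, which the loop in the first example does not do.

```lua
-- Standalone sketch in plain Lua; no ConTeXt libraries are used.

-- Hypothetical helper: rewrite any "Mc" + capitalized name as "\Mac " + name.
-- %f[%a] is a frontier pattern, so "Mc" has to start a word; (%u%a*) captures
-- the rest of the name and %1 refers back to that capture.
local function substitute(str)
    -- The parentheses drop gsub's second return value (the match count).
    return (str:gsub("%f[%a]Mc(%u%a*)", "\\Mac %1"))
end

-- Walk a data table: rewrite string children in place and recurse into
-- element children, so the tree structure itself is never flattened.
local function replaceintree(node)
    local dt = node.dt
    for i = 1, #dt do
        local di = dt[i]
        if type(di) == "string" then
            dt[i] = substitute(di)
        elseif type(di) == "table" and di.dt then
            replaceintree(di)
        end
    end
end

-- Mock <p> element with a nested <em> element (layout is an assumption).
local p = {
    tg = "p",
    dt = {
        "Mr. McAnulty met ",
        { tg = "em", dt = { "Ms. McBride" } },
        " at the McDonald farm.",
    },
}

replaceintree(p)

print(p.dt[1])        -- Mr. \Mac Anulty met
print(p.dt[2].dt[1])  -- Ms. \Mac Bride
print(p.dt[3])        -- " at the \Mac Donald farm." (printed without quotes)
```

Inside ConTeXt the same loop body could presumably sit in xml.functions.p, with \xmlcontext doing the flushing as in the first example; that part is untested here.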
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------