String substitution using regular expressions and backreferences
Hi list,

I'm looking to perform text replacements.

\definereplacement[SubstPostmeridian][
  match={[Pp].[Mm].},
  replace={\cap{pm}}
]

The \replaceword command doesn't handle periods well. The translate module doesn't seem flexible enough to cover edge cases. Consider the following example document containing both sample inputs and sample outputs:

\starttext
{\bf Markdown Input}

Our grandmother clock rang 11 p.m. and we fled.

Our grandmother clock rang 11 p.m., so we fled.

Our grandmother clock rang 11 p.m. We fled.

\blank[big]

{\bf \ConTeXt{} Output}

Our grandmother clock rang 11 \cap{pm} and we fled.

Our grandmother clock rang 11 \cap{pm}, so we fled.

Our grandmother clock rang 11 \cap{pm}. We fled.
\stoptext

It would be most convenient to write:

% Strip periods from p.m.
\definereplacement[SubstPostmeridianLowercase][
  match={[Pp].[Mm]. ([^:upper:])},
  replace={\cap{pm} \1}
]

% Preserve terminal period for p.m. (e.e. cummings notwithstanding)
\definereplacement[SubstPostmeridianTerminal][
  match={[Pp].[Mm]. ([:upper:])},
  replace={\cap{pm}. \1}
]

% Apply a macron for lowercase 'c' (McAnulty, McGenius, etc.)
% Well, not quite a macron: https://tex.stackexchange.com/q/364024/2148
\definereplacement[SubstMac][
  match={Mc([:upper:]\w)},
  replace={M\macronbelow{c}\1}
]

The \1 may be problematic. Other sigils include $1 and #1, which may also have issues.

Thank you!
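For readers who want to see the requested behaviour concretely, here is a minimal sketch in plain Lua of the three p.m. cases from the sample document. It uses only the standard string library; the function name substitute_pm is hypothetical and is not a ConTeXt command, and the hypothetical \definereplacement interface is not implemented here. A p.m. at the very end of the string is not treated specially.

```lua
-- Hypothetical helper (not a ConTeXt API): rewrite "p.m." as \cap{pm},
-- following the three sample sentences in the message above.
local function substitute_pm(s)
  -- "p.m." followed by whitespace and an uppercase letter ends a sentence:
  -- keep a single terminal period. %f[%a] anchors at the start of a word.
  s = s:gsub("%f[%a][Pp]%.[Mm]%.(%s+%u)", "\\cap{pm}.%1")
  -- Every remaining "p.m." loses both periods.
  s = s:gsub("%f[%a][Pp]%.[Mm]%.", "\\cap{pm}")
  return s
end

local samples = {
  "Our grandmother clock rang 11 p.m. and we fled.",
  "Our grandmother clock rang 11 p.m., so we fled.",
  "Our grandmother clock rang 11 p.m. We fled.",
}

for _, line in ipairs(samples) do
  print(substitute_pm(line))
end
```

The sentence-ending rule runs first so that the general rule cannot strip the period that should survive.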
Thangalin via ntg-context wrote on 01.08.2022 at 21:58:
Hi list,
I'm looking to perform text replacements.
Please don't omit important information. On TeX SE you mentioned your input is XML, which means a lot more can be done than your simple TeX-based example demonstrates.

Wolfgang
Good point, Wolfgang. The Markdown is translated to XHTML, then typeset as XML using the setups listed here:

https://github.com/DaveJarvis/keenwrite-themes/tree/main/xhtml

Having an XML string replacement solution would be great. I suppose that would help prevent substitutions within pre and code blocks, too, wouldn't it?
I've attempted to apply Wolfgang's subtle suggestion of using Lua to parse the input document using a regular expression via lpeg.replacer. The replacement itself works fine; however, in doing so the XML document structure is converted to text, which means that it is no longer possible to "flush" the XML for further processing as XML. The result is that any unresolved XML tags are written verbatim to the PDF:

https://i.stack.imgur.com/9ZFND.png

There are two other issues with this approach. First is efficiency. Second is that the processing function would have to be called for every XML element to capture the replacement.

My original post asked about applying regex word substitution in a ConTeXt way, such as:

\definereplacement[SubstMac][
  match={Mc([A-Z].*)},
  replace={\Mac \\1}
]

\definereplacement[SubstPostmeridian][
  match={[Pp]\\.[Mm]\\.},
  replace={\cap{pm}}
]

That seems like the cleanest approach because it would work on top of XML or any other source document. Nevertheless, here is what I tried, which partially works:

\startbuffer[main]
<html>
<p>“Mr. McAnulty, I presume?”</p>
<p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer

\startxmlsetups xml:xhtml
  \xmlsetsetup{\xmldocument}{*}{-}
  \xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups

\startxmlsetups xml:html
  \startdocument
  \xmlflush{#1}
  \stopdocument
\stopxmlsetups

% Paragraphs are followed by a paragraph break, but only if not nested.
\startxmlsetups xml:p
  \xmlfunction{#1}{p}
  \par
\stopxmlsetups

\startxmlsetups xml:em
  \dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups

\startluacode
function xml.functions.p( t )
  rep = {
    [1] = { "McAnulty", "\\Mac Anulty" }
  }

  x = lpeg.replacer( rep ):match( tostring( xml.text( t ) ) )

  buffers.assign( "p", context( x ) )
  context.getbuffer{ "p" }
end
\stopluacode

\xmlregistersetup{xml:xhtml}

\def\Mac{%
  % Determine the sizes of 'M' and 'c'.
  \newbox\MacMBox%
  \setbox\MacMBox\hbox{M}%
  \newbox\MacCBox%
  \setbox\MacCBox\hbox{c}%
  %
  % Cheat to dynamically derive the kerning size by putting Mc in a box.
  %
  \newbox\MacKernBox%
  \setbox\MacKernBox\hbox{\inframed[offset=\zeropoint, width=fit]{Mc}}%
  \def\MacDelta{\dimexpr\wd\MacKernBox-\wd\MacMBox-\wd\MacCBox\relax}%
  \def\MacUWidth{\dimexpr\wd\MacCBox-.75\MacDelta\relax}%
  \def\MacRule{\vrule width \MacUWidth height .04em depth \zeropoint \relax}%
  \def\MacKern{\dimexpr\wd\MacKernBox-\wd\MacMBox-\wd\MacCBox\relax}%
  \def\MacHeight{\dimexpr\ht\MacMBox-\ht\MacCBox\relax}%
  %
  % Write Mc, where c has a macron, to the document.
  %
  M{%
    \dontleavehmode{\raisebox{\MacHeight}\hbox{c}}%
    \kern-1.04\MacUWidth
    \MacRule
    \kern.08\MacUWidth
  }%
}%

\xmlprocessbuffer{main}{main}{}

As shown in the screenshot, this doesn't correctly handle nested XML elements.

Any ideas on what approach to take to perform a string replacement in ConTeXt?

Thanks again!
On 8/25/2022 9:44 PM, Thangalin via ntg-context wrote:
Any ideas on what approach to take to perform a string replacement in ConTeXt?

Best stay at the xml end ...
\startbuffer[main]
<html>
<p>“Mr. McAnulty, I presume?”</p>
<p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer

\startxmlsetups xml:xhtml
    \xmlsetsetup{\xmldocument}{*}{-}
    \xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups

\startxmlsetups xml:html
    \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:p
    \xmlfunction{#1}{p}
    \xmlcontext{#1}
    \par
\stopxmlsetups

\startxmlsetups xml:em
    \dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups

\startluacode
local rep = lpeg.replacer {
    [1] = { "McAnulty", "\\Mac Anulty" }
}

function xml.functions.p(t)
    local dt = t.dt
    for i=1,#dt do
        local di = dt[i]
        if type(di) == "string" then
            dt[i] = lpeg.match(rep,di)
        end
    end
end
\stopluacode

\xmlregistersetup{xml:xhtml}

\startdocument
    \xmlprocessbuffer{main}{main}{}
\stopdocument

But this is more fun and probably also more reliable:

\startbuffer[main]
<html>
<p>“Mr. McAnulty, I presume?”</p>
<p>Regular text. <em>Irregular text.</em></p>
</html>
\stopbuffer

\startxmlsetups xml:xhtml
    \xmlsetsetup{\xmldocument}{*}{-}
    \xmlsetsetup{\xmldocument}{html|p|em}{xml:*}
\stopxmlsetups

\startxmlsetups xml:html
    \xmlflush{#1}
\stopxmlsetups

\startxmlsetups xml:p
    \xmlcontext{#1}
    \par
\stopxmlsetups

\startxmlsetups xml:em
    \dontleavehmode{\em\xmlflush{#1}}
\stopxmlsetups

\xmlregistersetup{xml:xhtml}

\usemodule[gimmicks] % in latest uploads

\chardef\MacAnulty = \getprivateglyphslot{MacAnulty}

\startsetups [box:mcanulty:\number\MacAnulty]
    \Mac Anulty
\stopsetups

\registerboxglyph category {mcanulty} unicode \MacAnulty \relax

\startluacode
fonts.handlers.otf.addfeature {
    name    = "mcanulty",
    type    = "ligature",
    nocheck = true,
    data    = {
        [fonts.constructors.privateslots.MacAnulty] = {
            "M", "c", "A", "n", "u", "l", "t", "y",
        },
    }
}
\stopluacode

\definefontfeature[default][default][box=mcanulty,mcanulty=yes]

\startdocument
    \xmlprocessbuffer{main}{main}{}
\stopdocument

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
       tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------
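The first solution above rewrites the string entries of the paragraph's dt table in place, and as written it appears to visit only the direct children of p. Below is a minimal sketch, in plain Lua outside ConTeXt, of how the same idea might be extended to nested elements while leaving pre and code untouched. The mock tree, the tg field used as the tag name, the skip set, and the helper names replace_text and walk are assumptions made for illustration; string.gsub stands in for lpeg.replacer so the sketch runs in a stock Lua interpreter.

```lua
-- Elements whose text should be left alone (assumed tag names).
local skip = { pre = true, code = true }

-- Literal search/replace pairs, mirroring the table passed to lpeg.replacer.
local replacements = {
  { "McAnulty", "\\Mac Anulty" },
}

-- Hypothetical helper: apply every pair to one text string.
local function replace_text(s)
  for _, pair in ipairs(replacements) do
    -- Escape punctuation so the search string is matched literally.
    local literal = pair[1]:gsub("%p", "%%%0")
    -- A function replacement keeps '%' in the result from being interpreted.
    s = s:gsub(literal, function() return pair[2] end)
  end
  return s
end

-- Hypothetical helper: rewrite text in an element and all of its
-- descendants, except inside skipped elements.
local function walk(element)
  if skip[element.tg] then
    return
  end
  local dt = element.dt
  for i = 1, #dt do
    local di = dt[i]
    if type(di) == "string" then
      dt[i] = replace_text(di)
    elseif type(di) == "table" and di.dt then
      walk(di)
    end
  end
end

-- Mock paragraph: text, a nested <em>, more text, and a <code> element.
local p = {
  tg = "p",
  dt = {
    "Mr. McAnulty met ",
    { tg = "em", dt = { "McAnulty Jr." } },
    " and typed ",
    { tg = "code", dt = { "McAnulty" } },
  },
}

walk(p)
```

Because only string entries are rewritten, the element tables stay intact, which is the property that lets the XML still be flushed afterwards.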
participants (3)
- Hans Hagen
- Thangalin
- Wolfgang Schuster