Dear list,

I have the following sample:

    \usemodule[scite]
    \starttext
    \startTEXpage[offset=1ex]
    \type[option=xml]{<ans/>}
    \type[option=xml]{<áñß/>}
    \stopTEXpage
    \stoptext

Using scite, I don’t get the second element right.

Without scite, both elements are displayed right.

In both Geany and Notepad++ (which use Scintilla internally), the two elements are displayed right.

Could anyone confirm the issue?

Many thanks for your help,

Pablo
Am 01.06.22 um 18:47 schrieb Pablo Rodriguez via ntg-context:
Dear list,
I have the following sample:
    \usemodule[scite]
    \starttext
    \startTEXpage[offset=1ex]
    \type[option=xml]{<ans/>}
    \type[option=xml]{<áñß/>}
    \stopTEXpage
    \stoptext
Using scite, I don’t get the second element right.
Without scite, both elements are displayed right.
In both Geany and Notepad++ (which use Scintilla internally), the two elements are displayed right.
Could anyone confirm the issue?
Hi Pablo,

with LMTX version 2022.05.11, both elements are displayed, but the first in blue, the second in red. Apparently the scite highlighter doesn’t like non-ASCII characters in elements.

Hraban
On 6/1/22 18:58, Henning Hraban Ramm via ntg-context wrote:
Am 01.06.22 um 18:47 schrieb Pablo Rodriguez via ntg-context:
[...] Could anyone confirm the issue?
Hi Pablo,
with LMTX version 2022.05.11, both elements are displayed, but the first in blue, the second in red. Apparently the scite highlighter doesn’t like non-ASCII characters in elements.
Hi Hraban,

this is exactly what I’m experiencing (and sorry, I forgot to mention that I was using current latest).

I experienced that without scite and Hans fixed it (in buff-imp-xml.lua).

I mentioned both Geany and Notepad++, because I think it may not be an issue outside ConTeXt.

But I don’t know which file deals with it (so I could try to submit a patch).

Many thanks for your help,

Pablo
Am 01.06.22 um 19:45 schrieb Pablo Rodriguez via ntg-context:
But I don’t know which file deals with it (so I could try to submit a patch).
That would be

    texmf-context/tex/context/modules/mkiv/m-scite.mkiv

and

    texmf-context/context/data/scite/context/lexers/scite-context-lexer-xml.lua

and there probably

    local name = (R("az","AZ","09") + S("_-."))^1

Now, I still don’t understand LPEG and don’t know if there’s a general “character” class that doesn’t need a list...

Hraban
Now, I still don’t understand LPEG and don’t know if there’s a general “character” class that doesn’t need a list...
Well looking through the XML spec

    https://www.w3.org/TR/REC-xml/#NT-NameChar

you'd think that we'd want a pattern like this:

    local name = (R("az","AZ","09",
        "\u{C0}\u{D6}", "\u{D8}\u{F6}", "\u{F8}\u{2FF}",
        "\u{370}\u{37D}", "\u{37F}\u{1FFF}", "\u{200C}\u{200D}",
        "\u{2070}\u{218F}", "\u{2C00}\u{2FEF}", "\u{3001}\u{D7FF}",
        "\u{F900}\u{FDCF}", "\u{FDF0}\u{FFFD}", "\u{10000}\u{EFFFF}",
        "\u{0300}\u{036F}", "\u{203F}\u{2040}")
        + S("_-.\u{B7}"))^1

But that doesn't work, since
The same is true for lpeg.R, although the latter will display an error message if used with multibyte characters. Therefore lpeg.R('aä') results in the message bad argument #1 to 'R' (range must have two characters), since to lpeg, ä is two ’characters’ (bytes), so aä totals three. (https://texdoc.org/serve/luatex/0##680)
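To make the quoted limitation concrete, here is a minimal sketch in plain Lua 5.3 or later. It does not use LPeg at all and only inspects the byte view of strings that LPeg also works on. The helper name `hexbytes` and the `two_byte` string pattern are illustrative assumptions, not code from the lexer.

```lua
-- Plain Lua 5.3 or later; strings are sequences of bytes.
local a_umlaut = "\u{E4}"            -- ä (U+00E4), the manual's example
local range    = "a" .. a_umlaut     -- the string passed to lpeg.R in the quote

print(#a_umlaut)          -- 2: one character, two bytes
print(utf8.len(a_umlaut)) -- 1: one code point
print(#range)             -- 3: not the two "characters" lpeg.R expects

-- hexbytes is a hypothetical helper that lists the bytes of a string.
local function hexbytes(s)
    local t = {}
    for i = 1, #s do
        t[#t + 1] = string.format("%02X", s:byte(i))
    end
    return table.concat(t, " ")
end

local sample = "\u{E1}\u{F1}\u{DF}"  -- áñß from the test file
print(hexbytes(sample))              -- C3 A1 C3 B1 C3 9F

-- Each character in the sample is a lead byte (0xC2-0xDF) followed by
-- one continuation byte (0x80-0xBF). Any byte-oriented matcher has to
-- describe non-ASCII characters as such sequences, not as single items.
local two_byte = "[\xC2-\xDF][\x80-\xBF]"
print(select(2, sample:gsub(two_byte, "")))  -- 3
```

The byte values printed for the sample are the same ones visible in the xxd dump of test.tex further down in this thread (c3a1 c3b1 c39f).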
The easiest way that I found was to just cheat and use everything with a TeX catcode 11 ("letters"):

    local name = (R("az","AZ","09") + S("_-.") + lpeg.utfchartabletopattern(characters.csletters))^1

This isn't strictly speaking correct, but I think that it's close enough. It seems to work correctly for Pablo's initial example, but it may break something else.

-- Max

    diff --git a/texmf-context/context/data/scite/context/lexers/scite-context-lexer-xml.original b/texmf-context/context/data/scite/context/lexers/scite-context-lexer-xml.lua
    index e635d40..97de3fd 100644
    --- a/texmf-context/context/data/scite/context/lexers/scite-context-lexer-xml.original
    +++ b/texmf-context/context/data/scite/context/lexers/scite-context-lexer-xml.lua
    @@ -41,7 +41,7 @@ local semicolon = P(";")
     local equal = P("=")
     local ampersand = P("&")
    -local name = (R("az","AZ","09") + S("_-."))^1
    +local name = (R("az","AZ","09") + S("_-.") + lpeg.utfchartabletopattern(characters.csletters))^1
     local openbegin = P("<")
     local openend = P("")
     local closebegin = P("/>") + P(">")
On 6/1/22 23:58, Max Chernoff via ntg-context wrote:
Now, I still don’t understand LPEG and don’t know if there’s a general “character” class that doesn’t need a list...
Many thanks for your reply, Hraban.
The easiest way that I found was to just cheat and use everything with a TeX catcode 11 ("letters"):
local name = (R("az","AZ","09") + S("_-.") + lpeg.utfchartabletopattern(characters.csletters))^1
Many thanks for your reply, Max,

I’m afraid I cannot make your proposed fix work.

For the sake of consistency (with buff-imp-xml.lua), I think the patch should read (also attached to the message to avoid wrong line breaking):

    --- scite-context-lexer-xml.lua 2022-06-01 17:24:38.625976000 +0200
    +++ context/tex/texmf-context/context/data/scite/context/lexers/scite-context-lexer-xml.lua 2022-06-02 16:37:30.112824947 +0200
    @@ -13,7 +13,7 @@
     -- todo: parse entities in attributes
     local global, string, table, lpeg = _G, string, table, lpeg
    -local P, R, S, C, Cmt, Cp = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cmt, lpeg.Cp
    +local P, R, S, C, Cmt, Cp, lpatterns = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cmt, lpeg.Cp, lpeg.patterns
     local type = type
     local match, find = string.match, string.find
    @@ -41,7 +41,8 @@
     local equal = P("=")
     local ampersand = P("&")
    -local name = (R("az","AZ","09") + S("_-."))^1
    +local alsoname = lpatterns.utf8two + lpatterns.utf8three + lpatterns.utf8four
    +local name = (R("az","AZ","09") + S("_-.") + + alsoname)^1
     local openbegin = P("<")
     local openend = P("")
     local closebegin = P("/>") + P(">")

But I’m afraid I cannot make it work on my computer (Linux64).

On another Win64 computer, both patches worked perfectly fine.

Both machines run LMTX current latest. So I have an issue on my installation that I have to fix first.

Many thanks for your help,

Pablo
On 6/2/22 17:36, Pablo Rodriguez via ntg-context wrote:
On 6/1/22 23:58, Max Chernoff via ntg-context wrote:
local name = (R("az","AZ","09") + S("_-.") + lpeg.utfchartabletopattern(characters.csletters))^1
I’m afraid I cannot make your proposed fix work.
Even with a brand new install, neither of the two patches works for me.

I don’t know what I may be missing on my installation. Do you have any hint about what I am doing wrong?

Many thanks for your help,

Pablo
For the sake of consistency (with buff-imp-xml.lua), I think the patch should read [...] +local alsoname = lpatterns.utf8two + lpatterns.utf8three + lpatterns.utf8four
I think that that pattern is a little too broad, since it will match any non-ASCII Unicode character. Things like U+202E (xkcd.com/1137), U+00A0 (no-break space), etc are valid UTF-8 characters, but not valid XML tag names. Neither of these two characters are matched by the TeX catcode check. This doesn't make any real difference for a syntax highlighter though.
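The difference between "any non-ASCII character" and "a character the XML spec allows in a name" can be sketched in plain Lua 5.3 or later. This is only an illustration, not the lexer's code and not LPeg: the function names `is_namechar`, `is_strict_name` and `is_broad_name` are hypothetical, and the ranges are copied from the NameStartChar and NameChar productions of the XML 1.0 spec linked earlier. The strict check tests NameChar only (as the lexer's `name` pattern does) and does not enforce NameStartChar for the first character.

```lua
-- Code point ranges from the XML 1.0 NameStartChar and NameChar productions.
local namechar_ranges = {
    { 0x2D, 0x2E },                       -- "-" and "."
    { 0x30, 0x39 },                       -- 0-9
    { 0x3A, 0x3A },                       -- ":"
    { 0x41, 0x5A },                       -- A-Z
    { 0x5F, 0x5F },                       -- "_"
    { 0x61, 0x7A },                       -- a-z
    { 0xB7, 0xB7 },                       -- middle dot
    { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF },
    { 0x300, 0x36F }, { 0x370, 0x37D }, { 0x37F, 0x1FFF },
    { 0x200C, 0x200D }, { 0x203F, 0x2040 }, { 0x2070, 0x218F },
    { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF }, { 0xF900, 0xFDCF },
    { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF },
}

local function is_namechar(cp)
    for _, r in ipairs(namechar_ranges) do
        if cp >= r[1] and cp <= r[2] then
            return true
        end
    end
    return false
end

-- Strict: every code point must be an XML NameChar.
local function is_strict_name(s)
    if s == "" or not utf8.len(s) then   -- empty or invalid UTF-8
        return false
    end
    for _, cp in utf8.codes(s) do
        if not is_namechar(cp) then
            return false
        end
    end
    return true
end

-- Broad: ASCII must be in [A-Za-z0-9_.-]; any non-ASCII code point is accepted.
local function is_broad_name(s)
    if s == "" or not utf8.len(s) then
        return false
    end
    for _, cp in utf8.codes(s) do
        if cp < 0x80 and not utf8.char(cp):find("^[A-Za-z0-9_.%-]$") then
            return false
        end
    end
    return true
end

print(is_strict_name("\u{E1}\u{F1}\u{DF}"), is_broad_name("\u{E1}\u{F1}\u{DF}")) -- true  true
print(is_strict_name("a\u{A0}b"),   is_broad_name("a\u{A0}b"))                   -- false true
print(is_strict_name("a\u{202E}b"), is_broad_name("a\u{202E}b"))                 -- false true
```

For a syntax highlighter the broad check is usually good enough, since it only decides how a span is coloured and never validates the document.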
+local name = (R("az","AZ","09") + S("_-.") + + alsoname)^1
There's a doubled plus in the middle there. The patch works when I remove it.
But I’m afraid I cannot make it work on my computer (Linux64).
On another Win64 computer, both patches worked perfectly fine.
Hmm, that's really weird. Both patches work for me on my main Win64 computer (after I fixed the extra plus). I also pulled the "contextgarden/context:lmtx" Docker image (Debian sid), and both patches worked there too.

I get this from inside the container:

    root@e8d29a32595c:~# cat /etc/os-release
    PRETTY_NAME="Debian GNU/Linux bookworm/sid"
    NAME="Debian GNU/Linux"
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"

    root@e8d29a32595c:~# locale
    LANG=
    LANGUAGE=
    LC_CTYPE="POSIX"
    LC_NUMERIC="POSIX"
    LC_TIME="POSIX"
    LC_COLLATE="POSIX"
    LC_MONETARY="POSIX"
    LC_MESSAGES="POSIX"
    LC_PAPER="POSIX"
    LC_NAME="POSIX"
    LC_ADDRESS="POSIX"
    LC_TELEPHONE="POSIX"
    LC_MEASUREMENT="POSIX"
    LC_IDENTIFICATION="POSIX"
    LC_ALL=

    root@e8d29a32595c:~# xxd test.tex
    00000000: 5c75 7365 6d6f 6475 6c65 5b73 6369 7465  \usemodule[scite
    00000010: 5d0a 5c73 7461 7274 7465 7874 0a5c 7374  ].\starttext.\st
    00000020: 6172 7454 4558 7061 6765 5b6f 6666 7365  artTEXpage[offse
    00000030: 743d 3165 785d 0a5c 7479 7065 5b6f 7074  t=1ex].\type[opt
    00000040: 696f 6e3d 786d 6c5d 7b3c 616e 732f 3e7d  ion=xml]{<ans/>}
    00000050: 0a5c 7479 7065 5b6f 7074 696f 6e3d 786d  .\type[option=xm
    00000060: 6c5d 7b3c c3a1 c3b1 c39f 2f3e 7d0a 5c73  l]{<....../>}.\s
    00000070: 746f 7054 4558 7061 6765 0a5c 7374 6f70  topTEXpage.\stop
    00000080: 7465 7874 0a                             text

    root@e8d29a32595c:~# context --version
    mtx-context | ConTeXt Process Management 1.04
    mtx-context |
    mtx-context | main context file: [snip]
    mtx-context | current version: 2022.05.11 11:36
    mtx-context | main context file: [snip]
    mtx-context | current version: 2022.05.11 11:36

    ldd "$(type -p luametatex)"
    linux-vdso.so.1 (0x00007ffdbe9a5000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4b034d4000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4b034b3000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4b0336f000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4b03196000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f4b03a55000)

Is this perhaps a weird locale or encoding issue? Maybe try compiling with:

    LC_ALL=C.UTF-8 LANG=C.UTF-8 context test.tex

or

    LC_ALL=POSIX LANG=POSIX context test.tex

I'm surprised Linux is the one not working here, since it's usually Windows that has text encoding issues with its weird hybrid of DOS codepages and UTF-16+BOM.

The only other thing that I can think of is a weird library issue with your distro, but LuaMetaTeX is statically linked. Not sure what else to check here.

-- Max
On 6/3/22 00:52, Max Chernoff via ntg-context wrote:
For the sake of consistency (with buff-imp-xml.lua), I think the patch should read [...] +local alsoname = lpatterns.utf8two + lpatterns.utf8three + lpatterns.utf8four
I think that that pattern is a little too broad, since it will match any non-ASCII Unicode character. Things like U+202E (xkcd.com/1137), U+00A0 (no-break space), etc are valid UTF-8 characters, but not valid XML tag names. Neither of these two characters are matched by the TeX catcode check. This doesn't make any real difference for a syntax highlighter though.
Hi Max,

many thanks for your reply.

At best, the patch is only a suggestion, and Hans will merge whatever code he sees fit.
+local name = (R("az","AZ","09") + S("_-.") + + alsoname)^1
There's a doubled plus in the middle there. The patch works when I remove it.
I noticed it too just after sending the message to the list, but I had to solve the issue with my installation first.
But I’m afraid I cannot make it work on my computer (Linux64).
On another Win64 computer, both patches worked perfectly fine.
Hmm, that's really weird. Both patches work for me on my main Win64 computer (after I fixed the extra plus).
It was a stupid mistake on my side. The patch I sent before points to the error:

    --- scite-context-lexer-xml.lua 2022-06-01 17:24:38.625976000 +0200
    +++ context/tex/texmf-context/context/data/scite/context/lexers/scite-context-lexer-xml.lua 2022-06-02 16:37:30.112824947 +0200

I was compiling the sample file in the directory where the unmodified version of "scite-context-lexer-xml.lua" was running.

ConTeXt was reading the unmodified file and not the modified one, but that was all my fault.

Now I have to find a MWE for issues I’m experiencing with XML sources and using the scite module.

Many thanks for your help,

Pablo
participants (3)
- Henning Hraban Ramm
- Max Chernoff
- Pablo Rodriguez