Problem with Lua processing UTF8 substrings
Hello ConTeXists,

I want to use Lua to write out the individual characters (substrings) of a string, but I get an error message:

! String contains an invalid utf-8 sequence.

I tried various Lua functions for working with UTF-8 strings, for example:

string.subutf8(string, start[,end])
for i, char in str:nextutf8(orig_pos)
string.lenutf8(string)

but without success. Can someone please help?

Thanks
Jaroslav Hajtmar

Here is my minimal example:

\def\mymacro#1{\ctxlua{for i=1, string.len('#1') do context(string.sub('#1',i,i)..", ") end}}
\starttext
%\mymacro{šěřěžřýčřčžáýčý} % Here is a problem
\mymacro{asdfghjklqwertt} % Here is all OK
\stoptext
On 2012-02-01 20:26, Jaroslav Hajtmar wrote:
I want to use Lua to write characters (substrings) from a string, but I get an error message:
! String contains an invalid utf-8 sequence.
Can someone please help?
Have you tried the unicode library? The standard string library operates on bytes, therefore extracting a single byte yields an incomplete multibyte char if the codepoint is beyond ascii.

·································································
\def\mymacro#1{%
  \startluacode
    local utf = unicode.utf8
    local target = [==[\detokenize{#1}]==]
    for i=1, utf.len(target) do
      context(utf.sub(target,i,i)..", ")
    end
  \stopluacode%
}

%% alternatively, use utfcharacters
\define[1]\myothermacro{%
  \startluacode
    local result = { }
    for i in string.utfcharacters[==[\detokenize{#1}]==] do
      result[\letterhash result+1] = i
    end
    context(table.concat(result, ", "))
  \stopluacode
}

\starttext
  \mymacro{šěřěžřýčřčžáýčý}\par
  \myothermacro{šěřěžřýčřčžáýčý}
\stoptext
·································································

(Lazy people would just do a “local string = unicode.utf8” at the top of the file.)

Regards
Philipp
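The byte-versus-character point above can be checked in plain Lua, outside ConTeXt. What follows is only a minimal sketch and not code from this thread: the helper name `split_utf8` and the matching pattern are illustrative assumptions, and the input is assumed to be valid UTF-8. The test word is written with byte escapes so the result does not depend on how the file is saved.

```lua
-- Plain Lua sketch (no LuaTeX libraries needed); split_utf8 is a hypothetical helper.
-- A UTF-8 character is one lead byte followed by zero or more continuation
-- bytes (128-191). NUL is left out of the lead-byte class so the same
-- pattern also works on Lua 5.1.
local utf8_char = "[\1-\127\194-\244][\128-\191]*"

local function split_utf8(s)
  local chars = {}
  for ch in string.gmatch(s, utf8_char) do
    chars[#chars + 1] = ch
  end
  return chars
end

-- "šěř" spelled out as bytes: three characters, two bytes each
local word = "\197\161\196\155\197\153"
local chars = split_utf8(word)

print(string.len(word))            -- counts bytes: 6
print(#chars)                      -- counts characters: 3
print(string.byte(word, 1, 2))     -- 197 161, the two bytes that encode "š"
print(table.concat(chars, ", "))   -- whole characters, safe to pass on

-- string.sub(word, 1, 1) returns only the byte 197, half of "š",
-- which is the kind of fragment that triggers the invalid utf-8 error.
```

Inside ConTeXt the unicode.utf8 functions or string.utfcharacters shown above do the same job; the pattern here is just a way to see the effect with the standard string library alone.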
___________________________________________________________________________________ If your question is of interest to others as well, please add an entry to the Wiki!
maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context webpage : http://www.pragma-ade.nl / http://tex.aanhet.net archive : http://foundry.supelec.fr/projects/contextrev/ wiki : http://contextgarden.net ___________________________________________________________________________________
--
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
Hello Philipp,

Thanks very much for the very quick and perfect help. Is there a manual or other source where I can read up on this and similar information?

Thanks once more
Jaroslav Hajtmar
On 2012-02-01 21:17, Jaroslav Hajtmar wrote:
Hello Philipp. Thanx very much for very quick and perfect help. Is there any manual or source, where I can read these (and next and similar) information?
I’m sorry I have to disappoint you but the utf library is documented only in the source.[1] Luckily it covers all the functionality of the native string library, thus its usage should be equivalent except that it works for utf sequences as well. If you know some German there’s also a blog post by Patrick.[2]

The string.utfcharacters iterator is covered in luatexref-t.pdf.

Hope this helps
Philipp

[1] http://files.luaforge.net/releases/sln/slnunicode/1.1a
[2] http://www.luatex.de/2010/02/selene-unicode-bibliothek/?iframe=true
Thanks, Philipp! That is fine, I do not mind studying the code ...

Many thanks.
Jaroslav
participants (2): Jaroslav Hajtmar, Philipp Gesang