[Dev-luatex] .BEILMOPS or how I stopped worrying and love Open Source

Jonathan Sauer Jonathan.Sauer at silverstroke.com
Tue Dec 11 09:06:04 CET 2007


Okay, this is weird (and long), but it gets clearer near the end:


The following test, which uses lpeg to split a comma-separated list of
values, works perfectly (never mind that subpattern B does not do
anything):

----------------------------------------------------------------------

% A:
% \catcode`\:=11
% \def\FM:ifFileIncluded#1{\message{whatever}}
% 
% \FM:ifFileIncluded{}

\directlua0{\unexpanded{
	whiteSpace = lpeg.S(" \t\n")
	splitComma = lpeg.P({
		lpeg.Ct(lpeg.V("elem") * (lpeg.V("sep") * lpeg.V("elem"))^0),
		
		sep =	lpeg.S(",{}"),
		elem =	whiteSpace^0 *
				lpeg.C((1 - lpeg.V("sep"))^1) *
				whiteSpace^0, % B
	})
}}

\def\splitComma#1{%
	\directlua0{%
		local s = '\luaescapestring{\unexpanded{#1}}'
		local t = lpeg.match(splitComma,s)
		for k,v in ipairs(t) do
			texio.write_nl('[' .. v .. ']')
		end
	}%
}

\splitComma{A, B, C, D, E, F}

% `print' is not documented, but prints a compiled pattern's bytecode
% to the console
\directlua0{lpeg.print(splitComma)}

\end

-----------------------------------------------------------------------

The pattern created is:

[1 = elem  2 = sep  3 = elem  4 = sep  ]
00: call -> 2
01: jmp -> 51
02: opencapture table(n = 0) (0)
03: call -> 10
04: choice -> 8 (0)
05: call -> 41
06: call -> 10
07: partial_commit -> 5
08: closecapture close(n = 0) (0)
09: ret 
10: span [(09-0a)(20)]
19: opencapture simple(n = 0) (0)
20: choice -> 23 (0)
21: call -> 41
22: failtwice 
23: any * 1
24: choice -> 30 (0)
25: choice -> 28 (0)
26: call -> 41
27: failtwice 
28: any * 1
29: partial_commit -> 25
30: closecapture close(n = 0) (0)
31: span [(09-0a)(20)]
40: ret 
41: set [(2c)(7b)(7d)]
50: ret 
51: end 


However, if I run this with my own format, the pattern is:

[1 = elem  2 = sep  3 = elem  4 = sep  ]
00: call -> 2
01: jmp -> 51
02: opencapture table(n = 0) (0)
03: call -> 10
04: choice -> 8 (0)
05: call -> 41
06: call -> 10
07: partial_commit -> 5
08: closecapture close(n = 0) (0)
09: ret 
10: span [(0a)(20)(2e)(42)(45)(49)(4c-4d)(4f-50)(53)]
19: opencapture simple(n = 0) (0)
20: choice -> 23 (0)
21: call -> 41
22: failtwice 
23: any * 1
24: choice -> 30 (0)
25: choice -> 28 (0)
26: call -> 41
27: failtwice 
28: any * 1
29: partial_commit -> 25
30: closecapture close(n = 0) (0)
31: span [(0a)(20)(2e)(42)(45)(49)(4c-4d)(4f-50)(53)]
40: ret 
41: set [(2c)(7b)(7d)]
50: ret 
51: end 

So instead of checking for whitespace in instructions 10 and 31, input
is checked against the character set "\n .BEILMOPS" (and splitting the
list fails almost completely). I know what a "Beil" is (an axe), and I
know what a "Mops" is (some kind of weird animal, a bit like a
groundhog [think Bill Murray]), but what is a "Beilmops"? And a "dot
Beilmops"? Is it written in C# or what?
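
For anyone who wants to check the hex codes without an ASCII table at
hand, here is a small sketch in plain Lua (no lpeg needed). The helper
"decodeSet" is my own hypothetical name, not part of lpeg; it only
parses the textual form that lpeg.print happens to emit above, and
assumes that form is "(xx)" for single characters and "(xx-yy)" for
ranges.

```lua
-- Hypothetical helper: turn a set description as printed by lpeg.print,
-- e.g. "(0a)(20)(4c-4d)", into the string of characters it contains.
local function decodeSet(desc)
	local chars = {}
	-- Each group is "(lo)" or "(lo-hi)", both in hexadecimal.
	for lo, hi in desc:gmatch("%((%x+)%-?(%x*)%)") do
		local first = tonumber(lo, 16)
		local last = (hi ~= "") and tonumber(hi, 16) or first
		for code = first, last do
			chars[#chars + 1] = string.char(code)
		end
	end
	return table.concat(chars)
end

-- The set from the working run and the set from my own format:
local good = decodeSet("(09-0a)(20)")
local bad = decodeSet("(0a)(20)(2e)(42)(45)(49)(4c-4d)(4f-50)(53)")

-- Show the newline as a visible escape when printing.
print((bad:gsub("\n", "\\n")))
```

The first set decodes to tab, newline and space, as intended; the
second one is the strange animal.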

Now you'll say "yeah, sure, who cares what weird stuff that weird
Jonathan does in that weird format of his", but: This only happens if
a macro "\FM:ifFileIncluded" is defined or referenced. If this macro
is called "\FM:ifFileLoadeded" (the same length) or
"\FM:ifFileIlcluded" ("l" instead of "n"), the pattern is compiled
correctly. "\FM:ifFileLncluded" works, too. But the moment a control
sequence called "\FM:ifFileIncluded" is used (defined or referenced),
the lpeg pattern contains that strange animal.

I tried to use Lua state 1 instead of 0 to make sure there were no
definitions that could create a side-effect, but the pattern remained
the same.

I tried to uncomment (A) in the above PlainTeX code, but the pattern is
still correct. So I first suspected some kind of overflow in TeX's
hash table that only occurs when there already exist a lot of control
sequences and one of them has a very specific name and thus hash
value (this does not seem to be the case, though).

I tried moving the definition of "\FM:ifFileIncluded" to the beginning
of my format (right after setting the catcodes), but without success.
I tried defining it in the PlainTeX format, but again to no avail. I
tried removing all unnecessary files from my format, with the same
result.

More weirdness that I more or less accidentally stumbled upon:

	\directlua1{\unexpanded{lpeg.print(lpeg.S(" \t\n"))}}

results in

	set [(0a)(20)(2e)(42)(45)(49)(4c-4d)(4f-50)(53)]

which is wrong, but

	\directlua1{\detokenize{lpeg.print(lpeg.S(" \t\n"))}}

results in

	set [(09-0a)(20)]
	
which is correct. And the equivalent, but slightly longer

	\directlua1{lpeg.print(lpeg.S(" \string\t\string\n"))}

again results in the correct

	set [(09-0a)(20)]

And indeed: If I replace "\unexpanded" in the code example above by
"\detokenize" (and remove the empty lines which for some reason result
in "\par" when "\detokenize" is used, but not with "\unexpanded"), the
pattern is compiled correctly and works as expected.

The plot thickens. Let's look further; how about simply telling Lua to
print the string " \t\n"?

	\directlua1{\detokenize{texio.write_nl("[ \t\n]")}}

results in

	[        
	 ]

but

	\directlua1{\unexpanded{texio.write_nl("[ \t\n]")}}

results in

	[ IMPOSSIBLE.
	 ]

Not so weird anymore: "IMPOSSIBLE." is printed by procedure "print_cs"
in luatex.web if the control sequence's pointer is below
"active_base", that is zero or negative, or >= the pointer to the
undefined control sequence (at least as far as I understand it).

And "IMPOSSIBLE." sorted and stripped of duplicates is ...
".BEILMOPS"!
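
This is easy to verify in plain Lua. The sketch below assumes that
what Lua actually receives in place of " \t\n" is the string
" IMPOSSIBLE.\n" (that is, "\t" replaced by the error text); the
function name "sortedUnique" is just my own illustration. Since
lpeg.S builds a set, order and duplicates do not matter, which is what
the sorting and deduplication mimic.

```lua
-- Return the distinct characters of s, sorted by byte value.
local function sortedUnique(s)
	local seen, chars = {}, {}
	for c in s:gmatch(".") do
		if not seen[c] then
			seen[c] = true
			chars[#chars + 1] = c
		end
	end
	-- Compare byte values explicitly so the result is locale-independent.
	table.sort(chars, function(a, b) return a:byte() < b:byte() end)
	return table.concat(chars)
end

local word = sortedUnique("IMPOSSIBLE.")

-- Assumption: this is the string lpeg.S sees instead of " \t\n".
local set = sortedUnique(" IMPOSSIBLE.\n")

print(word)
```

The second result is exactly the character set found in instructions
10 and 31 of the broken pattern.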

Also note that "\t" is defined in PlainTeX, but not in my format. If I
define it at the beginning of the code example above, nothing changes.
But if I define it before defining "\FM:ifFileIncluded", everything
works as expected, and

	\directlua1{\unexpanded{texio.write_nl("[ \t\n]")}}

results in

	[        
	 ]

Not "IMPOSSIBLE." anymore.

Now: If the control sequence passed to "print_cs" (or
"tokenlist_to_cstring" in luatoken.c) is undefined, "IMPOSSIBLE." is
printed. As "\t" is indeed undefined, this is completely expected.
What's not expected is that this only happens if the macro
"\FM:ifFileIncluded" is defined before "\t" is (if "\t" is defined at
all).

And weird again: "\n" is defined by neither PlainTeX nor my format,
but does not result in "IMPOSSIBLE.".

Side note:

	\immediate\write16{\detokenize{[ \t\n]}}
	\immediate\write16{\unexpanded{[ \t\n]}}

both correctly display "[ \t \n ]". So it seems that "\unexpanded" 
works as expected, but something else does not.

And finally: If I say "\let\t\t" at the beginning of my format,
everything works as well. So "\t" may well be undefined, as long as it
is entered into TeX's hash table before "\FM:ifFileIncluded" is.


Jonathan



