Hi all,

One of the things I have planned for the near future is an optional hook that transforms the tokens seen by the main control routine, and it makes sense to present that now, before I actually make changes to the program. A bit of internal documentation has to come first.

Whenever TeX is ready to execute the next command, it runs the internal procedure get_x_token(). This procedure takes care of fetching the next token from the current input source (like a file or token list). get_x_token() also handles all expansion, so no macros or \if statements can come through.

A "token" basically consists of two integer values: a command code and a modifier (the modifier is actually called the character code, because it most often represents a character). For characters, the command code is the category code, and the modifier is the character number itself. For example, the letter "H" in a file becomes the token

    {cmd=11, chr=72}

Control sequences are likewise converted into two parts. In this case, the character code is used to distinguish similar primitives from each other. For example, \parindent is

    {cmd=79, chr=0}

and the other dimension parameters vary only in the chr, like \hoffset:

    {cmd=79, chr=18}

Of course the TeX source code uses aliases for the raw numbers; the actual source code is something closer to

    {cmd=assign_dimen, chr=par_indent_code}

(For those 'in the know': I am aware that I am being a bit too informal and oversimplifying, but it is hard enough to explain already.)

TeX next looks at the command code in the returned token, and jumps to a case statement with several hundred cases. There are cases for each of the command codes, and for some commands even different ones in each of the three processing modes: horizontal, vertical, and math.
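To make that representation a bit more tangible, here is a purely illustrative Lua sketch. The table layout and the helper name char_token are my own inventions; only the cmd/chr numbers come from the examples above.

```lua
-- Purely illustrative: a toy model of TeX's two-integer tokens as Lua
-- tables.  The numbers (catcode 11 for letters, cmd 79 for dimension
-- assignments) match the examples in the text; the rest is made up.

-- A character token: cmd is the category code, chr the character
-- number.  We simply assume catcode 11 (letter) here.
local function char_token(c)
  return { cmd = 11, chr = string.byte(c) }
end

-- Control-sequence tokens: cmd selects the primitive family, chr
-- distinguishes similar primitives from each other.
local assign_dimen = 79
local parindent = { cmd = assign_dimen, chr = 0 }   -- \parindent
local hoffset   = { cmd = assign_dimen, chr = 18 }  -- \hoffset

local h = char_token("H")
print(h.cmd, h.chr)   -- 11      72
```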
(\parindent in horizontal mode behaves differently from \parindent in vertical mode.)

The program code inside each of these case statements takes care of its own argument reading when needed, so that each command is processed by the main control function as a whole. There is also one very special exception: if the command code indicates that a character is to be typeset while the processing mode is already horizontal, the program will jump to a special 'main_loop' case, where it will keep treating tokens as if they were arguments of a fictional 'main_loop' command until the next command is no longer a to-be-typeset character. Only then will it jump back to the beginning of the large case statement. For example, the input

    \parindent 10pt Hello world \par

executes the following case statements:

    {vertical mode: \parindent}     % the " 10pt " is read elsewhere
    {the letter H}                  % still in vertical mode
    {horizontal mode: the letter H} % the 'main_loop' reads "ello"
    {blank space }
    {the letter w}                  % the 'main_loop' reads "orld"
    {blank space }
    {\par}

So much for the current state of affairs. My goal for the new luaTeX extension is twofold.

One: eliminate the main_loop tricks. There is the speed-optimization trick that makes a character treat all immediately following characters as arguments, as well as the programming-logic trick that makes it read a character twice just to switch from vertical mode to horizontal mode. After these are folded back in, the case statement list would look like this:

    {vertical mode: \parindent}
    {the letter H}
    {the letter e}
    {the letter l}
    {the letter l}
    {the letter o}
    {blank space }
    {the letter w}
    {the letter o}
    {the letter r}
    {the letter l}
    {the letter d}
    {blank space }
    {\par}

This has to be done with great care, because the main_loop also takes care of any OTP processing and ligature building.

Two: allow Lua code to mutate the output of get_x_token(), before the case decision is made.
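Just to spell out goal one, a throwaway Lua snippet that builds the flattened "after" list for the character part of the example. This simulates nothing of the real engine; it only enumerates the stream of cases, one per character.

```lua
-- Toy illustration only: produce the per-character case list that the
-- main control routine would see once the main_loop tricks are gone.
local function case_list(input)
  local cases = {}
  for c in input:gmatch(".") do
    if c == " " then
      cases[#cases + 1] = "{blank space }"
    else
      cases[#cases + 1] = "{the letter " .. c .. "}"
    end
  end
  return cases
end

for _, case in ipairs(case_list("Hello world")) do
  print(case)   -- {the letter H} ... {the letter d}, one line per case
end
```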
Initially, the function will be called with the token from get_x_token() as its argument, represented as a small Lua table. The function should return either a Lua table representing a to-be-processed token, or nothing at all (nil). If it returns nothing, it will immediately be called again, with yet another token from get_x_token() as argument, until it eventually does return a token. If the function does return a new token, that token will be processed in the case statement, and afterwards the function will be called again, but now without an argument. This is repeated until it stops returning tokens; then processing reverts to the first branch, where the function is fed fresh tokens again. The point behind this roundabout calling convention is that it allows the Lua function to delete, insert, or buffer tokens. That in turn should make it possible to replace OTPs.

Best, Taco
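P.S. To make the calling convention concrete, here is a small Lua simulation. All names here (filter, get_x_token as a plain Lua function, the driver loop) are stand-ins of my own, not the actual interface; the example hook buffers tokens in pairs, which exercises both the "return nil to ask for more" and the "keep returning tokens when called without argument" branches.

```lua
-- Simulated input side: get_x_token() just pulls from a fixed list.
local stream = { { chr = 10 }, { chr = 20 }, { chr = 30 }, { chr = 40 } }
local pos = 0
local function get_x_token()
  pos = pos + 1
  return stream[pos]          -- nil once the input is exhausted
end

-- Example hook: buffer tokens two at a time, then release them.
-- Returning nil on a with-argument call means "give me another token"
-- (a filter that always did this would delete tokens); returning a
-- token on a no-argument call means "insert this one too".
local buffer = {}
local function filter(tok)
  if tok ~= nil then
    buffer[#buffer + 1] = tok
    if #buffer < 2 then return nil end  -- swallow it for now
  end
  return table.remove(buffer, 1)        -- nil when the buffer is empty
end

-- Simulated engine side: the two-branch driving loop described above.
local processed = {}
local done = false
while not done do
  local tok = nil
  while tok == nil do                   -- branch 1: feed fresh tokens
    local t = get_x_token()
    if t == nil then done = true; break end
    tok = filter(t)
  end
  while tok ~= nil do                   -- branch 2: process, then re-ask
    processed[#processed + 1] = tok.chr -- "the big case statement"
    tok = filter()                      -- called without an argument
  end
end

-- All four tokens come out, in order, after being buffered pairwise.
print(table.concat(processed, ","))     -- 10,20,30,40
```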
Taco Hoekwater