Hi all,

One of the things I have planned for the near future is an optional hook that transforms the tokens seen by the main control routine, and it makes sense to present that now, before I actually make changes to the program. A bit of internal documentation has to come first.

Whenever TeX is ready to execute the next command, it runs the internal procedure get_x_token(). This procedure takes care of fetching the next token from the current input source (like a file or token list). get_x_token() also handles all expansion, so no macros or \if statements can come through.

A "token" basically consists of two integer values: a command code and a modifier (the modifier is actually called the character code, because it most often represents a character). For characters, the command code is the category code, and the modifier is the character number itself. For example, the letter "H" in a file becomes the token

    {cmd=11, chr=72}

Control sequences are likewise converted into two parts. In this case, the character code is used to distinguish similar primitives from each other. For example, \parindent is

    {cmd=79, chr=0}

and the other dimension parameters vary only in the chr, like \hoffset:

    {cmd=79, chr=18}

Of course the TeX source code uses aliases for the raw numbers; the actual source code is something closer to

    {cmd=assign_dimen, chr=par_indent_code}

(For those 'in the know': I am aware that I am being a bit too informal and oversimplifying, but it is hard enough to explain already.)

TeX next looks at the command code in the returned token, and jumps to a case statement with several hundred cases. There are cases for each of the command codes, and for some commands even different ones in each of the three processing modes: horizontal, vertical, and math.
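To make that representation a bit more tangible, here is a purely illustrative Lua sketch. The table layout and the helper name char_token are my own inventions; only the cmd/chr numbers come from the examples above.

```lua
-- Purely illustrative: a toy model of TeX's two-integer tokens as Lua
-- tables.  The numbers (catcode 11 for letters, cmd 79 for dimension
-- assignments) match the examples in the text; the rest is made up.

-- A character token: cmd is the category code, chr the character
-- number.  We simply assume catcode 11 (letter) here.
local function char_token(c)
  return { cmd = 11, chr = string.byte(c) }
end

-- Control-sequence tokens: cmd selects the primitive family, chr
-- distinguishes similar primitives from each other.
local assign_dimen = 79
local parindent = { cmd = assign_dimen, chr = 0 }   -- \parindent
local hoffset   = { cmd = assign_dimen, chr = 18 }  -- \hoffset

local h = char_token("H")
print(h.cmd, h.chr)   -- 11      72
```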
(\parindent in horizontal mode behaves differently from \parindent in vertical mode.)

The program code inside each of these case statements takes care of its own argument reading when needed, so that each command is processed by the main control function as a whole. There is also one very special exception: if the command code indicates that a character is to be typeset while the processing mode is already horizontal, the program will jump to a special 'main_loop' case, where it will keep treating tokens as if they were arguments of a fictional 'main_loop' command until the next command is no longer a to-be-typeset character. Only then will it jump back to the beginning of the large case statement. For example, the input

    \parindent 10pt Hello world \par

executes the following case statements:

    {vertical mode: \parindent}     % the " 10pt " is read elsewhere
    {the letter H}                  % still in vertical mode
    {horizontal mode: the letter H} % the 'main_loop' reads "ello"
    {blank space }
    {the letter w}                  % the 'main_loop' reads "orld"
    {blank space }
    {\par}

So much for the current state of affairs. My goal for the new luaTeX extension is twofold.

One: eliminate the main_loop tricks. There is the speed-optimization trick that makes a character treat all immediately following characters as arguments, as well as the programming-logic trick that makes it read a character twice just to switch from vertical mode to horizontal mode. After these are folded back in, the case statement list would look like this:

    {vertical mode: \parindent}
    {the letter H}
    {the letter e}
    {the letter l}
    {the letter l}
    {the letter o}
    {blank space }
    {the letter w}
    {the letter o}
    {the letter r}
    {the letter l}
    {the letter d}
    {blank space }
    {\par}

This has to be done with great care, because the main_loop also takes care of any OTP processing and ligature building.

Two: allow Lua code to mutate the output of get_x_token(), before the case decision is made.
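Just to spell out goal one, a throwaway Lua snippet that builds the flattened "after" list for the character part of the example. This simulates nothing of the real engine; it only enumerates the stream of cases, one per character.

```lua
-- Toy illustration only: produce the per-character case list that the
-- main control routine would see once the main_loop tricks are gone.
local function case_list(input)
  local cases = {}
  for c in input:gmatch(".") do
    if c == " " then
      cases[#cases + 1] = "{blank space }"
    else
      cases[#cases + 1] = "{the letter " .. c .. "}"
    end
  end
  return cases
end

for _, case in ipairs(case_list("Hello world")) do
  print(case)   -- {the letter H} ... {the letter d}, one line per case
end
```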
Initially, the function will be called with the token from get_x_token() as its argument, represented as a small Lua table. The function should return either a Lua table representing a to-be-processed token, or nothing at all (nil). If it returns nothing, it will immediately be called again, with yet another token from get_x_token() as argument, until it eventually does return a token. If the function does return a new token, that token will be processed in the case statement, and afterwards the function will be called again, but now without an argument. This is repeated until it stops returning tokens; then processing reverts to the first branch, where the function is fed fresh tokens again. The point behind this roundabout calling convention is that it allows the Lua function to delete, insert, or buffer tokens. That in turn should make it possible to replace OTPs.

Best, Taco
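P.S. To make the calling convention concrete, here is a small Lua simulation. All names here (filter, get_x_token as a plain Lua function, the driver loop) are stand-ins of my own, not the actual interface; the example hook buffers tokens in pairs, which exercises both the "return nil to ask for more" and the "keep returning tokens when called without argument" branches.

```lua
-- Simulated input side: get_x_token() just pulls from a fixed list.
local stream = { { chr = 10 }, { chr = 20 }, { chr = 30 }, { chr = 40 } }
local pos = 0
local function get_x_token()
  pos = pos + 1
  return stream[pos]          -- nil once the input is exhausted
end

-- Example hook: buffer tokens two at a time, then release them.
-- Returning nil on a with-argument call means "give me another token"
-- (a filter that always did this would delete tokens); returning a
-- token on a no-argument call means "insert this one too".
local buffer = {}
local function filter(tok)
  if tok ~= nil then
    buffer[#buffer + 1] = tok
    if #buffer < 2 then return nil end  -- swallow it for now
  end
  return table.remove(buffer, 1)        -- nil when the buffer is empty
end

-- Simulated engine side: the two-branch driving loop described above.
local processed = {}
local done = false
while not done do
  local tok = nil
  while tok == nil do                   -- branch 1: feed fresh tokens
    local t = get_x_token()
    if t == nil then done = true; break end
    tok = filter(t)
  end
  while tok ~= nil do                   -- branch 2: process, then re-ask
    processed[#processed + 1] = tok.chr -- "the big case statement"
    tok = filter()                      -- called without an argument
  end
end

-- All four tokens come out, in order, after being buffered pairwise.
print(table.concat(processed, ","))     -- 10,20,30,40
```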
Taco Hoekwater