# [Dev-luatex] plugin for external formatting

Hans Hagen pragma at wxs.nl
Sun Sep 25 20:28:10 CEST 2005

Karel Skoup wrote:

>Hi all,
>
>On Wed, 21 Sep 2005, 10:52:08, Hans Hagen wrote:
>
>
>>>OK, but that won't bring much, just some funny shapes.
>>>
>>>
>>>
>>>
>>sure, but on the other hand, it can be used to 'replace' the current par
>>builder by a more advanced (e.g. hyphenation) one, imagine that we have:
>>
>>\paroutput
>> {write list to file (or pipe)
>>  call plugin in one-paragraph mode
>>  read list from file (or pipe)}
>>
>>that way we can replace the current par builder, because by default it's
>>something equivalent to:
>>
>>\paroutput{\scanlist\expandafter{\the\list255}}
>>
>>
>
>Sorry, I don't understand this very well. \list# is what we want to
>have, right? And \scanlist is a primitive of eTeX? I'm not used to it,
>but I believe that the way \scanlist works is the best fit for
>TeX macro programming. Where can I see any examples to understand it
>better?
>
>
no, both 'list' things are what taco described as 'to do' (there is a
scantokens in etex but it's kind of broken)

in the example above (cf. taco's previous mail):

\the\list<number>      : serializes a list
\scanlist{general text}  : 'compiles' a serialized list

the main idea i wanted to introduce was the concept of a paragraph
output routine

concerning \scantokens (etex) .. this is a different animal,

\def\pqr{pqr} \edef\abc{def \string\pqr}

now \abc is a sequence of plain character tokens (the \pqr in it is no longer a control sequence)

\scantokens\expandafter{\abc}

this gives "def pqr" because the string'd \pqr is tokenized again

>Then, \paroutput should be analogous to \output, right?  Then
>\scanlist\expandafter{\the\list255} would just put the paragraph on the
>
>
indeed

>current list, right? But where is the paragraph broken by the default
>algorithm? If it is before activating \paroutput, then it is too late to
>rebreak (some information is lost and it would be a waste of processor time
>anyway) so it can be only afterwards, but how would it recognize whether
>the list needs breaking by the default breaker or not?
>
>
this is still open ... maybe we should keep a copy of the original
input, i don't know how complex that is, but the list *before* it gets
broken, the raw data that enters the par builder, so maybe we should have

\parmode=0   : current behaviour
\parmode=1   : current behaviour but stop at par building, save list in list255
\parmode=2   : current behaviour but do par building, save list in list255

so, with parmode=1, \the\list255 would provide you the raw list, unbroken
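
One possible reading of the three modes, as a minimal Python sketch. This is an interpretation, not a specification: the function name `end_paragraph`, the `break_paragraph` callback, and the assumption that modes 1 and 2 append nothing themselves (leaving that to \paroutput) are all hypothetical.

```python
def end_paragraph(raw_list, parmode, break_paragraph):
    """Return (material appended directly, contents saved in list255).

    Assumption: in modes 1 and 2 nothing is appended directly; the
    saved list is left for a \\paroutput-like routine to deal with.
    """
    if parmode == 0:
        # Current behaviour: break the paragraph, nothing is saved.
        return break_paragraph(raw_list), None
    if parmode == 1:
        # Stop before par building: list255 holds the raw, unbroken list.
        return [], list(raw_list)
    if parmode == 2:
        # Do par building, but save the result instead of appending it.
        return [], break_paragraph(raw_list)
    raise ValueError(f"unknown parmode: {parmode}")


def toy_breaker(nodes):
    """Stand-in line breaker: two nodes per line (illustration only)."""
    return [nodes[i:i + 2] for i in range(0, len(nodes), 2)]
```

With parmode=1 the second element of the result is the raw data that would otherwise enter the par builder.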

>Anyway, \output and ending a paragraph are not analogous because \output
>is asynchronous and \par is synchronous (\paroutput would be activated
>by \par?). Or do you think that \paroutput should be just for one line
>when considering a break (that would be terribly complicated).
>
>
no, just a way to grab a paragraph and feed it to an external process
(or to handle it in tex, whatever that means since the only thing in tex
that we can then do is \the\list -)
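
The "write list, call plugin, read list" round trip could look roughly like this from the outside. It is a sketch only: the serialized format is not defined yet, `run_paragraph_plugin` is a made-up name, and the stand-in plugin simply echoes its input.

```python
import subprocess
import sys


def run_paragraph_plugin(serialized_list, command):
    """Send one serialized paragraph to an external process via a pipe
    and return whatever the process writes back."""
    result = subprocess.run(
        command,
        input=serialized_list,
        capture_output=True,
        text=True,
        check=True,  # raise if the plugin exits with an error
    )
    return result.stdout


# Stand-in plugin (assumption): copies stdin to stdout unchanged, so the
# paragraph comes back exactly as it was sent.
IDENTITY_PLUGIN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read())",
]

if __name__ == "__main__":
    raw = "\\tenrm F\n\\kern-0.83334\n\\tenrm r\n"
    print(run_paragraph_plugin(raw, IDENTITY_PLUGIN), end="")
```

A real plugin would replace the identity command with a program that rebreaks the list before writing it back.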

>Sorry, I'm confused, but maybe I just take this line too seriously :-)
>
>
right but it's no problem since we need to explore the idea

>>i wonder how hard this is to implement, you and taco should know -)
>>
>>
>
>Of course handling just one paragraph externally must be much
>easier than handling several ones, especially because nothing
>substantial in the TeX model must be changed (one atomic operation would be
>replaced by another one on the same data).
>
>But for my research it has almost no value, I really need to work with
>whole chunks and layout chains if I want to achieve anything
>interesting.
>
>
i know, but if we have the 'simple one paragraph' one, we can already do
a lot of experiments; the next step would be to define higher level
things (not a paragraph but a sequence of areas to fill etc etc)

>
>
>>>Not only page crossing, but also column/shape/container crossing ...
>>>The problem is that we are used to \parshape, which just specifies
>>>something for certain lines in the current paragraph. But if we want to
>>>introduce real page layouts, then the shapes are not relative to the
>>>paragraphs any more. It will be a matter of formatting where a
>>>particular paragraph starts in the layout.
>>>
>>>
>>>
>>>
>>>
>>it's a combination:
>>
>>- a main gutter shape (can be columns or whatever)
>>- shapes bound to places on the gutter
>>- shapes bound to specific places in the stream
>>- shapes that may float (within boundary conditions)
>>
>>
>
>That's right.
>
>
>
>>>But concerning the metric files, if I want to treat hyphenation locally,
>>>then I also need the kerning and ligature programs. In TeX it is done
>>>too early (and then it is taken apart and (wrongly) reconstructed during
>>>hyphenation pass). I want to do ligatures and kernings on demand,
>>>basically after hyphenation (it's not that simple, but anyway).
>>>
>>>
>>>
>>>
>>how about a font daemon, that one could cache/access font files; we need
>>to go open type anyway so maybe such a daemon can be built on top of
>>existing (non tex) libraries (port 31415)
>>
>>
>
>A real daemon or a C library linked to the application (like kpathsea)?
>Well, there should be a library which provides a simple interface to a
>client in any case; that library can (optionally) communicate with a
>daemon, which is how client-daemon setups usually work.
>
>
>
>>-)
>>
>>that's indeed too hard-coded for our purpose, so, next to a font daemon,
>>we need a hyphenation daemon
>>
>>
>
>Whether it accesses the files or communicates with a daemon is an internal
>matter of the module implementing the interface.
>
>
>
ok

>>>Maybe we should make a whole new glossary, for example 'node' is quite
>>>OK for everything in the list (char, box, glue, penalty, ...), but 'list'
>>>is so ambiguous, there should be something more specific (maybe 'node
>>>list'). TeX itself doesn't give clear names (classes) for those objects.
>>>I had to make them names in NTS (to name the classes), maybe we can look
>>>into it.
>>>
>>>
>>>
>>good idea; we indeed need to define proper names and descriptions; can
>>you make a proposal for that based on your nts experiences?
>>
>>
>
>Well, the whole problem is that there are no properly defined
>data structures with unique type names in TeX (just some all-purpose
>data structures and macros which one must be extremely careful with). Then
>there is no explicit need for unique names of different data types.
>So in NTS, I had to make names for types which are no explicit types in
>TeX.
>
>The two most outstanding examples are:
>* NodeList (nothing new, just the 'node' is always explicit)
>* Builder (currently built vertical/horizontal/... list)
>
>
>
it's probably the builder that needs to get an alternative; something
like parlists and/or shapelists, and since it then spawns the task to the
plugin .. well, the plugin will have its own data structures, so from
that perspective we can keep tex's node list (input for plugin) as well
as  vertical and horizontal lists (output of plugin) and forget about
the rest

>>>It works for English (does it really always ?), because it is simple,
>>>right? I don't know, whether it is a real problem in any other language
>>>in practice. I just know the code and I think that it is incorrect,
>>>inconsistent and illogical.
>>>
>>>
>>>
>>>
>>my impression is that the number of missed/wrong cases for english is so
>>small that it falls within the 'no problem to correct it manually'
>>criteria; languages with compound words, accented characters etc have
>>higher demands
>>
>>
>
>It's maybe not a real problem in practice but it was the biggest pain to
>reimplement in NTS. It could be actually easier to do it in the 'right
>way'. So if I have to do it next time I don't want to repeat the same
>annoyance.
>
>On Wed, 21 Sep 2005, 10:59:33, Taco Hoekwater wrote:
>
>
>>>For me the fully restorable read syntax is very important (can I get all
>>>
>>>
>> [...]
>>
>>I believe all extra parameters had better be in-line, for optimal
>>flexibility. As much as possible, at least:  some information is
>>irretrievably lost in current TeX.
>>
>>Quite a lot can be solved by adding a new read syntax for character
>>and language nodes, one that does not depend on font and language
>>id numbers. It'll be rather verbose and a tad slow, that is the
>>price you pay for extra flexibility.
>>
>>
>
>Well, inline is convenient to import but might be an overkill if there
>is a lot of info (like font file name and 'at size') for each character.
>That can be saved by using references, i.e. outputting a table first and
>then referencing the ids from the table. For scanning it's not much more
>work but for writing it requires a pass which collects the table from
>the references in the data first.
>
>
indeed. some reference is needed; it also makes scanning the result more
efficient since the refs are already known then
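
The table-plus-references idea can be sketched as a two-pass writer. The `font`/`char` line format and the function name `serialize_with_font_table` are invented for illustration; nothing here is an agreed syntax.

```python
def serialize_with_font_table(nodes):
    """nodes: list of (font_name, size_pt, char) tuples.

    Returns text lines: a font table first, then character nodes that
    refer to the fonts by table id instead of repeating name and size.
    """
    # Pass 1: collect the distinct fonts, numbered in order of first use.
    font_ids = {}
    for font_name, size_pt, _char in nodes:
        font_ids.setdefault((font_name, size_pt), len(font_ids))

    # The table goes first, so a reader knows every reference in advance.
    lines = [
        f"font {fid} {name} at {size}pt"
        for (name, size), fid in font_ids.items()
    ]

    # Pass 2: emit the nodes, each carrying only a font id.
    lines += [
        f"char {font_ids[(name, size)]} {char}"
        for name, size, char in nodes
    ]
    return lines
```

The extra collecting pass is the cost on the writing side; on the reading side every id is already resolved by the time the first character node arrives.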

>
>
>>>But concerning the metric files, if I want to treat hyphenation locally,
>>>then I also need the kerning and ligature programs. In TeX it is done
>>>too early (and then it is taken apart and (wrongly) reconstructed during
>>>hyphenation pass). I want to do ligatures and kernings on demand,
>>>basically after hyphenation (it's not that simple, but anyway).
>>>
>>>
>>In current TeX, it is not done too early: ligkerns can influence which
>>line breaks are chosen, so the ligkern programs have to be applied
>>first thing.
>>
>>
>
>Yes, I know, I wrote that it's not that simple. IMO the ligkerns
>should be considered many times but the final modification of the data
>should be late. I would postpone it until output and ask dynamically
>(with maybe some caching) each time it is needed (getting sizes, ...).
>This approach would need to represent ligature prevention ({}) explicitly
>as a node.
>
>
>
a new kind of node indeed

>>>NO. It screws up everything, not only taken or potential breaks, but
>>>even the potential hyphenation points which are never considered a
>>>break.
>>>
>>>
>>It does all potential hyphenation points, but that is still a subset
>>of all hyphenation points: absolutely impossible points are ignored
>>(like in the middle of the first line). At least, that's what Knuth's
>>web comments say, and note that is not a feature of the algorithm,
>>only an optimization.
>>
>>
>
>I forgot about the first line, but is there anything else?
>
>
>
>>Perhaps just a little, but you have a valid case ;-)
>>
>>
>
>Well, we can stop it here and make a unique thread if it is ever needed.
>
>
>
>>>right? I don't know, whether it is a real problem in any other language
>>>in practice. I just know the code and I think that it is incorrect,
>>>inconsistent and illogical.
>>>
>>>
>>It is also near-impossible to fix while maintaining compatibility,
>>which is probably why no-one has seriously attempted to clean up
>>the code, up-til-now.
>>
>>
>
>Sure, I originally wanted to do it in the 'right way' in NTS, but then I
>realized that was impossible while keeping compatibility and it was a
>real pain to reimplement it.
>
>On Wed, 21 Sep 2005, 11:26:53, Taco Hoekwater wrote:
>
>
>>Hans Hagen wrote:
>>
>>
>>>so what happens if you remove the optimizations (forget about 100%
>>>compatibility)
>>>
>>>
>>Probably (hopefully) nothing except some bloat in the data structure,
>>but I won't take bets on that.
>>
>>
>
>But the only optimization is not changing the ligkerns in the first line
>of the paragraph while hyphenating, right? Then removing that would be
>even worse, but the difference is so small, anyway; this optimization is
>really not a problem.
>
>
>
>>>>It is also near-impossible to fix while maintaining compatibility,
>>>>which is probably why no-one has seriously attempted to clean up
>>>>the code, up-til-now.
>>>>
>>>>
>>>but we don't care much about that part of compatibility, do we?
>>>
>>>
>>Nah. (but it was a big issue for etex, nts, and pdftex-in-dvi mode)
>>
>>
>
>Exactly, it was a nightmare for me.
>
>On Wed, 21 Sep 2005, 06:25:50, Thanh Han The wrote:
>
>
>>My first thought is that some small modifications to
>>\showlist and \showbox will help a lot. It's easy to write
>>additional info like dimensions of each item in the list, or
>>in case of characters the filename of a tfm with fontsize
>>(or we may write the dimensions of each char as Hans
>>suggested, but this is an overkill IMHO).
>>
>>
>
>If I get the dimensions of characters explicitly, then I don't need to
>access/know the metric files. But this changes if I want to handle
>the hyphenation locally (which seems like the only way). Then I need
>also the ligkerns so I would either also need them explicitly (I mean
>the ligkern programs) -- that would be quite complicated to export (and
>import) -- or I would need to access the metric files anyway.
>
>
is it really a program or just a list of char combinations representing
ligs?
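
As far as I know, a TFM file stores the lig/kern data as small instruction sequences per character, but for simple cases it behaves like a lookup on character pairs. A minimal sketch of applying such a pair table on demand, with an explicit ligature-prevention marker, might look like this. The table entries, glyph names and the `NOLIG` marker are illustrative assumptions, not data read from a real font.

```python
# Marker for an explicit "no ligature/kern here" node (the {} case).
NOLIG = None

# Hypothetical, simplified pair table:
# (left, right) -> ("lig", replacement glyph) or ("kern", amount in pt)
LIGKERN = {
    ("f", "f"): ("lig", "ff"),
    ("f", "i"): ("lig", "fi"),
    ("ff", "i"): ("lig", "ffi"),
    ("F", "r"): ("kern", -0.83334),
}


def apply_ligkern(glyphs, table=LIGKERN):
    """Apply pair rules left to right; NOLIG blocks the pair it sits in."""
    out = []
    for glyph in glyphs:
        if glyph is NOLIG:
            out.append(NOLIG)  # keeps the next glyph from seeing its neighbour
            continue
        rule = None
        if out and isinstance(out[-1], str):
            rule = table.get((out[-1], glyph))
        if rule is None:
            out.append(glyph)
        elif rule[0] == "lig":
            out[-1] = rule[1]  # the new ligature may combine again later
        else:
            out.append(("kern", rule[1]))
            out.append(glyph)
    # The markers have done their job; drop them from the result.
    return [item for item in out if item is not NOLIG]
```

For example, `apply_ligkern(list("office"))` builds the `ffi` ligature in two steps, while `apply_ligkern(["f", NOLIG, "i"])` leaves the two characters apart.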

>Therefore the explicit char dimensions seem like a temporary solution
>only and I don't think it's worth doing.
>
>
>
>>My feeling is that we need to work out the specification and
>>format of the ``node list'' first. In the first step, I
>>would prefer to have only node-specific things, eg only what
>>comes out after a box construction. I also got a similar
>>request: to provide a primitive that writes out the content
>>of a box and another primitive to re-construct that box back
>>extensions later on.
>>
>>
>
>Well, I think that the \showlists output contains everything except
>the reliable font id (and the language id?) and it is parseable. Well,
>the syntax could be slightly changed to make it more compatible with the
>input syntax (or maybe it can be really written in the input syntax) or
>to be better parseable by a plugin but the information carried by the
>syntax is the real matter.
>
>If I have all referenced fonts explicitly defined at the beginning (with
>maybe some renaming of the font ids when conflicts arise (can it happen?))
>then I'm happy.
>
>So with the current syntax it would be something like:
>
>\tenrm=select font cmr10.
>\twelveit=select font cmti10 at 12.0pt.
>\hbox(6.94444+1.94444)x435.9297, glue set 318.73502fil
>.\hbox(0.0+0.0)x0.0
>.\tenrm F
>.\kern-0.83334
>.\tenrm r
>.\tenrm e
>.\tenrm e
>.\tenrm -
>.\discretionary
>.\tenrm s
>.\tenrm h
>.\tenrm a
>.\tenrm p
>.\kern0.27779
>.\tenrm e
>.\glue 3.33333 plus 1.66666 minus 1.11111
>.\twelveit t
>.\twelveit e
>.\twelveit x
>.\twelveit t
>
>or in the input syntax:
>
>\font\tenrm=cmr10 at 10.0pt
>\font\twelveit=cmti10 at 12.0pt
>\hbox to 435.9297pt {%
> \hbox{}%
> \tenrm
> F\kern -0.83334pt r{}e{}e%
>[...]
> \twelveit
> t{}e{}x{}t%
>
>It seems that the input syntax would have to prevent the normal ligkern
>building, that would be quite awkward.
>
>
>
you mean

.\twelveit t
.\nolig
.\twelveit e
.\nolig
.\twelveit x
.\nolig
.\twelveit t
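
To get a feeling for how parseable the \showbox-like syntax is for a plugin, here is a small Python sketch that turns single lines of the listing above into tuples. It only covers the node kinds that appear in this mail; \nolig is the hypothetical node from the lines above, and infinite glue units such as fil are not handled.

```python
import re

# Patterns for the line kinds seen in the example listing.
KERN_RE = re.compile(r"^\\kern\s*(-?\d+(?:\.\d+)?)$")
GLUE_RE = re.compile(
    r"^\\glue (-?\d+(?:\.\d+)?)"
    r"(?: plus (-?\d+(?:\.\d+)?))?"
    r"(?: minus (-?\d+(?:\.\d+)?))?$"
)
CHAR_RE = re.compile(r"^\\([A-Za-z]+) (\S)$")


def parse_node_line(line):
    """Return (depth, kind, payload) for one line of the listing.

    depth is the number of leading dots, i.e. the nesting level.
    """
    depth = len(line) - len(line.lstrip("."))
    body = line[depth:]
    if body.startswith("\\hbox"):
        return (depth, "hbox", body)  # box header kept as raw text
    if body == "\\discretionary":
        return (depth, "discretionary", None)
    if body == "\\nolig":  # hypothetical ligature-prevention node
        return (depth, "nolig", None)
    match = KERN_RE.match(body)
    if match:
        return (depth, "kern", float(match.group(1)))
    match = GLUE_RE.match(body)
    if match:
        width, stretch, shrink = (float(g) if g else 0.0 for g in match.groups())
        return (depth, "glue", (width, stretch, shrink))
    match = CHAR_RE.match(body)
    if match:
        return (depth, "char", (match.group(1), match.group(2)))
    raise ValueError(f"unrecognized node line: {line!r}")
```

Character lines carry only the font identifier here, which is exactly the part that needs the font definitions at the top of the listing to be reliable.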

>So maybe some customary syntax in between.
>
>
>
>>At the moment I cannot see clearly what is needed, but I am
>>willing to write some extensions so that we can experiment with
>>them to see what is really needed and perhaps change what has been done.
>>
>>
>
>I can even play myself and then send a patch (as was suggested). I only
>need to install the right sources, I'll be grateful for pointing me to
>them and telling me any building tricks if needed.
>
>
Hans

-----------------------------------------------------------------