# [Dev-luatex] plugin for external formatting

Karel Skoupý skoupy at inf.ethz.ch
Sun Sep 25 06:22:48 CEST 2005

Hi all,

On Wed, 21 Sep 2005, 10:52:08, Hans Hagen wrote:
> >OK, but that won't bring much, just some funny shapes.
> >
> >
> sure, but on the other hand, it can be used to 'replace' the current par
> builder by a more advanced (e.g. hyphenation) one, imagine that we have:
>
> \paroutput
>  {write list to file (or pipe)
>   call plugin in one-paragraph mode
>   read list from file (or pipe)}
>
> that way we can replace the current par builder, because by default it's
> something equivalent to:
>
> \paroutput{\scanlist\expandafter{\the\list255}}

Sorry, I don't understand this very well. \list# is what we want to
have, right? And \scanlist is a primitive of eTeX? I'm not used to it,
but I believe that the way \scanlist works fits TeX macro programming
best. Where can I see some examples to understand it better?

Then, \paroutput should be analogous to \output, right?  Then
\scanlist\expandafter{\the\list255} would just put the paragraph on the
current list, right? But where is the paragraph broken by the default
algorithm? If it is before activating \paroutput, then it is too late to
rebreak (some information is lost, and it would be a waste of processor
time anyway), so it can only be afterwards. But how would it recognize
whether the list needs breaking by the default breaker or not?

Anyway, \output and ending a paragraph are not analogous because \output
is asynchronous and \par is synchronous (\paroutput would be activated
by \par?). Or do you think that \paroutput should handle just one line
when considering a break? That would be terribly complicated.

Sorry, I'm confused, but maybe I just take this line too seriously :-)

>
> i wonder how hard this is to implement, you and taco should know -)

Of course, handling just one paragraph externally must be much
easier than handling several, especially because nothing
substantial in the TeX model has to change (one atomic operation would
be replaced by another one on the same data).

But for my research it has almost no value; I really need to work with
whole chunks and layout chains if I want to achieve anything
interesting.

> >Not only page crossing, but also column/shape/container crossing ...
> >The problem is that we are used to \parshape, which just specifies
> >something for certain lines in the current paragraph. But if we want to
> >introduce real page layouts, then the shapes are not relative to the
> >paragraphs any more. It will be a matter of formatting where a
> >particular paragraph starts in the layout.
> >
> >
> >
> it's a combination:
>
> - a main gutter shape (can be columns or whatever)
> - shapes bound to places on the gutter
> - shapes bound to specific places in the stream
> - shapes that may float (within boundary condition)

That's right.

> >But concerning the metric files, if I want to treat hyphenation locally,
> >then I also need the kerning and ligature programs. In TeX it is done
> >too early (and then it is taken apart and (wrongly) reconstructed during
> >hyphenation pass). I want to do ligatures and kernings on demand,
> >basically after hyphenation (it's not that simple, but anyway).
> >
> >
> how about a font daemon, that one could cache/access font files; we need
> to go open type anyway so maybe such a daemon can be built on top of
> existing (non tex) libraries (port 31415)

A real daemon or a C library linked to the application (like kpathsea)?
Well, in any case there should be a library which provides a simple
interface to the client; that library can (optionally) communicate with
a daemon, which is how client-daemon setups usually work.

> -)
>
> that's indeed too hard-coded for our purpose, so, next to a font daemon,
> we need a hyphenation daemon

Whether it accesses the files or communicates with a daemon is an
internal matter of the module implementing the interface.
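
To make the point about the interface concrete, here is a minimal Python
sketch. All names in it (HyphenationBackend, DictBackend, DaemonBackend,
Hyphenator) are hypothetical and not taken from any existing library, the
word table is toy data, and the "daemon" is only a callable standing in
for a real connection. The client code is the same whichever backend is
plugged in.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class HyphenationBackend(ABC):
    """Hypothetical interface: the client never sees where the data comes from."""

    @abstractmethod
    def break_points(self, word: str) -> List[int]:
        """Return the indices in `word` before which a hyphen may be inserted."""


class DictBackend(HyphenationBackend):
    """Stands in for a backend that reads files itself (here: an in-memory table)."""

    def __init__(self, table: Dict[str, str]):
        self._table = table  # maps "word" -> "hy-phen-ated" form (toy data)

    def break_points(self, word: str) -> List[int]:
        hyphenated = self._table.get(word.lower(), word)
        points, pos = [], 0
        for ch in hyphenated:
            if ch == "-":
                points.append(pos)
            else:
                pos += 1
        return points


class DaemonBackend(HyphenationBackend):
    """Stands in for a backend that asks a daemon; `transport` is any
    callable taking the word and returning a reply such as "2 6"."""

    def __init__(self, transport: Callable[[str], str]):
        self._transport = transport

    def break_points(self, word: str) -> List[int]:
        return [int(tok) for tok in self._transport(word).split()]


class Hyphenator:
    """Client-side code: identical for both backends."""

    def __init__(self, backend: HyphenationBackend):
        self._backend = backend

    def hyphenate(self, word: str) -> str:
        points = set(self._backend.break_points(word))
        return "".join(("-" if i in points else "") + ch
                       for i, ch in enumerate(word))
```

Swapping DictBackend for DaemonBackend changes nothing for the caller of
Hyphenator, which is the property described above.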

> >Maybe we should make a whole new glossary, for example 'node' is quite
> >OK for everything in the list (char, box, glue, penalty, ...), but 'list'
> >is so ambiguous, there should be something more specific (maybe 'node
> >list'). TeX itself doesn't give clear names (classes) for those objects.
> >I had to make them names in NTS (to name the classes), maybe we can look
> >into it.
> >
> good idea; we indeed need to define proper names and descriptions; can
> you make a proposal for that based on your nts experiences?

Well, the whole problem is that there are no properly defined
data structures with unique type names in TeX (just some all-purpose
data structures and macros which one must be extremely careful with).
So there is no explicit need for unique names for the different data
types, and in NTS I had to make up names for types which are not
explicit types in TeX.

The two most outstanding examples are:
* NodeList (nothing new, just the 'node' is always explicit)
* Builder (currently built vertical/horizontal/... list)
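
As an illustration of what explicit type names buy, here is a small
Python sketch of the two notions. It is not the actual NTS class design;
only the names NodeList and Builder come from the list above, and the
fields and methods are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class CharNode:
    font: str   # font identifier, e.g. "tenrm"
    char: str


@dataclass(frozen=True)
class GlueNode:
    natural: float
    stretch: float = 0.0
    shrink: float = 0.0


@dataclass(frozen=True)
class PenaltyNode:
    value: int


# "Node" is the common name for everything that can sit in a list.
Node = Union[CharNode, GlueNode, PenaltyNode]


@dataclass
class NodeList:
    """A finished sequence of nodes (always called a *node* list)."""
    nodes: List[Node] = field(default_factory=list)

    def append(self, node: Node) -> None:
        self.nodes.append(node)

    def __len__(self) -> int:
        return len(self.nodes)


class Builder:
    """The list currently under construction in some mode."""

    def __init__(self, mode: str):
        self.mode = mode            # e.g. "horizontal" or "vertical"
        self._list = NodeList()

    def add(self, node: Node) -> None:
        self._list.append(node)

    def finish(self) -> NodeList:
        """Hand over the built list and start a fresh one."""
        done, self._list = self._list, NodeList()
        return done
```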

> >It works for English (does it really always ?), because it is simple,
> >right? I don't know, whether it is a real problem in any other language
> >in practice. I just know the code and I think that it is incorrect,
> >inconsistent and illogical.
> >
> >
> my impression is that the number of missed/wrong cases for english is so
> small that it falls within the 'no problem to correct it manually'
> criteria; languages with compound words, accented characters etc have
> higher demands

It's maybe not a real problem in practice, but it was the biggest pain
to reimplement in NTS. It could actually be easier to do it the 'right
way'. So if I have to do it again, I don't want to repeat the same
annoyance.

On Wed, 21 Sep 2005, 10:59:33, Taco Hoekwater wrote:
> >For me the fully restorable read syntax is very important (can I get all
>  [...]
>
> I believe all extra parameters had better be in-line, for optimal
> flexibility. As much as possible, at least:  some information is
> irretrievably lost in current TeX.
>
> Quite a lot can be solved by adding a new read syntax for character
> and language nodes, one that does not depend on font and language
> id numbers. It'll be rather verbose and a tad slow, that is the
> price you pay for extra flexibility.

Well, inline is convenient to import but might be overkill if there
is a lot of info (like font file name and 'at size') for each character.
That can be avoided by using references, i.e. outputting a table first
and then referencing the ids from the table. For scanning it's not much
more work, but for writing it requires a first pass which collects the
table from the references in the data.
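
The two-pass writing scheme can be sketched in a few lines of Python.
The output format ("font N = ..." and "char N c") is invented for this
example and is not a proposal for the real syntax; characters are
modelled as (char, (font file, size)) pairs.

```python
def collect_font_table(chars):
    """First pass: give every distinct (font file, size) pair a small id."""
    table = {}
    for _ch, font in chars:
        if font not in table:
            table[font] = len(table)
    return table


def write_with_references(chars):
    """Second pass: emit the table first, then characters that refer to it."""
    table = collect_font_table(chars)
    lines = [f"font {fid} = {name} at {size}pt"
             for (name, size), fid in table.items()]
    lines += [f"char {table[font]} {ch}" for ch, font in chars]
    return lines


if __name__ == "__main__":
    sample = [("e", ("cmr10", 10.0)),
              ("t", ("cmti10", 12.0)),
              ("e", ("cmti10", 12.0))]
    print("\n".join(write_with_references(sample)))
```

The reader only needs to remember the table while scanning, whereas the
writer has to walk the data twice, which matches the trade-off described
above.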

> >But concerning the metric files, if I want to treat hyphenation locally,
> >then I also need the kerning and ligature programs. In TeX it is done
> >too early (and then it is taken apart and (wrongly) reconstructed during
> >hyphenation pass). I want to do ligatures and kernings on demand,
> >basically after hyphenation (it's not that simple, but anyway).
>
> In current TeX, it is not done too early: ligkerns can influence which
> line breaks are chosen, so the ligkern programs have to be applied
> first thing.

Yes, I know, I wrote that it's not that simple. IMO the ligkerns
should be considered many times but the final modification of the data
should be late. I would postpone it until output and ask dynamically
(with maybe some caching) each time it is needed (getting sizes, ...).
This approach would need to represent ligature prevention ({}) explicitly
as a node.
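
A rough Python sketch of the idea, under strong simplifications: only
kerns are handled (no ligatures), the node list is a flat sequence of
(font, char) pairs, and the kern table is toy data containing just the
two cmr10 kerns that show up in the \showbox dump later in this mail.
The names KERNS, NOLIG, kern_between and total_kerning are made up for
the example.

```python
from functools import lru_cache

# Toy kern table (assumption): every pair not listed here kerns by 0.
KERNS = {
    ("tenrm", "F", "r"): -0.83334,
    ("tenrm", "p", "e"): 0.27779,
}

# Explicit node standing for the {} that prevents ligature/kern building.
NOLIG = object()


@lru_cache(maxsize=None)
def kern_between(font, left, right):
    """Look up one kern on demand; results are cached per (font, pair)."""
    return KERNS.get((font, left, right), 0.0)


def total_kerning(nodes):
    """Sum the kerns between adjacent same-font characters.

    Nothing is stored back into `nodes`: the list stays unmodified and the
    kerns are recomputed (from the cache) whenever a size is requested.
    """
    total = 0.0
    prev = None
    for node in nodes:
        if node is NOLIG:
            prev = None          # the pair across a NOLIG node is never kerned
            continue
        if prev is not None and prev[0] == node[0]:
            total += kern_between(prev[0], prev[1], node[1])
        prev = node
    return total
```

Because the node list itself is never rewritten, a hyphenation pass has
nothing to take apart and reconstruct; it only needs to ask again for
the affected pairs.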

> >NO. It screws up everything, not only taken or potential breaks, but
> >even the potential hyphenation points which are never considered a
> >break.
>
> It does all potential hyphenation points, but that is still a subset
> of all hyphenation points: absolutely impossible points are ignored
> (like in the middle of the first line). At least, that's what Knuth's
> web comments say, and note that that is not a feature of the algorithm,
> only an optimization.

I forgot about the first line, but is there anything else?

> Perhaps just a little, but you have a valid case ;-)

Well, we can stop it here and start a separate thread if it is ever needed.

> >right? I don't know, whether it is a real problem in any other language
> >in practice. I just know the code and I think that it is incorrect,
> >inconsistent and illogical.
>
> It is also near-impossible to fix while maintaining compatibility,
> which is probably why no-one has seriously attempted to clean up
> the code, up-til-now.

Sure, I originally wanted to do it in the 'right way' in NTS, but then I
realized that was impossible while keeping compatibility and it was a
real pain to reimplement it.

On Wed, 21 Sep 2005, 11:26:53, Taco Hoekwater wrote:
> Hans Hagen wrote:
> >so what happens if you remove the optimizations (forget about 100%
> >compatibility)
>
> Probably (hopefully) nothing except some bloat in the data structure,
> but I won't take bets on that.

But the only optimization is not changing the ligkerns in the first line
of the paragraph while hyphenating, right? Then removing that would be
even worse, but the difference is so small, anyway; this optimization is
really not a problem.

> >>It is also near-impossible to fix while maintaining compatibility,
> >>which is probably why no-one has seriously attempted to clean up
> >>the code, up-til-now.
> >
> >
> >but we don't care much about that part of compatibility, do we?
>
> Nah. (but it was a big issue for etex, nts, and pdftex-in-dvi mode)

Exactly, it was a nightmare for me.

On Wed, 21 Sep 2005, 06:25:50, Thanh Han The wrote:
> My first thought is that some small modifications to
> \showlist and \showbox will help a lot. It's easy to write
> additional info like dimensions of each item in the list, or
> in case of characters the filename of a tfm with fontsize
> (or we may write the dimensions of each char as Hans
> suggested, but this is an overkill IMHO).

If I get the dimensions of characters explicitly, then I don't need to
access/know the metric files. But this changes if I want to handle
the hyphenation locally (which seems like the only way). Then I also
need the ligkerns, so I would either need them explicitly as well (I
mean the ligkern programs) -- that would be quite complicated to export
(and import) -- or I would need to access the metric files anyway.

Therefore the explicit char dimensions seem like a temporary solution
only, and I don't think it's worth doing.

> My feeling is that we need to work out the specification and
> format of the ``node list'' first. In the first step, I
> would prefer to have only node-specific things, eg only what
> comes out after a box construction. I also got a similar
> request: to provide a primitive that writes out the content
> of a box and another primitive to re-construct that box back
> from the output. We can start with this and make further
> extensions later on.

Well, I think that the \showlists output contains everything except
a reliable font id (and the language id?), and it is parseable. The
syntax could be changed slightly to make it more compatible with the
input syntax (or maybe it can really be written in the input syntax) or
to be easier for a plugin to parse, but the information carried by the
syntax is what really matters.

If I have all referenced fonts explicitly defined at the beginning (with
maybe some renaming of the font ids when conflicts arise (can it happen?))
then I'm happy.

So with the current syntax it would be something like:

\tenrm=select font cmr10.
\twelveit=select font cmti10 at 12.0pt.
\hbox(6.94444+1.94444)x435.9297, glue set 318.73502fil
.\hbox(0.0+0.0)x0.0
.\tenrm F
.\kern-0.83334
.\tenrm r
.\tenrm e
.\tenrm e
.\tenrm -
.\discretionary
.\tenrm s
.\tenrm h
.\tenrm a
.\tenrm p
.\kern0.27779
.\tenrm e
.\glue 3.33333 plus 1.66666 minus 1.11111
.\twelveit t
.\twelveit e
.\twelveit x
.\twelveit t

or in the input syntax:

\font\tenrm=cmr10 at 10.0pt
\font\twelveit=cmti10 at 12.0pt
\hbox to 435.9297pt {%
\hbox{}%
\tenrm
F\kern -0.83334pt r{}e{}e%
[...]
\twelveit
t{}e{}x{}t%
}

It seems that the input syntax would have to prevent the normal ligkern
building, which would be quite awkward.

So maybe some custom syntax in between.
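
To show that the first variant (font definitions followed by a
\showbox-style listing) is easy for a plugin to read, here is a minimal
Python sketch of a parser for it. It only covers the constructs that
appear in the example above, the tuple layout of the result is an
arbitrary choice for this sketch, and parse_dump is a made-up name.

```python
import re

# "\tenrm=select font cmr10." or "\twelveit=select font cmti10 at 12.0pt."
FONT_DEF = re.compile(r"^\\([A-Za-z]+)=select font (\S+?)(?: at ([\d.]+)pt)?\.$")
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)")


def parse_dump(text):
    """Return (fonts, nodes) for a \\showbox-like dump.

    fonts: {font id: (file name, size in pt or None)}
    nodes: (depth, "char", font id, char), (depth, "kern", amount)
           or (depth, node type, raw rest of line)
    """
    fonts, nodes = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = FONT_DEF.match(line)
        if m:
            name, filename, size = m.groups()
            fonts[name] = (filename, float(size) if size else None)
            continue
        depth = len(line) - len(line.lstrip("."))   # leading dots = nesting
        body = line[depth:]
        m = CONTROL_WORD.match(body)
        if not m:
            raise ValueError(f"cannot parse line: {raw!r}")
        word, rest = m.group(1), body[m.end():].strip()
        if word in fonts:                  # a declared font id => char node
            nodes.append((depth, "char", word, rest))
        elif word == "kern":
            nodes.append((depth, "kern", float(rest)))
        else:                              # hbox, glue, discretionary, ...
            nodes.append((depth, word, rest))
    return fonts, nodes
```

Having the fonts declared up front is what lets the parser tell a
character node from any other control word without access to TeX's
internal font numbers.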

> At the moment I cannot see clearly what is needed, but I am
> willing to write some extensions so that we can experiment with
> to see what is really needed and perhaps change what have been done.

I can even experiment myself and then send a patch (as was suggested).
I only need to install the right sources; I'd be grateful if someone
could point me to them and tell me about any build tricks if needed.

Best regards to all,

--ksk