[NTG-pdftex] Re: Optimizing the generated pdf

Hartmut Henkel hartmut_henkel at gmx.de
Sat Nov 19 00:50:03 CET 2005


On Fri, 18 Nov 2005, Heiko Oberdiek wrote:

> On Thu, Nov 17, 2005 at 06:46:58PM +0100, Hartmut Henkel wrote:
>
> > And when the object is gone, it's nasty to seek around in the PDF
> > file.
>
> The position and length of the objects could be stored in memory.

Yes. But this would imply seeking around in the already written file,
which is slow, I believe, but it can be done.

Another argument against it: it seems that the only reason why pdftex
currently can't write the PDF to stdout is that it has to seek back and
write the /Length. Are there other places where it backs up? So if we
could get rid of this only (?) seek (without writing a separate /Length
object), e.g. by buffering the streams in memory (which should be
harmless, given current memory sizes), we would gain the ability to
write to stdout. Any additional file seek takes us further away from
that. OK, one can live without stdout writing :-)
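
To illustrate the buffering idea, here is a minimal sketch in Python
(not pdftex code). It assumes the whole stream body fits in memory; the
names CountingWriter and write_stream_object are made up for this
example. The body is collected first, so /Length is known before the
object header goes out and the output is only ever written forward:

```python
import io


class CountingWriter:
    """Forward-only writer that tracks the byte offset itself (no tell/seek)."""

    def __init__(self, raw):
        self.raw = raw
        self.offset = 0

    def write(self, data):
        self.raw.write(data)
        self.offset += len(data)


def write_stream_object(out, objnum, chunks):
    """Collect the stream body in memory, then emit the object in one pass.

    Returns the byte offset of the object, as needed later for the xref table.
    """
    body = io.BytesIO()
    for chunk in chunks:
        body.write(chunk)
    data = body.getvalue()

    start = out.offset
    # /Length is already known here, so there is nothing to patch afterwards.
    out.write(b"%d 0 obj\n<</Length %d>>\nstream\n" % (objnum, len(data)))
    out.write(data)
    out.write(b"\nendstream\nendobj\n")
    return start
```

Since the writer counts bytes itself, it could just as well wrap a pipe
such as sys.stdout.buffer, where seeking is not possible.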

> > > > This would e.g. condense all the obj <</S /GoTo /D [n 0 R
> > > > /Fit]>> endobj in the pdfTeX manual. :-)
> >
> > if it's enough to scan the last say 100 non-stream objects: this can
> > be done, at least it would catch these next to each other similar
> > objects.
>
> The matches of "similar" objects can be increased by normalization:
> * Removal of unnecessary spaces.
> * Ordering of dictionary keys.
> * Normalization of strings and names.

Yes. As a tiny start, maybe we should remove all redundant spaces like
in "/Type /Page" (a tip from this Fat PDF paper).
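
A rough sketch of such a space removal, in Python for illustration only.
It assumes a non-stream object body without "%" comments, leaves the
contents of literal strings untouched, and does not reorder dictionary
keys or normalize names; strip_redundant_spaces is a hypothetical name:

```python
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def strip_redundant_spaces(obj):
    """Drop whitespace next to a PDF delimiter; collapse other runs to one space."""
    out = bytearray()
    depth = 0  # nesting depth inside a literal string "( ... )"
    i, n = 0, len(obj)
    while i < n:
        c = obj[i]
        if depth > 0:
            # Inside a literal string: copy bytes verbatim.
            out.append(c)
            if c == 0x5C and i + 1 < n:  # backslash escapes the next byte
                out.append(obj[i + 1])
                i += 2
                continue
            if c == 0x28:  # "("
                depth += 1
            elif c == 0x29:  # ")"
                depth -= 1
            i += 1
            continue
        if c in WHITESPACE:
            j = i
            while j < n and obj[j] in WHITESPACE:
                j += 1
            prev_is_delim = not out or out[-1] in DELIMITERS
            next_is_delim = j >= n or obj[j] in DELIMITERS
            # A separator is only needed between two regular tokens.
            if not (prev_is_delim or next_is_delim):
                out.append(0x20)
            i = j
            continue
        if c == 0x28:
            depth = 1
        out.append(c)
        i += 1
    return bytes(out)
```

For example, b"<< /S /GoTo /D [12 0 R /Fit] >>" becomes
b"<</S/GoTo/D[12 0 R/Fit]>>".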

> Disadvantage: parsing of pdf objects would be necessary.

Maybe one can use xpdf for this...

> > Maybe MD5 would be overkill, just a hash + comparison would be ok.
>
> Yes.

It doesn't even need to be perfect (like a hash with a list). If it
brings down the duplicates to 1% or 10%, that would be ok...
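
As a sketch of what "hash + comparison" could look like (Python, for
illustration only): a fixed table with one slot per hash value, where a
colliding entry simply overwrites the old one instead of being chained
in a list. The class name, the slot count, and the use of CRC-32 as the
cheap hash are assumptions of this example, not anything pdftex does:

```python
import zlib


class RecentObjectCache:
    """Lossy duplicate detector for non-stream object bodies."""

    def __init__(self, slots=128):
        self.slots = [None] * slots

    def lookup_or_add(self, objnum, body):
        """Return the number of an identical cached object, else objnum itself."""
        idx = zlib.crc32(body) % len(self.slots)
        entry = self.slots[idx]
        # The hash only picks the slot; the byte comparison decides equality.
        if entry is not None and entry[1] == body:
            return entry[0]
        # Miss or collision: overwrite the slot, so some duplicates are missed.
        self.slots[idx] = (objnum, body)
        return objnum
```

If the returned number differs from objnum, the new object need not be
written and references can point to the earlier one. Overwriting on
collision keeps memory fixed at the price of missing a few duplicates.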

Regards, Hartmut

