On Fri, 18 Nov 2005, Heiko Oberdiek wrote:
On Thu, Nov 17, 2005 at 06:46:58PM +0100, Hartmut Henkel wrote:
And when the object is gone, it's nasty to seek around in the PDF file.
The position and length of the objects could be stored in memory.
Yes. But this would imply searching in the already written file, which is slow, I believe, but can be done. An argument against it: it seems the only reason pdftex currently can't write the PDF to stdout is that it has to seek back and write the /Length. Are there other places where it backs up? So if we could get rid of this only (?) seek (without writing a separate /Length object), e.g. by buffering the streams in memory (which should be harmless, given current memory sizes), we would gain the ability to write to stdout. Any additional file seek takes us further away from that. OK, one can live without stdout writing :-)
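To make the buffering idea concrete, here is a minimal sketch in Python (not pdftex code). It assumes the whole stream body is held in memory before anything is written, so /Length is known up front and the output only ever moves forward. The function name `write_stream_object` and its parameters are hypothetical.

```python
import io


def write_stream_object(out, obj_num, data, extra_dict=b""):
    """Write one PDF stream object to `out` using forward writes only.

    `data` is the complete stream body, already buffered in memory, so the
    /Length value can be emitted before the body and no seek is needed.
    `extra_dict` holds optional further dictionary entries as raw bytes.
    Returns the number of bytes written (useful for xref offsets).
    """
    header = b"%d 0 obj\n<< /Length %d%s >>\nstream\n" % (
        obj_num, len(data), extra_dict)
    trailer = b"\nendstream\nendobj\n"
    out.write(header)
    out.write(data)
    out.write(trailer)
    return len(header) + len(data) + len(trailer)


if __name__ == "__main__":
    import sys
    # sys.stdout.buffer is a pipe-friendly binary stream; nothing here seeks.
    write_stream_object(sys.stdout.buffer, 7, b"BT ET")
```

The price is memory: the largest stream has to fit in the buffer before it can be written.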
This would e.g. condense all the obj <> endobj in the pdfTeX manual. :-)
If it's enough to scan the last, say, 100 non-stream objects, this can be done; at least it would catch similar objects that sit next to each other.
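A rough sketch of such a bounded scan, in Python rather than in pdftex's own code: only the bodies of the most recent non-stream objects are remembered, and a new object that matches one of them byte for byte is replaced by the earlier object number. The class name `RecentObjects` and the default window of 100 are illustrative assumptions.

```python
from collections import OrderedDict


class RecentObjects:
    """Remember the bodies of the last `window` distinct non-stream objects."""

    def __init__(self, window=100):
        self.window = window
        self._seen = OrderedDict()  # object body (bytes) -> object number

    def lookup_or_add(self, obj_num, body):
        """Return the number of an identical recent object, else `obj_num`."""
        hit = self._seen.get(body)
        if hit is not None:
            return hit
        self._seen[body] = obj_num
        if len(self._seen) > self.window:
            # Forget the oldest remembered object.
            self._seen.popitem(last=False)
        return obj_num
```

Everything stays in memory, so nothing has to be read back from the written file; duplicates that are further apart than the window are simply missed.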
The matches of "similar" objects can be increased by normalization:
* Removal of unnecessary spaces.
* Ordering of dictionary keys.
* Normalization of strings and names.
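The first two normalization steps could look roughly like this. The sketch is in Python and assumes the object has already been parsed into a toy model (dicts with name keys, lists, numbers, and strings that hold ready-made PDF tokens); the function names `tokens` and `canonical` are hypothetical, and normalization of strings and names is not covered.

```python
# PDF delimiter characters: no white space is needed next to any of them.
DELIMS = set("()<>[]{}/%")


def tokens(obj):
    """Yield the PDF tokens of `obj`, with dictionary keys in sorted order."""
    if isinstance(obj, dict):
        yield "<<"
        for key in sorted(obj):
            yield key
            yield from tokens(obj[key])
        yield ">>"
    elif isinstance(obj, (list, tuple)):
        yield "["
        for item in obj:
            yield from tokens(item)
        yield "]"
    else:
        # Numbers, and strings that already hold a PDF token such as "/Page".
        yield str(obj)


def canonical(obj):
    """Serialize `obj`, inserting a space only where two tokens would merge."""
    out = ""
    for tok in tokens(obj):
        if out and out[-1] not in DELIMS and tok[0] not in DELIMS:
            out += " "
        out += tok
    return out


print(canonical({"/Type": "/Page", "/Kids": [1, 2], "/Count": 3}))
# prints <</Count 3/Kids[1 2]/Type/Page>>
```

Two dictionaries that differ only in key order or spacing then serialize to the same bytes, so a plain comparison finds them.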
Yes. As a tiny start, maybe we should remove all redundant spaces like in "/Type /Page" (a tip from this Fat PDF paper).
Disadvantage: parsing of pdf objects would be necessary.
Maybe one could use xpdf for this...
Maybe MD5 would be overkill, just a hash + comparison would be ok.
Yes.
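A cheap hash plus comparison might be sketched like this in Python. The choice of CRC-32 (via `zlib.crc32`), the table size, and the class name `CheapDedup` are assumptions for illustration. Each hash value gets a single slot without a collision chain, so a collision just overwrites the older entry and some duplicates go unnoticed.

```python
import zlib


class CheapDedup:
    """Fixed-size table with one slot per hash bucket, no collision chains."""

    def __init__(self, slots=1024):
        self._slots = [None] * slots  # each slot: (body, obj_num) or None

    def find_or_store(self, obj_num, body):
        """Return the number of a stored identical object, else `obj_num`."""
        index = zlib.crc32(body) % len(self._slots)
        entry = self._slots[index]
        # The hash only selects the slot; the byte comparison decides.
        if entry is not None and entry[0] == body:
            return entry[1]
        self._slots[index] = (body, obj_num)
        return obj_num
```

Because the bodies are compared byte for byte, a hash collision can never merge two different objects; it can only cause a missed duplicate.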
Not even a need to make it perfect (like hash with list). If it brings down the duplicates to 1% or 10% it would be ok...

Regards, Hartmut