Re: [dev-context] Re: \showskips bug
Hans Hagen wrote:
That's fixed, a while back. my pdfetex now quitted at
[94729] [94730] [94731] [94732] [94733] [94734] [94735] [94736
! TeX capacity exceeded, sorry [indirect objects table size=300000].
<output> {\shipout \box 255 \global \advance \pageno 1 \unhbox 255 }
<to be read again>
\end
l.3 \bye
! ==> Fatal error occurred, the output PDF file is not finished!
Transcript written on out.log.
After a minute or two.
ah, so in practice it will quit earlier depending on other objects used (i'm not sure if for instance objects of included graphics use that piece of memory)
This is mostly the required page objects. 3 objects are used per actual (totally empty) page:
13 0 obj << /Length 0 >> stream endstream endobj
12 0 obj << /Type /Page /Contents 13 0 R /Resources 11 0 R /MediaBox [0 0 595.2756 841.8898] /Parent 7 0 R >> endobj
11 0 obj << /ProcSet [ /PDF ] >> endobj
It is a bit wasteful to keep those in the indirect objects table for ever and onwards, but I am not sure if it is doable to flush them right away. (CC ntg-pdftex)
Greetings, Taco
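As a rough cross-check of the numbers above, here is a small Python sketch (not part of pdfTeX; the function name `max_empty_pages` is made up for illustration). It relates the table size from the error message to the three objects per empty page, and assumes that every page up to the failing one had already claimed its three slots, which may be off by a page.

```python
def max_empty_pages(table_size: int, objects_per_page: int = 3) -> int:
    """Upper bound on the page count if every table slot went to page objects."""
    return table_size // objects_per_page


TABLE_SIZE = 300_000     # "indirect objects table size" from the error message
FAILED_AT_PAGE = 94_736  # last page number printed in the log

upper_bound = max_empty_pages(TABLE_SIZE)
page_slots = 3 * FAILED_AT_PAGE         # slots taken by per-page objects (assumed)
other_slots = TABLE_SIZE - page_slots   # what is left for everything else

print(upper_bound, page_slots, other_slots)  # 100000 284208 15792
```

The bound of 100000 pages sits a little above the observed failure, which fits the remark that other objects share the same table.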
Taco Hoekwater wrote:
Hans Hagen wrote:
That's fixed, a while back. my pdfetex now quitted at
[94729] [94730] [94731] [94732] [94733] [94734] [94735] [94736
! TeX capacity exceeded, sorry [indirect objects table size=300000].
<output> {\shipout \box 255 \global \advance \pageno 1 \unhbox 255 }
<to be read again>
\end
l.3 \bye
! ==> Fatal error occurred, the output PDF file is not finished!
Transcript written on out.log.
After a minute or two.
ah, so in practice it will quit earlier depending on other objects used (i'm not sure if for instance objects of included graphics use that piece of memory)
This is mostly the required page objects. 3 objects are used per actual (totally empty) page:
13 0 obj << /Length 0 >> stream endstream endobj
12 0 obj << /Type /Page /Contents 13 0 R /Resources 11 0 R /MediaBox [0 0 595.2756 841.8898] /Parent 7 0 R >> endobj
11 0 obj << /ProcSet [ /PDF ] >> endobj
It is a bit wasteful to keep those in the indirect objects table for ever and onwards, but I am not sure if it is doable to flush them right away. (CC ntg-pdftex)
well, the procset is obsolete anyway, so that could save one object already
Hans
On 2005-11-17 09:45:06 +0100, Hans Hagen wrote:
well, the procset is obsolete anyway, so that could save one object already
It should still be written, says the reference. :-} Best Martin -- http://www.tm.oneiros.de
On 2005-11-17 09:31:19 +0100, Taco Hoekwater wrote:
This is mostly the required page objects. 3 objects are used per actual (totally empty) page:
13 0 obj << /Length 0 >> stream endstream endobj
This one is even longer with compresslevel > 0. It would be nice to not compress empty streams, but I think that would be too difficult to implement and isn't needed very often.
12 0 obj << /Type /Page /Contents 13 0 R /Resources 11 0 R /MediaBox [0 0 595.2756 841.8898] /Parent 7 0 R >> endobj
11 0 obj << /ProcSet [ /PDF ] >> endobj
And 11 and 13 are created for every page. :-(
11 could simply be empty (or null) for empty pages. Looking at
It is a bit wasteful to keep those in the indirect objects table for ever and onwards, but I am not sure if it is doable to flush them right away. (CC ntg-pdftex)
I don't think that optimizations like these are generally useful as they are seldom needed and make the code more complex. When I look at a typical result of ConTeXt or hyperref, they seem unneeded. Btw: Is there a tool that compresses a pdf by replacing identical objects with references? Best Martin -- http://www.tm.oneiros.de
Martin Schröder wrote:
On 2005-11-17 09:31:19 +0100, Taco Hoekwater wrote:
This is mostly the required page objects. 3 objects are used per actual (totally empty) page:
13 0 obj << /Length 0 >> stream endstream endobj
This one is even longer with compresslevel > 0. It would be nice to not compress empty streams, but I think that would be too difficult to implement and isn't needed very often.
couldn't that be a null object then ?
12 0 obj << /Type /Page /Contents 13 0 R /Resources 11 0 R /MediaBox [0 0 595.2756 841.8898] /Parent 7 0 R >> endobj
11 0 obj << /ProcSet [ /PDF ] >> endobj
And 11 and 13 are created for every page. :-(
11 could simply be empty (or null) for empty pages. Looking at
, it doesn't seem too difficult to optimize for empty Resources. Of course, the question is: How often do we have empty /Resources? Normally they at least have a /Font entry. If we start optimizations like these, it would be nice to move the /MediaBox to the root object (or pages) and write it only for different-sized pages (Hans: ConTeXt writes /TrimBox and /CropBox on every page (even if they are always the same); adding them to pdfpagesattr instead would save quite some space -- look at pdftex-a.pdf).
context can have mixed page sizes in one document (which i need :-)
It is a bit wasteful to keep those in the indirect objects table for ever and onwards, but I am not sure if it is doable to flush them right away. (CC ntg-pdftex)
I don't think that optimizations like these are generally useful as they are seldom needed and make the code more complex. When I look at a typical result of ConTeXt or hyperref, they seem unneeded.
indeed; we can have a 'nice to-do' list for that; however, i think that the procset can safely be removed (not used by viewers anyway)
Btw: Is there a tool that compresses a pdf by replacing identical objects with references?
acrobat professional? in pdftex it would mean calculating a checksum for each object before flushing it; it may slow down things a bit
Hans
On 2005-11-17 11:39:06 +0100, Hans Hagen wrote:
Martin ??? wrote: ^^^ :-(((
If we start optimizations like these, it would be nice to move the /MediaBox to the root object (or pages) and write it only for different-sized pages (Hans: ConTeXt writes /TrimBox and /CropBox on every page (even if they are always the same); adding them to pdfpagesattr instead would save quite some space -- look at pdftex-a.pdf).
context can have mixed page sizes in one document (which i need :-)
Of course. But typical documents have one page size. Write the default in pdfpagesattr and only differences in pdfpageattr. [...]
Btw: Is there a tool that compresses a pdf by replacing identical objects with references?
acrobat professional?
cmdline on Linux, please.
in pdftex it would mean calculating a checksum for each object before flushing it; it may slow down things a bit
Not much. Best Martin -- http://www.tm.oneiros.de
On Thu, Nov 17, 2005 at 11:39:06AM +0100, Hans Hagen wrote:
Martin ??? wrote:
On 2005-11-17 09:31:19 +0100, Taco Hoekwater wrote:
This is mostly the required page objects. 3 objects are used per actual (totally empty) page:
13 0 obj << /Length 0 >> stream endstream endobj
This one is even longer with compresslevel > 0. It would be nice to not compress empty streams, but I think that would be too difficult to implement and isn't needed very often.
I think this is the task of a separate optimizer that tries different compression methods and chooses the best one.
couldn't that be a null object then ?
12 0 obj << /Type /Page /Contents 13 0 R /Resources 11 0 R /MediaBox [0 0 595.2756 841.8898] /Parent 7 0 R >> endobj
11 0 obj << /ProcSet [ /PDF ] >> endobj
And 11 and 13 are created for every page. :-(
11 could simply be empty (or null) for empty pages. Looking at
, it doesn't seem too difficult to optimize for empty Resources. Of course, the question is: How often do we have empty /Resources? Normally they at least have a /Font entry. If we start optimizations like these, it would be nice to move the /MediaBox to the root object (or pages) and write it only for different-sized pages (Hans: ConTeXt writes /TrimBox and /CropBox on every page (even if they are always the same); adding them to pdfpagesattr instead would save quite some space -- look at pdftex-a.pdf).
context can have mixed page sizes in one document (which i need :-)
Does not a standard exist that forbids inherited properties and requires the setting of /MediaBox in each page object?
It is a bit wasteful to keep those in the indirect objects table for ever and onwards, but I am not sure if it is doable to flush them right away. (CC ntg-pdftex)
I don't think that optimizations like these are generally useful as they are seldom needed and make the code more complex. When I look at a typical result of ConTeXt or hyperref, they seem unneeded.
indeed; we can have a 'nice to-do' list for that; however, i think that the procset can safely be removed (not used by viewers anyway)
Btw: Is there a tool that compresses a pdf by replacing identical objects with references?
acrobat professional?
in pdftex it would mean calculating a checksum for each object before flushing it; it may slow down things a bit
And it is only a partial optimization. Example: identical images,
each with its color table as a separate object. In the first pass,
the identical color table objects are detected and replaced by one
object. Then the images themselves contain the same reference to
the color table object, become identical, and are optimized in the
second run.
But also identical cyclic structures are possible ...
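The two-pass effect described above can be sketched in Python as a loop that repeats until nothing more merges. This is only an illustration under simplifying assumptions: objects are plain strings keyed by object number, all generation numbers are 0, and references are found with a regular expression instead of a real PDF parser. The helper names are made up, and identical cyclic structures are not detected by this approach.

```python
import re


def dedupe_once(objects):
    """One pass: map each duplicate object number to the first one with the same body."""
    first_seen = {}
    remap = {}
    for num in sorted(objects):
        keeper = first_seen.setdefault(objects[num], num)
        if keeper != num:
            remap[num] = keeper
    return remap


def rewrite_refs(body, remap):
    """Rewrite 'N 0 R' references so they point at the surviving objects."""
    def repl(match):
        num = int(match.group(1))
        return f"{remap.get(num, num)} 0 R"
    return re.sub(r"\b(\d+) 0 R\b", repl, body)


def dedupe_to_fixed_point(objects):
    """Repeat passes until no identical objects remain; return (objects, passes)."""
    objects = dict(objects)
    passes = 0
    while True:
        remap = dedupe_once(objects)
        if not remap:
            return objects, passes
        passes += 1
        objects = {num: rewrite_refs(body, remap)
                   for num, body in objects.items() if num not in remap}


# Two identical images, each with its own (identical) color table object.
example = {
    1: "<< /Type /XObject /ColorSpace 2 0 R >>",
    2: "[ /Indexed /DeviceRGB 1 <000000ffffff> ]",
    3: "<< /Type /XObject /ColorSpace 4 0 R >>",
    4: "[ /Indexed /DeviceRGB 1 <000000ffffff> ]",
}
result, passes = dedupe_to_fixed_point(example)
print(passes, sorted(result))  # 2 [1, 2]
```

The first pass merges the color tables (4 into 2); only after the reference in object 3 is rewritten do the two images compare equal, so a second pass is needed.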
I think, the job of pdfTeX is to generate PDF. Optimization is
the job of another program, before there are too many things
to consider:
New compression features of newer PDF versions: compressed
cross-ref table, compressed object streams, filters before
applying compression, ...
Yours sincerely
Heiko
On 2005-11-18 23:08:15 +0100, Heiko Oberdiek wrote:
On Thu, Nov 17, 2005 at 11:39:06AM +0100, Hans Hagen wrote: Does not a standard exist that forbids inherited properties and requires the setting of /MediaBox in each page object?
No, /MediaBox is inheritable. [...]
I think, the job of pdfTeX is to generate PDF. Optimization is the job of another program, before there are too many things
I agree: E.g. it would be an interesting extension for pdftk. Best Martin -- http://www.tm.oneiros.de
On Sat, Nov 19, 2005 at 01:21:34AM +0100, Martin Schröder wrote:
On 2005-11-18 23:08:15 +0100, Heiko Oberdiek wrote:
On Thu, Nov 17, 2005 at 11:39:06AM +0100, Hans Hagen wrote: Does not a standard exist that forbids inherited properties and requires the setting of /MediaBox in each page object?
No, /MediaBox is inheritable.
I am not talking about PDF. What about PDF/X?
Yours sincerely
Heiko
On 2005-11-20 14:01:15 +0100, Heiko Oberdiek wrote:
On Sat, Nov 19, 2005 at 01:21:34AM +0100, Martin Schröder wrote:
No, /MediaBox is inheritable.
I am not talking about PDF. What about PDF/X?
I don't know (I don't have the standard) and I can't test. Can somebody confirm this? Best Martin -- http://www.tm.oneiros.de
No, /MediaBox is inheritable.
I am not talking about PDF. What about PDF/X?
I don't know (I don't have the standard) and I can't test.
Can somebody confirm this?
Version 3 says nothing about inheritance (see quoted below). I'll try to get the newest PDF/X spec. However, some apps may treat inherited boxes differently; afaik Acrobat doesn't display boxes if inherited from /Pages :-)
PDF/X-3:2002
"_2.9 Box usage and management_
The inclusion of the bounding boxes as specified in Portable Document Format Reference Manual, Version 1.3, Second Edition, is a required operation in the process of creating a PDF/X file. For all PDF files, the MediaBox is required. Additionally, each PDF/X page shall include either an ArtBox or TrimBox, but not both. The inclusion of a BleedBox is optional. If a BleedBox is present, neither the ArtBox nor the TrimBox may extend beyond the boundaries of the BleedBox. In the case where the CropBox is present, neither the ArtBox nor the TrimBox may extend beyond the boundaries of the CropBox. The bounding boxes may be used by PDF/X-compliant pagination applications to automatically position the file in a predetermined space within a layout construct. Within some industry workflows, both the BleedBox and TrimBox are necessary. For example, commercial, non-newspaper printing may include large numbers of pages containing bleed and trim information. It is important that boxes that represent this information be included. The accurate inclusion of the BleedBox and TrimBox will allow for the correct portion of the file to be imposed and rendered, and appropriate automation to be applied.
NOTE: The use of TrimBox is recommended in preference to ArtBox."
--
Pawe/l Jackowski
P.Jackowski@gust.org.pl
On 2005-11-20 19:44:44 +0100, Pawel Jackowski wrote:
Version 3 says nothing about inheritance (see quoted below). I'll try to get the newest PDF/X spec. However, some apps may treat inherited boxes differently; afaik Acrobat doesn't display boxes if inherited from /Pages :-)
I just reread the 1.6 spec (section 3.6.2):
- /MediaBox is required, and inheritable
- /CropBox is optional, and inheritable
- /BleedBox is optional, and _not_ inheritable
- /TrimBox is optional, and _not_ inheritable
- /ArtBox is optional, and _not_ inheritable
So Hans is right in specifying Trim on every page, but can specify Media and Crop only once.
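To make the inheritance rule concrete, here is a minimal Python sketch of how a reader might resolve a box for a page. It assumes a toy model in which each object is a dict keyed by PDF name and /Parent holds a plain object number; the function name `page_attribute` and the sample object numbers are illustrative only.

```python
# Boxes that may be inherited from ancestor /Pages nodes, per the list above.
INHERITABLE = {"/MediaBox", "/CropBox"}


def page_attribute(objects, page_num, key):
    """Return the value of key for a page, walking up /Parent if it is inheritable."""
    node = objects[page_num]
    if key in node:
        return node[key]
    if key not in INHERITABLE:
        return None  # e.g. /TrimBox must be given on the page itself
    while "/Parent" in node:
        node = objects[node["/Parent"]]
        if key in node:
            return node[key]
    return None


objects = {
    7: {"/Type": "/Pages", "/MediaBox": [0, 0, 595.2756, 841.8898],
        "/TrimBox": [10, 10, 585, 831]},          # ignored: not inheritable
    12: {"/Type": "/Page", "/Parent": 7},          # uses the default size
    15: {"/Type": "/Page", "/Parent": 7,
         "/MediaBox": [0, 0, 200, 100]},           # overrides the default
}

print(page_attribute(objects, 12, "/MediaBox"))  # [0, 0, 595.2756, 841.8898]
print(page_attribute(objects, 15, "/MediaBox"))  # [0, 0, 200, 100]
print(page_attribute(objects, 12, "/TrimBox"))   # None
```

This mirrors the suggestion of writing the default once on the /Pages node and only the differences on individual pages.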
The inclusion of the bounding boxes as specified in Portable Document Format Reference Manual, Version 1.3, Second Edition, is a required operation in the process of creating a PDF/X file. For all PDF files, the MediaBox is required. Additionally, each PDF/X page shall include either an ArtBox or TrimBox, but not both. The inclusion of a BleedBox is optional. If a BleedBox is present, neither the ArtBox nor the
I can imagine that PDF/X has the same requirements as linearized PDF with regards to inheritance. Best Martin -- http://www.tm.oneiros.de
On Sun, Nov 20, 2005 at 03:45:35PM +0100, Martin Schröder wrote:
On 2005-11-20 14:01:15 +0100, Heiko Oberdiek wrote:
On Sat, Nov 19, 2005 at 01:21:34AM +0100, Martin Schröder wrote:
No, /MediaBox is inheritable.
I am not talking about PDF. What about PDF/X?
I don't know (I don't have the standard) and I can't test.
I haven't found it either.
Now I have looked in old mails, probably I remembered
linearized PDF:
| F.2.6
| ... This page object must explicitly specify all required
| attributes, such as Resources and MediaBox; the attributes
| cannot be inherited from ancestor page tree nodes.
| ...
|
| F.2.9
| ... Note that all Resources attributes and other inheritable
| attributes of the page object must be pushed down and replicated
| in each of the leaf page objects (but they may contain indirect
| references to shared objects).
But I don't think that pdfTeX wants to generate linearized PDF,
thus "non-inheritance" is probably not an issue.
Yours sincerely
Heiko
On 2005-11-20 20:26:05 +0100, Heiko Oberdiek wrote:
But I don't think that pdfTeX wants to generate linearized PDF,
No, linearizing PDF is too complicated for pdfTeX; let's leave it to specialized applications (like pdfopt or pdftk). Best Martin -- http://www.tm.oneiros.de
Heiko Oberdiek wrote:
On Sat, Nov 19, 2005 at 01:21:34AM +0100, Martin Schröder wrote:
On 2005-11-18 23:08:15 +0100, Heiko Oberdiek wrote:
On Thu, Nov 17, 2005 at 11:39:06AM +0100, Hans Hagen wrote: Does not a standard exist that forbids inherited properties and requires the setting of /MediaBox in each page object?
No, /MediaBox is inheritable.
I am not talking about PDF. What about PDF/X?
it's indeed pdf/x that made me do this kind of things; ok, it may have been a problem of validators but ...; also, keep in mind that gs is not always bug free, so ... [i have no time after each update of a validator, gs, xpdf to go over all things i did to see what can be dropped] Hans
Martin Schröder wrote:
It is a bit wasteful to keep those in the indirect objects table for ever and onwards, but I am not sure if it is doable to flush them right away. (CC ntg-pdftex)
I don't think that optimizations like these are generally useful as they are seldom needed and make the code more complex. When I look at a typical result of ConTeXt or hyperref, they seem unneeded.
I wasn't worried about the actual objects/pdf file size, but about the space they take up in the indirect objects table, thereby indirectly limiting the total page length of a PDF document. Are there no objects that can simply be flushed to the file and then forgotten about? Huge page runs are not that unusual in database publishing. Greetings, Taco
On 2005-11-17 11:59:25 +0100, Taco Hoekwater wrote:
I wasn't worried about the actual objects/pdf file size, but about the space they take up in the indirect objects table, thereby indirectly limiting the total page length of a PDF document. Are
One needs a suitable large obj_tab_size; I set that to the max (2^23) for ArtCom (and yes, we had included pages with >2^16 objects). :-)
there no objects that can simply be flushed to the file and then forgotten about? Huge page runs are not that unusual in database
Probably. But look at how the /GoTo objects are written and used: Written at the start and used at the end. :-( Best Martin -- http://www.tm.oneiros.de
On 2005-11-17 10:46:51 +0100, Martin Schröder wrote:
Btw: Is there a tool that compresses a pdf by replacing identical objects with references?
pdfTeX could do this by itself: Store the md5 of the shortest n objects (e.g. n = 1024) smaller than x bytes (e.g. x = 1024, longer objects will typically be unique) and replace new identical objects with references to the already existing ones. This would e.g. condense all the obj <> endobj in the pdfTeX manual. :-) Best Martin -- http://www.tm.oneiros.de
Martin Schröder wrote:
On 2005-11-17 10:46:51 +0100, Martin Schröder wrote:
Btw: Is there a tool that compresses a pdf by replacing identical objects with references?
pdfTeX could do this by itself: Store the md5 of the shortest n objects (e.g. n = 1024) smaller than x bytes (e.g. x = 1024, longer objects will typically be unique) and replace new identical objects with references to the already existing ones.
This would e.g. condense all the obj <> endobj in the pdfTeX manual. :-)
such a feature makes sense indeed; maybe even configurable:
\pdfshareobjsize=1024 % with 0 meaning no checking done
(that way we can experiment)
Hans
On Thu, 17 Nov 2005, Hans Hagen wrote:
Martin Schröder wrote:
On 2005-11-17 10:46:51 +0100, Martin Schröder wrote:
Btw: Is there a tool that compresses a pdf by replacing identical objects with references?
pdfTeX could do this by itself: Store the md5 of the shortest n objects (e.g. n = 1024) smaller than x bytes (e.g. x = 1024, longer objects will typically be unique) and replace new identical objects with references to the already existing ones.
i won't want to rely on md5 alone (shit happens). Finally one needs a literal comparison. And when the object is gone, it's nasty to seek around in the PDF file.
This would e.g. condense all the obj <> endobj in the pdfTeX manual. :-)
if it's enough to scan the last say 100 non-stream objects: this can be done, at least it would catch these next to each other similar objects. Maybe MD5 would be overkill, just a hash + comparison would be ok. As it happens, these non-streams are collected in a separate buffer here before being written out. Let's see...
such a feature makes sense indeed; maybe even configurable:
\pdfshareobjsize=1024 % with 0 meaning no checking done
would fit in this case.
Regards, Hartmut
--
Dr.-Ing. Hartmut Henkel
In den Auwiesen 6, D-68723 Oftersheim, Germany
E-Mail: hartmut_henkel@gmx.de
http://www.circuitwizard.de
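The idea discussed in this message (scan only the last few non-stream objects, use a cheap hash, then confirm with a literal comparison) could look roughly like the Python sketch below. It is not pdfTeX code: the class name `RecentObjectCache`, the use of CRC-32 as the cheap hash, and the `share_size` parameter (modelled on the proposed, not existing, \pdfshareobjsize) are all assumptions made for illustration.

```python
from collections import deque
import zlib


class RecentObjectCache:
    """Remember the last `window` small objects and detect exact duplicates."""

    def __init__(self, window=100, share_size=1024):
        self.share_size = share_size          # 0 means: no checking at all
        self.recent = deque(maxlen=window)    # old entries fall out automatically

    def lookup_or_add(self, obj_num, body):
        """Return the number of an identical recent object, else obj_num itself."""
        if self.share_size == 0 or len(body) >= self.share_size:
            return obj_num
        digest = zlib.crc32(body)
        for old_digest, old_body, old_num in self.recent:
            # The hash is only a quick filter; the literal comparison decides.
            if old_digest == digest and old_body == body:
                return old_num
        self.recent.append((digest, body, obj_num))
        return obj_num


cache = RecentObjectCache()
procset = b"<< /ProcSet [ /PDF ] >>"
first = cache.lookup_or_add(11, procset)    # new object, keeps its own number
second = cache.lookup_or_add(14, procset)   # duplicate, refers back to 11
print(first, second)  # 11 14 would mean no sharing; this prints: 11 11
```

Because the bodies are kept in memory for the comparison, nothing has to be read back from the already written PDF file; the price is that duplicates further apart than the window are missed.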
On Thu, Nov 17, 2005 at 06:46:58PM +0100, Hartmut Henkel wrote:
On Thu, 17 Nov 2005, Hans Hagen wrote:
Martin Schröder wrote:
On 2005-11-17 10:46:51 +0100, Martin Schröder wrote:
Btw: Is there a tool that compresses a pdf by replacing identical objects with references?
pdfTeX could do this by itself: Store the md5 of the shortest n objects (e.g. n = 1024) smaller than x bytes (e.g. x = 1024, longer objects will typically be unique) and replace new identical objects with references to the already existing ones.
i won't want to rely on md5 alone (shit happens). Finally one needs a literal comparison.
I agree.
And when the object is gone, it's nasty to seek around in the PDF file.
The position and length of the objects could be stored in memory.
This would e.g. condense all the obj <> endobj in the pdfTeX manual. :-)
if it's enough to scan the last say 100 non-stream objects: this can be done, at least it would catch these next to each other similar objects.
The matches of "similar" objects can be increased by normalization:
* Removal of unnecessary spaces.
* Ordering of dictionary keys.
* Normalization of strings and names.
Disadvantage: parsing of pdf objects would be necessary.
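Only the first of these normalizations can be approximated without a parser. The Python sketch below collapses whitespace runs and deliberately refuses to touch anything that might contain a literal string, a comment, or stream data, where whitespace can be significant. The function name `normalize_whitespace` is hypothetical, and ordering dictionary keys or normalizing strings and names would still need real parsing.

```python
def normalize_whitespace(body: str) -> str:
    """Collapse whitespace runs in a simple non-stream object; leave risky ones alone."""
    if "(" in body or "%" in body or "stream" in body:
        # Literal strings, comments and streams need a real PDF parser.
        return body
    return " ".join(body.split())


a = normalize_whitespace("<<  /Type   /Page\n>>")
b = normalize_whitespace("<< /Type /Page >>")
print(a == b)  # True
```

After this step the two spellings of the same dictionary compare equal, so a later duplicate check can treat them as one object.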
Maybe MD5 would be overkill, just a hash + comparison would be ok.
Yes.
Yours sincerely
Heiko
On Fri, 18 Nov 2005, Heiko Oberdiek wrote:
On Thu, Nov 17, 2005 at 06:46:58PM +0100, Hartmut Henkel wrote:
And when the object is gone, it's nasty to seek around in the PDF file.
The position and length of the objects could be stored in memory.
yes. But this would imply searching in the written file, which is slow, i believe, but can be done. Another argument against it: it seems that the only reason why pdftex currently can't write the PDF to stdout is that it has to seek back and write the /Length. Are there other places where it backs up? So if we could get rid of this only (?) seek (without writing a separate /Length object), e.g. by buffering the streams in memory (which should be harmless, given current memory sizes), we would gain the stdout writing capability. Any additional file seek takes us further away from that. Ok, one can live without stdout writing :-)
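The buffering idea can be sketched in Python as follows: collect the stream data in memory first, so that /Length is known before the object header is written and the output never needs a seek. This is an illustration of the principle only, not how pdfTeX writes files; `write_stream_object` and `ForwardOnlyWriter` are made-up names, and the memory cost for very large streams is ignored.

```python
import io


def write_stream_object(out, obj_num: int, data: bytes) -> None:
    """Write a complete stream object front to back; /Length is known up front."""
    out.write(b"%d 0 obj\n<< /Length %d >>\nstream\n" % (obj_num, len(data)))
    out.write(data)
    out.write(b"\nendstream\nendobj\n")


class ForwardOnlyWriter:
    """Stands in for stdout or a pipe: it offers write() but no seek()."""

    def __init__(self):
        self.chunks = []

    def write(self, data: bytes) -> None:
        self.chunks.append(bytes(data))


# Page content is produced piece by piece into a memory buffer ...
content = io.BytesIO()
content.write(b"BT /F1 12 Tf ")
content.write(b"(Hello) Tj ET")

# ... and only written out once its final length is known.
out = ForwardOnlyWriter()
write_stream_object(out, 13, content.getvalue())
print(b"".join(out.chunks).decode("ascii"))
```

The alternative mentioned in the message, a separate indirect /Length object written after the stream, avoids the buffer but costs one more entry in the objects table per stream.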
This would e.g. condense all the obj <> endobj in the pdfTeX manual. :-)
if it's enough to scan the last say 100 non-stream objects: this can be done, at least it would catch these next to each other similar objects.
The matches of "similar" objects can be increased by normalization:
* Removal of unnecessary spaces.
* Ordering of dictionary keys.
* Normalization of strings and names.
Yes. As a tiny start, maybe we should remove all redundant spaces like in "/Type /Page" (a tip from this Fat PDF paper).
Disadvantage: parsing of pdf objects would be necessary.
maybe one can use xpdf for this...
Maybe MD5 would be overkill, just a hash + comparison would be ok.
Yes.
Not even a need to make it perfect (like hash with list). If it brings down the duplicates to 1% or 10% it would be ok... Regards, Hartmut
Hartmut Henkel wrote:
Against this might also be that it seems that the only reason why pdftex currently can't write the PDF to stdout is that it has to seek and write the /Length. Are there other places where it backs up? So if we could get rid of this only (?) seek (without writing a separate /Length object) e. g. by buffering the streams in the memory (which should be harmless, given the current memory sizes), we would gain the stdout
eh ... only as option; 500 meg files are no exception here (sometimes even more than a gig; mostly due to huge graphics) Hans
Yes. As a tiny start, maybe we should remove all redundant spaces like in "/Type /Page" (a tip from this Fat PDF paper).
hm, do we really need to end up with the same ugly pdf as other progs produce? i kind of like the current clean (kind of readable) pdf that pdftex provides; removing duplicates, ok, but decreasing readability of the pdf file ... (ok, i know that not many people look into the source, but ...)
Hans
On Thu, 17 Nov 2005, Taco Hoekwater wrote:
Hans Hagen wrote:
That's fixed, a while back. my pdfetex now quitted at
[94729] [94730] [94731] [94732] [94733] [94734] [94735] [94736 ! TeX capacity exceeded, sorry [indirect objects table size=300000].
after patch 386 one won't need to care for this mem anymore as it would grow dynamically up to size=10.000.000. Then such an error would happen only in really extreme cases. Regards, Hartmut
participants (6)
- Hans Hagen
- Hartmut Henkel
- Heiko Oberdiek
- Martin Schröder
- Pawel Jackowski
- Taco Hoekwater