Creating identical PDF files with different pdfTeX runs
Hi,

we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").

For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With

\pdfinfo{/CreationDate (1980-09-09)}
\year=1980 \month=9 \day=9 \time=10
\pdfcompresslevel0

I get pdf files that can be compared, but only after some grepping:

egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' $< > $@

With dvipdfmx (and perhaps with newer pdfTeX versions than the one in teTeX), /CMap or something like this needs to be excluded, too.

Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?

TIA, Frank
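As a minimal sketch of the workflow described above: run pdfTeX twice on the same input and compare the results after stripping the run-dependent objects. The filenames, the batchmode invocation and the bash process substitution are illustrative assumptions; only the egrep pattern and the TeX settings come from the posting.

#!/bin/bash
# Sketch only: compare two pdfTeX runs after filtering out the lines that
# are expected to differ. test.tex is assumed to set
# \pdfinfo{/CreationDate (...)}, \year/\month/\day/\time and
# \pdfcompresslevel0 as shown above.
set -e

pdftex -interaction=batchmode test.tex
mv test.pdf run1.pdf
pdftex -interaction=batchmode test.tex
mv test.pdf run2.pdf

# Drop the objects known to vary between runs (/ID, /CreationDate,
# randomized font subset names, ...) before comparing.
strip_volatile() {
    egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' "$1"
}

if cmp -s <(strip_volatile run1.pdf) <(strip_volatile run2.pdf); then
    echo "PDFs identical after filtering"
else
    echo "PDFs differ even after filtering"
fi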
Frank Küster wrote:
Hi,
we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").
For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With
\pdfinfo{/CreationDate (1980-09-09)} \year=1980 \month=9 \day=9 \time=10 \pdfcompresslevel0
I get pdf files that can be compared, but only after some grepping:
egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' $< > $@
With dvipdfmx (and perhaps with newer pdfTeX versions than the one in teTeX) /CMap or something like this needs to be excluded, too.
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
i think that the chances are nil:

- pdf itself is moving, which may demand additional or different resources being added
- the pdftex stream depends on, for instance, cm directives and font references, and there have been changes in this area over time (improvements like collapsing, removing redundant code)
- macro packages may change their implementations of annotations, color, graphics and such, which results in different object ordering, numbering and content
- macro packages may add/support new features, which in turn may result in differences between pdf files
- macro packages may improve/change/patch special things (hz metrics and such)
- font resources may change (metrics are normally stable, but the rest may change)

the best you can do is not to look at the pdf file, but to parse the log for errors, like overfull boxes, which can be signals of old/new code doing weird things, missing fonts, map files, encodings and characters.

Hans
Hans Hagen
- pdf itself is moving, which may demand additional or different resources being added
- the pdftex stream depends on, for instance, cm directives and font references, and there have been changes in this area over time (improvements like collapsing, removing redundant code)
- macro packages may change their implementations of annotations, color, graphics and such, which results in different object ordering, numbering and content
- macro packages may add/support new features, which in turn may result in differences between pdf files
- macro packages may improve/change/patch special things (hz metrics and such)
- font resources may change (metrics are normally stable, but the rest may change)
I don't think that these arguments make such tests useless. If such changes occur, the tests will fail, and the known-good documents need to be regenerated and manually checked. However, most of the time this will *not* happen, and then the tests would be very helpful.
the best you can do is not to look at the pdf file, but to parse the log for errors, like overfull boxes which can be signals of old/new code doing weird things, missing fonts, map files, encodings and characters.
I don't think the log helps. The log files contain absolute paths, so they would need lots of replacements before you can even start comparing. They contain version information for the packages - but checking whether a new version gives identical results is one of our goals. And renamed files, or splitting a package into different input files loaded by the "master" file, would completely break when we try to automatically parse the log file.

Regards, Frank
Frank Küster wrote:
I don't think the log helps.
The log files contain absolute paths, so they would need lots of replacements before you can even start comparing. They contain version information for the packages - but checking whether a new version gives identical results is one of our goals. And renamed files, or splitting a package into different input files loaded by the "master" file, would completely break when we try to automatically parse the log file.
i had the impression that you wanted to test if the pdf files were ok, and then you don't need to compare with older files; just analyze the log for anomalies (like missing resources, which point to a problem in the font setup - the most probable source of problems); regression tests at the macro package (output) level are something else; afaik latex already has regression tests; as said: analyzing the page stream or objects is non-trivial, and even simple changes in pdftex or in annot/object/literal support will make a comparison nearly impossible

Hans
Hans Hagen
regression tests at the macro package (output) level are something else; afaik latex already has regression tests
Oh, very interesting - where are these?
as said: analyzing the page stream or objects is non-trivial, and even simple changes in pdftex or in annot/object/literal support will make a comparison nearly impossible
So maybe for pdfTeX development mainly the bitmap-based tests are useful; but for macro packages, based on a given pdfTeX version, I think the PDF-based tests make sense, too.

Regards, Frank
On 2006-03-15 14:07:16 +0100, Frank Küster wrote:
we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").
Good idea.
For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With
This will only get you so far. If you want to compare PDFs, you should not test for identical files, but for identical output: Render the PDF to a bitmap (e.g. with ghostscript) and compare the generated bitmaps. Otherwise your test will fail whenever the output of pdfTeX is changed in any way. Even then your tests will fail when we improve the typesetting of pdfTeX. :-}
\pdfinfo{/CreationDate (1980-09-09)} \year=1980 \month=9 \day=9 \time=10 \pdfcompresslevel0
I get pdf files that can be compared, but only after some grepping:
egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' $< > $@
A start would be setting the system date -- this would seed /ID and /CreationDate. Note that /ID also includes the filename.
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
Currently it's impossible (and that's a feature :). Adding the feature can't be that hard, though.

Best
Martin
Martin Schröder
On 2006-03-15 14:07:16 +0100, Frank Küster wrote:
we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").
Good idea.
For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With
This will only get you so far. If you want to compare PDFs, you should not test for identical files, but for identical output: Render the PDF to a bitmap (e.g. with ghostscript) and compare the generated bitmaps. Otherwise your test will fail whenever the output of pdfTeX is changed in any way.
That's a point; on the other hand, all I get then is glyphs on a page, but the information about the internal structure is lost: hyperlinks, character information (will a search for "fl" find the fl ligature, or will "ü" be pasted correctly?), etc.
Even then your tests will fail when we improve the typesetting of pdfTeX. :-}
I don't expect those tests, or rather the known-good documents, to be carved in stone. There will be changes on occasion which require manual checking of the new document. I guess it's best to have both kinds of test: Comparing the bitmaps gives information about the actual typesetting (and about the differences you wouldn't spot when checking by eye), comparing the pdf files gives additional information about the document structure.
A start would be setting the system date -- this would seed /ID and /CreationDate. Note that /ID also includes the filename.
The system date? You need to be root to do that, or is there a way to fake the system date in the local environment?
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
Currently it's impossible (and that's a feature :). Adding the feature can't be that hard, though.
I would very much appreciate this.

Regards, Frank
On 2006-03-15 15:06:43 +0100, Frank Küster wrote:
Martin Schröder wrote:
This will only get you so far. If you want to compare PDFs, you should not test for identical files, but for identical output: Render the PDF to a bitmap (e.g. with ghostscript) and compare the generated bitmaps. Otherwise your test will fail whenever the output of pdfTeX is changed in any way.
That's a point; on the other hand, all I get then is glyphs on a page, but the information about the internal structure is lost: hyperlinks, character information (will a search for "fl" find the fl ligature, or will "ü" be pasted correctly?), etc.
How far will you get with the logfile with a suitable combination of \tracingoutput etc.?
Even then your tests will fail when we improve the typesetting of pdfTeX. :-}
I don't expect those tests, or rather the known-good documents, to be carved in stone. There will be changes on occasion which require manual checking of the new document.
Agreed.
I guess it's best to have both kinds of test: Comparing the bitmaps gives information about the actual typesetting (and about the differences you wouldn't spot when checking by eye), comparing the pdf files gives additional information about the document structure.
Agreed.
A start would be setting the system date -- this would seed /ID and /CreationDate. Note that /ID also includes the filename.
The system date? You need to be root to do that, or is there a way to fake the system date in the local environment?
I don't know. What you would probably need is a way to dump and set all run-specific settings of pdfTeX (seeds for various random numbers, date, etc.).
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
Currently it's impossible (and that's a feature :). Adding the feature can't be that hard, though.
I would very much appreciate this.
File a feature request, please. :-)

Best
Martin
Martin Schröder
On 2006-03-15 15:06:43 +0100, Frank Küster wrote:
Martin Schröder wrote:
This will only get you so far. If you want to compare PDFs, you should not test for identical files, but for identical output: Render the PDF to a bitmap (e.g. with ghostscript) and compare the generated bitmaps. Otherwise your test will fail whenever the output of pdfTeX is changed in any way.
That's a point; on the other hand, all I get then is glyphs on a page, but the information about the internal structure is lost: hyperlinks, character information (will a search for "fl" find the fl ligature, or will "ü" be pasted correctly?), etc.
How far will you get with the logfile with a suitable combination of \tracingoutput etc.?
I don't know; I have never delved into font tracing commands. I'm also not sure that this is the best approach, because when we develop a testing framework, we want it to be extensible and easy to use, even for e.g. package authors. If they have a version of their package that produces a particular pdf feature (like a special internal hyperlink pointing to exactly the right place), they want to make sure that future versions don't lose that feature. For this, a suitable pdf file comparison is all that's needed. Parsing the log file might additionally be helpful to find out what's wrong, but it only works if the testing framework has already considered this feature.

Also, in case I'm one of the developers of this testing framework, I'd rather concentrate on implementing it cleanly, keeping it extensible, maintainable, etc., and not so much on learning about pdf internals I didn't need up to now, or pdfTeX-specific tracing possibilities.

Regards, Frank
"Frank" == Frank Küster
writes:
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
This would be useful for package authors who want to compare the output of different versions of their macro packages using the same version of pdftex. If you want to compare different versions of pdftex, your tests will hopefully always fail, and it is sufficient to compare file sizes.

For some time Hartmut has been trying to convince pdftex to produce significantly smaller pdf files, especially if font expansion is used. The current version includes the same font with a modified FontMatrix for each expansion factor, while the next version will embed each font only once. There are a few other things which will not provide such a large improvement and hence might be postponed to future versions. As an example, the width of glyphs can be specified by setting a variable or by providing an array. pdftex always provides arrays, but for monospaced fonts it is sufficient to set the variable and omit the array.

Comparing bitmaps, as some people already suggested, is a good thing. It shouldn't be too difficult to write a script which produces a bitmap file for each page (using ghostscript) and then creates a file which consists of lines like

<pagenumber> <md5sum of the bitmap file>

The bitmap files can be removed by the script when it is finished, and standard UNIX tools can be used to examine the output files. In particular, diff(1) can be used efficiently: it will tell you the numbers of the pages which are different.

However, it would be nice if the pdftex version number could be retrieved more easily. At the moment the options "-v" and "--version" are both quite verbose and provide copyright stuff. Maybe one of these options should provide the pdftex version ("1.30.7-beta" for example) only and nothing else. A script might want to insert the pdftex version number into its output filename, but at the moment it can only be done using something like sed|awk|perl. And maybe the output of pdftex {-v,--version} will change in the future, which will break such scripts.
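A small shell sketch of this per-page checksum idea follows; the ghostscript device, the resolution and the file layout are illustrative assumptions, not something prescribed in the thread.

#!/bin/bash
# Sketch: print one "<pagenumber> <md5sum of the bitmap file>" line per page,
# suitable for diff(1) against a stored known-good listing.
# pbmraw renders 1-bit bitmaps, i.e. without antialiasing.
set -e
pdf="$1"
tmp=$(mktemp -d)

gs -q -dNOPAUSE -dBATCH -sDEVICE=pbmraw -r300 \
   -sOutputFile="$tmp/page-%04d.pbm" "$pdf"

page=0
for bitmap in "$tmp"/page-*.pbm; do
    page=$((page + 1))
    printf '%d %s\n' "$page" "$(md5sum < "$bitmap" | cut -d' ' -f1)"
done

rm -rf "$tmp"

Run once against a reference PDF and once against the new one, and diff(1) on the two listings reports exactly the pages that changed.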
Sorry, I accidentally sent the previous mail before it was ready. There is another point I'd like to put on my wish list.

When Knuth developed TeX, he obviously did not take into account that people want to pipe STDOUT into other programs. It would be nice to have an option "-q" or "--quiet" which instructs TeX to suppress any output except what is requested explicitly by \write16, for instance. If you want to determine the number of pages in a particular pdf file, you can say

pdftex '\pdfximage page 1 {file.pdf}\write16{\the\pdflastximagepages}\end'

but you get a lot of more or less useless messages and have to filter the output somehow.

Regards, Reinhard
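To illustrate the kind of filtering that is currently necessary (this is an assumed workaround, not an existing pdfTeX option), one can run the one-liner in batch mode with a marker string and pull the number back out of the log:

#!/bin/bash
# Sketch: count the pages of a PDF via \pdfximage, as quoted above.
# The PAGES= marker and the grep filtering are this sketch's own convention;
# pdfTeX itself has no --quiet mode, which is the point being made.
pdf="$1"

# When TeX code is passed on the command line, the job name defaults to
# "texput", so the \message text ends up in texput.log.
pdftex -interaction=batchmode \
  "\\pdfximage page 1 {$pdf}\\message{PAGES=\\the\\pdflastximagepages}\\end" \
  >/dev/null || true

grep -o 'PAGES=[0-9]*' texput.log | head -n 1 | cut -d= -f2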
Reinhard Kotucha
Comparing bitmaps, as some people already suggested, is a good thing.
It shouldn't be too difficult to write a script which produces a bitmap file for each page (using ghostscript) and then creates a file which consists of lines like
<pagenumber> <md5sum of the bitmap file>
The bitmap files can be removed by the script when it is finished and standard UNIX tools can be used to examine the output files.
Particularly, diff(1) can be used efficiently. It will tell you the numbers of the pages which are different.
That's a very good suggestion, thanks!

Regards, Frank
On Thu, 16 Mar 2006, Frank Küster wrote:
Reinhard Kotucha wrote:
<pagenumber> <md5sum of the bitmap file>
The bitmap files can be removed by the script when it is finished and standard UNIX tools can be used to examine the output files.
Particularly, diff(1) can be used efficiently. It will tell you the numbers of the pages which are different.
That's a very good suggestion, thanks!
this looks pretty fragile to me. Characters will end up in bitmaps with interpolated gray pixels, and so it depends not only on pdftex but also on any subtlety of the rendering engine. And if the md5sum doesn't match, you know nothing without the original file. Maybe some cross-correlation between images with some given tolerance limit would be safer.

Regards, Hartmut
Hartmut Henkel
On Thu, 16 Mar 2006, Frank Küster wrote:
Reinhard Kotucha wrote:
<pagenumber> <md5sum of the bitmap file>
The bitmap files can be removed by the script when it is finished and standard UNIX tools can be used to examine the output files.
Particularly, diff(1) can be used efficiently. It will tell you the numbers of the pages which are different.
That's a very good suggestion, thanks!
this looks pretty fragile to me. Characters will end up in bitmaps with interpolated gray pixels, and so it depends not only on pdftex but also on any subtlety of the rendering engine. And if the md5sum doesn't match, you know nothing without the original file.
Yes, we definitely need the original file.
Maybe some crosscorrelation between images with some given tolerance limit would be safer.
Any hints on what I could use for such a purpose?

Regards, Frank
On Thu, 16 Mar 2006, Frank Küster wrote:
Hartmut Henkel wrote:
Yes, we definitely need the original file.
Maybe some crosscorrelation between images with some given tolerance limit would be safer.
Any hints what I could use for such a purpose?
just found from Netpbm:

NAME
    pnmpsnr - compute the difference between two portable anymaps

SYNOPSIS
    pnmpsnr [pnmfile1] [pnmfile2]

DESCRIPTION
    Reads two PBM, PGM, or PPM files, or PAM equivalents, as input. Prints the peak signal-to-noise ratio (PSNR) difference between the two images. This metric is typically used in image compression papers to rate the distortion between original and decoded image.

And in ImageMagick there is a -compose operator with Difference and Multiply options.

Regards, Hartmut
pnmpsnr seems to work quite nicely, e.g. with some file xx.tex:

\input tufte
\input tufte
\input tufte
\hbox{\kern 0.1pt a}
\input tufte
\input tufte
\bye

against another file with a 0pt instead of a 0.1pt kern, one gets:

$ pdftoppm -r 1200 xx.pdf xx
$ pdftoppm -r 1200 yy.pdf yy
$ pnmpsnr xx-000001.ppm yy-000001.ppm
pnmpsnr: PSNR between xx-000001.ppm and yy-000001.ppm:
pnmpsnr: Y color component: 59.18 dB
pnmpsnr: Cb color component doesn't differ.
pnmpsnr: Cr color component doesn't differ.

(seems there is a spurious blank in the output :-) but such a tiny displacement needs rendering at 1200 dpi, which gives large files and is slow...

Regards, Hartmut
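A possible way to script a tolerance check around pnmpsnr is sketched below; the page naming follows pdftoppm's xx-000001.ppm convention from the example, while the dB threshold and the parsing of pnmpsnr's output are assumptions that may need adjusting to the installed Netpbm version.

#!/bin/bash
# Sketch: flag pages whose PSNR against the reference falls below a threshold.
set -e
threshold=80   # dB; an arbitrary cut-off, tune for your documents

pdftoppm -r 300 old.pdf old
pdftoppm -r 300 new.pdf new

for a in old-*.ppm; do
    b=new-${a#old-}
    # Look at the luminance (Y) component only; "doesn't differ" means the
    # pages are identical in that component.
    line=$(pnmpsnr "$a" "$b" 2>&1 | grep 'Y color component' || true)
    if [ -z "$line" ]; then
        echo "could not parse pnmpsnr output for $a" >&2
        continue
    fi
    case "$line" in
        *"doesn't differ"*) continue ;;
    esac
    psnr=$(echo "$line" | grep -o '[0-9][0-9.]*' | tail -n 1)
    if [ "$(echo "$psnr < $threshold" | bc -l)" = 1 ]; then
        echo "$a vs $b: PSNR $psnr dB below threshold"
    fi
done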
On 2006-03-16 20:08:02 +0100, Hartmut Henkel wrote:
And in ImageMagick there is a -compose operator with Difference and Multiply option.
This is exactly what's needed. I tried this (compare sample2e with sample2e & lm):

- render the pdfs with gs to png; the resulting pngs are not large (ca. 150k/page for sample2e) and can be tested with cmp/md5sum for identity
- generate diff images with "composite -compose difference -negate"; this eats memory (1G at 600dpi), but produces what's needed.

A pointer: http://jpdfunit.sourceforge.net/ "JpdfUnit is a framework for testing a generated pdf document with the JUnit test framework so JPdfUnit is a high level api."

Best
Martin
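To make the two-step recipe above concrete, here is one possible sketch; the resolution, the png16m device and the file naming are illustrative choices, and the composite/convert calls just follow the "-compose difference -negate" idea rather than any exact command from the thread.

#!/bin/bash
# Sketch: render reference and candidate PDFs to PNGs, skip pixel-identical
# pages cheaply, and build negated difference images only for pages that
# actually differ.
set -e

gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 \
   -sOutputFile=ref-%03d.png reference.pdf
gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 \
   -sOutputFile=new-%03d.png candidate.pdf

for ref in ref-*.png; do
    new=new-${ref#ref-}
    if cmp -s "$ref" "$new"; then
        continue                       # identical page, nothing to inspect
    fi
    diffimg=diff-${ref#ref-}
    composite -compose difference "$ref" "$new" "$diffimg"
    convert "$diffimg" -negate "$diffimg"   # mostly white, changed pixels dark
    echo "page differs: $ref (see $diffimg)"
done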
"Hartmut" == Hartmut Henkel
writes:
On Thu, 16 Mar 2006, Frank Küster wrote:
Reinhard Kotucha wrote:
<pagenumber> <md5sum of the bitmap file>
The bitmap files can be removed by the script when it is finished and standard UNIX tools can be used to examine the output files.
Particularly, diff(1) can be used efficiently. It will tell you the numbers of the pages which are different.
That's a very good suggestion, thanks!
this looks pretty fragile to me. Characters will end up in bitmaps with interpolated gray pixels, and so it depends not only on pdftex but also on any subtlety of the rendering engine.
No, not every ghostscript output device does antialiasing. Usually antialiasing is done for screen rendering only. And even there you can use -sDEVICE=x11 instead of x11alpha. You can try faxg3 or pcxmono or something like that.
And if the md5sum doesn't match, you know nothing without the original file. Maybe some crosscorrelation between images with some given tolerance limit would be safer.
...and you don't know anything either. The question was whether files are identical, not similar. This can be achieved as I described, given that the bitmaps are produced at a reasonably high resolution (and it does not matter whether antialiasing is turned on). Of course, you always have to use the same version of the program which produces the bitmaps. If you want to see the differences, you need a program which displays all pixels which are different in two bitmap files. But I suppose that you want to check whether two bitmaps are different before you use such a tool.
pnmpsnr: Y color component: 59.18 dB
Well, it just tells you that the files are different; the actual value does not provide any useful information. I think two tools are needed: one which tells you which pages are different, and one which makes the changes visible.

Regards, Reinhard
there are a few places where pdftex takes the current time into account to generate certain tags. Implementing a primitive to leave out date-related things is not difficult. Can you please explain further what it would be useful for?

Also note that even when such a primitive is available, the chance that you get the same output for a given input file is rather low. You need to ensure that *all* relevant files (like font-related stuff, macros, figures etc.) are also the same.

Thanh

On Wed, Mar 15, 2006 at 02:07:16PM +0100, Frank Küster wrote:
Hi,
we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").
For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With
\pdfinfo{/CreationDate (1980-09-09)} \year=1980 \month=9 \day=9 \time=10 \pdfcompresslevel0
I get pdf files that can be compared, but only after some grepping:
egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' $< > $@
With dvipdfmx (and perhaps with newer pdfTeX versions than the one in teTeX) /CMap or something like this needs to be excluded, too.
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
TIA, Frank
On Thu, 16 Mar 2006, The Thanh Han wrote:
You need to ensure that *all* relevant files (like font-related stuff, macro, figure etc.) are also the same.
...which also means, in the case of embedded PDFs, that they come from the same path; see e.g.:

/PTEX.FileName (./example.pdf)

Regards, Hartmut
The Thanh Han
there are a few places where pdftex takes the current time into account to generate certain tags. Implementing a primitive to leave out date-related things is not difficult. Can you please explain further what it would be useful for?
1. Regression tests for package writers. I have already set up a simple version of this in one or two LaTeX packages of mine, and it was very helpful when I decided to make big changes to the implementation without changing the interface. I always wanted to generalize this, but never got around to it.

2. Regression testing for TeX distributions or, e.g., rpm or deb packages of texlive or teTeX for some Linux distribution. When internals change (like rearranging TEXMF trees, putting fonts in other packages, changing dependencies, etc.), the documents should not change. This is actually the reason why we are now talking about this: there are a couple of Debian developers who invest a lot of effort in automated package tests, and one of them asked whether such a thing was possible in teTeX. We started discussing, and now I'm here.

3. Regression testing in pdfTeX: A change that only aims at e.g. performance enhancements should not alter the document. Many other changes of course will change it.
Also note that even when such a primitive is available, the chance you get the same output for a given input file is rather low. You need to ensure that *all* relevant files (like font-related stuff, macro, figure etc.) are also the same.
That means that such tests would have to be done in a "controlled environment". That's easy for application 2, and probably not too hard for 1. For 3, it's not trivial - I don't think you can run the tests after compiling, as other programs (e.g. perl) do, and hope they will succeed on any system. But if you recreate the known-good data on *your* system just before applying a problematic patch, and test afterwards, that should work.

Regards, Frank
3. Regression testing in pdfTeX: A change that only aims at e.g. performance enhancements should not alter the document. Many other changes of course will change it.
i think that you must define different levels of similarity:

- pure text: in that case a bitmap, as already discussed, is needed
- functionality (annotations and such): that needs to take place at the pdf level, i.e. filtering resources and descriptions and comparing them (e.g. annotation names, rectangles, etc.)
- font resources

i can imagine that your test of text similarity is run with disabled interactive features

the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)

Hans
On 2006-03-16 22:39:10 +0100, Hans Hagen wrote:
the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)
I still think that this information can be found with \tracingoutput etc.

Best
Martin
On Thu, Mar 16, 2006 at 11:06:57PM +0100, Martin Schröder wrote:
On 2006-03-16 22:39:10 +0100, Hans Hagen wrote:
the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)
I still think that this information can be found with \tracingoutput etc.
TeX's \tracing output is a very bad source for automatic comparisons:
TeX just prints the formatted result as text:
* Funny linebreaks.
* Sometimes the output is limited and marked by "etc.".
* Information is lost: the .log file doesn't tell the catcode of character tokens; you cannot even see what a token is:
\def\msg#{\immediate\write16}
\def\testA{\foobar}
\msg{\string\testA: [\meaning\testA]}
\edef\testB{\string\foobar\space}
\msg{\string\testB: [\meaning\testB]}
\msg{%
\noexpand\testA and \noexpand\testB
\ifx\testA\testB
are equal%
\else
differ%
\fi
.%
}
\end
\testA expands to one command token; the expansion of \testB consists of seven other tokens and one space token. But the output in the .log file is equal in both cases.
* Semantics is lost. The .log file must be analyzed to get some of the semantics back: what is a box listing, what is a macro expansion, ...
Thus it would be nice to have a debug/tracing file where the output is written as XML with semantics. Then macro definitions, box contents, ... could be compared more easily, and the differences could be detected and marked more easily.

For debugging and analyzing packages it would be useful to have a command that generates a kind of snapshot, e.g. all macros with their meanings or the status of registers. Then two stages, e.g. before and after package loading or some action, could be compared: which macros or registers are different, ... This could be used to show the wanted effect and to detect unwanted side effects, ...
To retain compatibility, the tracing/debug output could go into a separate debug file, driven by additional commands: counterparts of the \tracing commands that write into the new debug file, plus additional commands such as the snapshot commands, e.g.
\xtracingoutputfile{\jobname-debug.xml}
\xtracingmacros=1
\xtracing...
\xsnapshot{count}% all count registers <> 0
\xsnapshot{macro}% all defined command tokens with meaning
...
The result would be written to \jobname-debug.xml and could then be processed further: filtering, statistics, displaying comparisons, ...
Yours sincerely
Heiko
On 3/16/06, Hans Hagen wrote:
3. Regression testing in pdfTeX: A change that only aims at e.g. performance enhancements should not alter the document. Many other changes of course will change it.
i think that you must define different levels of similarity:
- pure text: in that case a bitmap, as already discussed, is needed
- functionality (annotations and such): that needs to take place at the pdf level, i.e. filtering resources and descriptions and comparing them (e.g. annotation names, rectangles, etc.)
- font resources
If we focus on differences between pdf documents and forget for the moment that we are using pdftex, these levels apply to a much wider class of documents. I suspect it is too early to know what sort of differences will occur, so the first step is to find tools to analyze differences between PDF documents. Many people who have never heard of pdftex would like to be able to compare PDFs and analyze the contents, so if you create a good initial framework there will be lots of help.

I've never used the commercial preflight tools, but based on the 4th-hand reports I get (printer operator --> cust. service --> editor --> me) I gather that the commercial tools can provide pretty detailed reports (e.g., a figure on page N has yellow in the "black" annotation text).

I'm not sure that rasterizing is the best approach. What about a ghostscript device that generates various debugging summaries? Some things that would be useful:

1. a list of objects with md5sums and the page numbers where each object is used -- this would be used to focus in on areas where documents differ. If files differ on a small number of pages then comparing bitmaps might be useful and economical of resources.

2. a human-readable summary of a specified object

We already have tools to pull out specified pages and extract text.
i can imagine that your test of text similarity is run with disabled interactive features
the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)
It would be better to focus on things that can be used with other pdf-generating applications. Many pdftex documents have pdfs from sources other than pdftex embedded. It would be worth thinking about ways to add metadata to pdfs to help the analyzers, but if we rely on the pdftex logs then we have to develop all the tools ourselves, and if the difference is in an input pdf our tools may not work. If we produce a good framework, I suspect the larger community of pdf users will become engaged and accomplish much more than the pdftex community can afford.
--
George N. White III
participants (8)

- Frank Küster
- George N. White III
- Hans Hagen
- Hartmut Henkel
- Heiko Oberdiek
- Martin Schröder
- Reinhard Kotucha
- The Thanh Han