Creating identical PDF files with different pdfTeX runs
Hi,

we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").

For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With

\pdfinfo{/CreationDate (1980-09-09)}
\year=1980 \month=9 \day=9 \time=10
\pdfcompresslevel0

I get pdf files that can be compared, but only after some grepping:

egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' $< > $@

With dvipdfmx (and perhaps with newer pdfTeX versions than the one in teTeX), /CMap or something like this needs to be excluded, too.

Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?

TIA, Frank
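As a minimal sketch of the workflow described above: run pdfTeX twice on the same input and compare the results after stripping the run-dependent objects. The filenames, the batchmode invocation and the bash process substitution are illustrative assumptions; only the egrep pattern and the TeX settings come from the posting.

#!/bin/bash
# Sketch only: compare two pdfTeX runs after filtering out the lines that
# are expected to differ. test.tex is assumed to set
# \pdfinfo{/CreationDate (...)}, \year/\month/\day/\time and
# \pdfcompresslevel0 as shown above.
set -e

pdftex -interaction=batchmode test.tex
mv test.pdf run1.pdf
pdftex -interaction=batchmode test.tex
mv test.pdf run2.pdf

# Drop the objects known to vary between runs (/ID, /CreationDate,
# randomized font subset names, ...) before comparing.
strip_volatile() {
    egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' "$1"
}

if cmp -s <(strip_volatile run1.pdf) <(strip_volatile run2.pdf); then
    echo "PDFs identical after filtering"
else
    echo "PDFs differ even after filtering"
fi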
Frank Küster wrote:
Hi,
we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").
For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With
\pdfinfo{/CreationDate (1980-09-09)} \year=1980 \month=9 \day=9 \time=10 \pdfcompresslevel0
I get pdf files that can be compared, but only after some grepping:
egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' $< > $@
With dvipdfmx (and perhaps with newer pdfTeX versions than the one in teTeX) /CMap or something like this needs to be excluded, too.
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
i think that the chances are nil:

- pdf itself is moving, which may demand additional or different resources being added
- the pdftex stream depends on, for instance, cm directives and font references, and there have been changes in this area over time (improvements like collapsing, removing redundant code)
- macro packages may change their implementations of annotations, color, graphics and such, which results in different object ordering, numbering and content
- macro packages may add/support new features, which in turn may result in differences between pdf files
- macro packages may improve/change/patch special things (hz metrics and such)
- font resources may change (metrics are normally stable, but the rest may change)

the best you can do is not to look at the pdf file, but to parse the log for errors, like overfull boxes, which can be signals of old/new code doing weird things, missing fonts, map files, encodings and characters.

Hans
Hans Hagen
- pdf itself is moving, which may demand additional or different resources being added
- the pdftex stream depends on, for instance, cm directives and font references, and there have been changes in this area over time (improvements like collapsing, removing redundant code)
- macro packages may change their implementations of annotations, color, graphics and such, which results in different object ordering, numbering and content
- macro packages may add/support new features, which in turn may result in differences between pdf files
- macro packages may improve/change/patch special things (hz metrics and such)
- font resources may change (metrics are normally stable, but the rest may change)
I don't think that these arguments make such tests useless. If such changes occur, the tests will fail, and the known-good documents need to be regenerated and manually checked. However, most of the time this will *not* happen, and then the tests would be very helpful.
the best you can do is not to look at the pdf file, but to parse the log for errors, like overfull boxes which can be signals of old/new code doing weird things, missing fonts, map files, encodings and characters.
I don't think the log helps. The log files contain absolute paths, so they would need lots of replacements before you can even start comparing. They contain version information for the packages - but checking whether a new version gives identical results is one of our goals. And renamed files, or splitting a package into different input files loaded by the "master" file, would completely break when we try to automatically parse the log file.

Regards, Frank
Frank Küster wrote:
I don't think the log helps.
The log files contain absolute paths, so they would need lots of replacements before you can even start comparing. They contain version information for the packages - but checking whether a new version gives identical results is one of our goals. And renamed files, or splitting a package into different input files loaded by the "master" file, would completely break when we try to automatically parse the log file.
i had the impression that you wanted to test if the pdf files were ok, and then you don't need to compare with older files; just analyze the log for anomalies (like missing resources, which point to a problem in the font setup - the most probable source of problems); regression tests at the macro package (output) level are something else; afaik latex already has regression tests; as said: analyzing the page stream or objects is non-trivial, and even simple changes in pdftex or in annot/object/literal support will make a comparison nearly impossible

Hans
Hans Hagen
regression tests at the macro package (output) level are something else; afaik latex already has regression tests
Oh, very interesting - where are these?
as said: analyzing the page stream or objects is non-trivial, and even simple changes in pdftex or in annot/object/literal support will make a comparison nearly impossible
So maybe for pdfTeX development mainly the bitmap-based tests are useful; but for macro packages, based on a given pdfTeX version, I think the PDF-based tests make sense, too.

Regards, Frank
On 2006-03-15 14:07:16 +0100, Frank Küster wrote:
we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").
Good idea.
For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With
This will only get you so far. If you want to compare PDFs, you should not test for identical files, but for identical output: Render the PDF to a bitmap (e.g. with ghostscript) and compare the generated bitmaps. Otherwise your test will fail whenever the output of pdfTeX is changed in any way. Even then your tests will fail when we improve the typesetting of pdfTeX. :-}
\pdfinfo{/CreationDate (1980-09-09)} \year=1980 \month=9 \day=9 \time=10 \pdfcompresslevel0
I get pdf files that can be compared, but only after some grepping:
egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' $< > $@
A start would be setting the system date -- this would seed /ID and /CreationDate. Note that /ID also includes the filename.
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
Currently it's impossible (and that's a feature :). Adding the feature can't be that hard, though.

Best
Martin
Martin Schröder
On 2006-03-15 14:07:16 +0100, Frank Küster wrote:
we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").
Good idea.
For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With
This will only get you so far. If you want to compare PDFs, you should not test for identical files, but for identical output: Render the PDF to a bitmap (e.g. with ghostscript) and compare the generated bitmaps. Otherwise your test will fail whenever the output of pdfTeX is changed in any way.
That's a point; on the other hand, all I get then is glyphs on a page, but the information about the internal structure is lost: hyperlinks, character information (will a search for "fl" find the fl ligature, or will "ü" be pasted correctly?), etc.
Even then your tests will fail when we improve the typesetting of pdfTeX. :-}
I don't expect those tests, or rather the known-good documents, to be carved in stone. There will be changes on occasion which require manual checking of the new document. I guess it's best to have both kinds of test: Comparing the bitmaps gives information about the actual typesetting (and about the differences you wouldn't spot when checking by eye), comparing the pdf files gives additional information about the document structure.
A start would be setting the system date -- this would seed /ID and /CreationDate. Note that /ID also includes the filename.
The system date? You need to be root to do that, or is there a way to fake the system date in the local environment?
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
Currently it's impossible (and that's a feature :). Adding the feature can't be that hard, though.
I would very much appreciate this.

Regards, Frank
On 2006-03-15 15:06:43 +0100, Frank Küster wrote:
Martin Schröder wrote:
This will only get you so far. If you want to compare PDFs, you should not test for identical files, but for identical output: Render the PDF to a bitmap (e.g. with ghostscript) and compare the generated bitmaps. Otherwise your test will fail whenever the output of pdfTeX is changed in any way.
That's a point; on the other hand, all I get then is glyphs on a page, but the information about the internal structure is lost: hyperlinks, character information (will a search for "fl" find the fl ligature, or will "ü" be pasted correctly?), etc.
How far will you get with the logfile with a suitable combination of \tracingoutput etc.?
Even then your tests will fail when we improve the typesetting of pdfTeX. :-}
I don't expect those tests, or rather the known-good documents, to be carved in stone. There will be changes on occasion which require manual checking of the new document.
Agreed.
I guess it's best to have both kinds of test: Comparing the bitmaps gives information about the actual typesetting (and about the differences you wouldn't spot when checking by eye), comparing the pdf files gives additional information about the document structure.
Agreed.
A start would be setting the system date -- this would seed /ID and /CreationDate. Note that /ID also includes the filename.
The system date? You need to be root to do that, or is there a way to fake the system date in the local environment?
I don't know. What you would probably need is a way to dump and set all run-specific settings of pdfTeX (seeds for various random numbers, date, etc.).
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
Currently it's impossible (and that's a feature :). Adding the feature can't be that hard, though.
I would very much appreciate this.
File a feature request, please. :-)

Best
Martin
Martin Schröder
On 2006-03-15 15:06:43 +0100, Frank Küster wrote:
Martin Schröder wrote:
This will only get you so far. If you want to compare PDFs, you should not test for identical files, but for identical output: Render the PDF to a bitmap (e.g. with ghostscript) and compare the generated bitmaps. Otherwise your test will fail whenever the output of pdfTeX is changed in any way.
That's a point; on the other hand, all I get then is glyphs on a page, but the information about the internal structure is lost: hyperlinks, character information (will a search for "fl" find the fl ligature, or will "ü" be pasted correctly?), etc.
How far will you get with the logfile with a suitable combination of \tracingoutput etc.?
I don't know; I have never delved into font tracing commands. I'm also not sure that this is the best approach, because when we develop a testing framework, we want it to be extensible and easy to use, even for e.g. package authors. If they have a version of their package that produces a particular pdf feature (like a special internal hyperlink pointing to exactly the right place), they want to make sure that future versions don't lose that feature. For this, a suitable pdf file comparison is all that's needed. Parsing the log file might additionally be helpful to find out what's wrong, but it only works if the testing framework has already considered this feature.

Also, in case I'm one of the developers of this testing framework, I'd rather concentrate on implementing it cleanly, keeping it extensible, maintainable, etc., and not so much on learning about pdf internals I didn't need up to now, or pdfTeX-specific tracing possibilities.

Regards, Frank
"Frank" == Frank Küster
writes:
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
This would be useful for package authors who want to compare the output of different versions of their macro packages using the same version of pdftex. If you want to compare different versions of pdftex, your tests will hopefully always fail, and it is sufficient to compare file sizes.

For some time Hartmut has been trying to convince pdftex to produce significantly smaller pdf files, especially if font expansion is used. The current version includes the same font with a modified FontMatrix for each expansion factor, while the next version will embed each font only once. There are a few other things which will not provide such a large improvement and hence might be postponed to future versions. As an example, the width of glyphs can be specified by setting a variable or by providing an array. pdftex always provides arrays, but for monospaced fonts it is sufficient to set the variable and omit the array.

Comparing bitmaps, as some people already suggested, is a good thing. It shouldn't be too difficult to write a script which produces a bitmap file for each page (using ghostscript) and then creates a file which consists of lines like

<pagenumber> <md5sum of the bitmap file>

The bitmap files can be removed by the script when it is finished, and standard UNIX tools can be used to examine the output files. In particular, diff(1) can be used efficiently: it will tell you the numbers of the pages which are different.

However, it would be nice if the pdftex version number could be retrieved more easily. At the moment the options "-v" and "--version" are both quite verbose and provide copyright stuff. Maybe one of these options should provide the pdftex version ("1.30.7-beta" for example) only and nothing else. A script might want to insert the pdftex version number into its output filename, but at the moment it can only be done using something like sed|awk|perl. And maybe the output of pdftex {-v,--version} will change in the future, which will break such scripts.
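A small shell sketch of this per-page checksum idea follows; the ghostscript device, the resolution and the file layout are illustrative assumptions, not something prescribed in the thread.

#!/bin/bash
# Sketch: print one "<pagenumber> <md5sum of the bitmap file>" line per page,
# suitable for diff(1) against a stored known-good listing.
# pbmraw renders 1-bit bitmaps, i.e. without antialiasing.
set -e
pdf="$1"
tmp=$(mktemp -d)

gs -q -dNOPAUSE -dBATCH -sDEVICE=pbmraw -r300 \
   -sOutputFile="$tmp/page-%04d.pbm" "$pdf"

page=0
for bitmap in "$tmp"/page-*.pbm; do
    page=$((page + 1))
    printf '%d %s\n' "$page" "$(md5sum < "$bitmap" | cut -d' ' -f1)"
done

rm -rf "$tmp"

Run once against a reference PDF and once against the new one, and diff(1) on the two listings reports exactly the pages that changed.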
Sorry, I accidentally sent the previous mail before it was ready. There is another point I'd like to put on my wish list.

When Knuth developed TeX, he obviously did not take into account that people want to pipe STDOUT into other programs. It would be nice to have an option "-q" or "--quiet" which instructs TeX to suppress any output except what is requested explicitly by \write16, for instance. If you want to determine the number of pages in a particular pdf file, you can say

pdftex '\pdfximage page 1 {file.pdf}\write16{\the\pdflastximagepages}\end'

but you get a lot of more or less useless messages and have to filter the output somehow.

Regards, Reinhard
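To illustrate the kind of filtering that is currently necessary (this is an assumed workaround, not an existing pdfTeX option), one can run the one-liner in batch mode with a marker string and pull the number back out of the log:

#!/bin/bash
# Sketch: count the pages of a PDF via \pdfximage, as quoted above.
# The PAGES= marker and the grep filtering are this sketch's own convention;
# pdfTeX itself has no --quiet mode, which is the point being made.
pdf="$1"

# When TeX code is passed on the command line, the job name defaults to
# "texput", so the \message text ends up in texput.log.
pdftex -interaction=batchmode \
  "\\pdfximage page 1 {$pdf}\\message{PAGES=\\the\\pdflastximagepages}\\end" \
  >/dev/null || true

grep -o 'PAGES=[0-9]*' texput.log | head -n 1 | cut -d= -f2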
Reinhard Kotucha
Comparing bitmaps, as some people already suggested, is a good thing.
It shouldn't be too difficult to write a script which produces a bitmap file for each page (using ghostscript) and then creates a file which consists of lines like
<pagenumber> <md5sum of the bitmap file>
The bitmap files can be removed by the script when it is finished and standard UNIX tools can be used to examine the output files.
Particularly, diff(1) can be used efficiently. It will tell you the numbers of the pages which are different.
That's a very good suggestion, thanks!

Regards, Frank
On Thu, 16 Mar 2006, Frank Küster wrote:
Reinhard Kotucha wrote:
<pagenumber> <md5sum of the bitmap file>
The bitmap files can be removed by the script when it is finished and standard UNIX tools can be used to examine the output files.
Particularly, diff(1) can be used efficiently. It will tell you the numbers of the pages which are different.
That's a very good suggestion, thanks!
this looks pretty fragile to me. Characters will end up in bitmaps with interpolated gray pixels, and so it depends not only on pdftex but also on any subtlety of the rendering engine. And if the md5sum doesn't match, you know nothing without the original file. Maybe some cross-correlation between images with some given tolerance limit would be safer.

Regards, Hartmut
Hartmut Henkel
On Thu, 16 Mar 2006, Frank Küster wrote:
Reinhard Kotucha wrote:
<pagenumber> <md5sum of the bitmap file>
The bitmap files can be removed by the script when it is finished and standard UNIX tools can be used to examine the output files.
Particularly, diff(1) can be used efficiently. It will tell you the numbers of the pages which are different.
That's a very good suggestion, thanks!
this looks pretty fragile to me. Characters will end up in bitmaps with interpolated gray pixels, and so it depends not only on pdftex but also on any subtlety of the rendering engine. And if the md5sum doesn't match, you know nothing without the original file.
Yes, we definitely need the original file.
Maybe some crosscorrelation between images with some given tolerance limit would be safer.
Any hints on what I could use for such a purpose?

Regards, Frank
On Thu, 16 Mar 2006, Frank Küster wrote:
Hartmut Henkel wrote:
Yes, we definitely need the original file.
Maybe some crosscorrelation between images with some given tolerance limit would be safer.
Any hints what I could use for such a purpose?
just found from Netpbm:

NAME
    pnmpsnr - compute the difference between two portable anymaps

SYNOPSIS
    pnmpsnr [pnmfile1] [pnmfile2]

DESCRIPTION
    Reads two PBM, PGM, or PPM files, or PAM equivalents, as input. Prints the peak signal-to-noise ratio (PSNR) difference between the two images. This metric is typically used in image compression papers to rate the distortion between original and decoded image.

And in ImageMagick there is a -compose operator with Difference and Multiply options.

Regards, Hartmut
pnmpsnr seems to work quite nicely, e.g. with some file xx.tex:

\input tufte
\input tufte
\input tufte
\hbox{\kern 0.1pt a}
\input tufte
\input tufte
\bye

against another file with a 0pt instead of a 0.1pt kern, one gets:

$ pdftoppm -r 1200 xx.pdf xx
$ pdftoppm -r 1200 yy.pdf yy
$ pnmpsnr xx-000001.ppm yy-000001.ppm
pnmpsnr: PSNR between xx-000001.ppm and yy-000001.ppm:
pnmpsnr: Y color component: 59.18 dB
pnmpsnr: Cb color component doesn't differ.
pnmpsnr: Cr color component doesn't differ.

(seems there is a spurious blank in the output :-) but such a tiny displacement needs rendering at 1200 dpi, which gives large files and is slow...

Regards, Hartmut
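A possible way to script a tolerance check around pnmpsnr is sketched below; the page naming follows pdftoppm's xx-000001.ppm convention from the example, while the dB threshold and the parsing of pnmpsnr's output are assumptions that may need adjusting to the installed Netpbm version.

#!/bin/bash
# Sketch: flag pages whose PSNR against the reference falls below a threshold.
set -e
threshold=80   # dB; an arbitrary cut-off, tune for your documents

pdftoppm -r 300 old.pdf old
pdftoppm -r 300 new.pdf new

for a in old-*.ppm; do
    b=new-${a#old-}
    # Look at the luminance (Y) component only; "doesn't differ" means the
    # pages are identical in that component.
    line=$(pnmpsnr "$a" "$b" 2>&1 | grep 'Y color component' || true)
    if [ -z "$line" ]; then
        echo "could not parse pnmpsnr output for $a" >&2
        continue
    fi
    case "$line" in
        *"doesn't differ"*) continue ;;
    esac
    psnr=$(echo "$line" | grep -o '[0-9][0-9.]*' | tail -n 1)
    if [ "$(echo "$psnr < $threshold" | bc -l)" = 1 ]; then
        echo "$a vs $b: PSNR $psnr dB below threshold"
    fi
done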
On 2006-03-16 20:08:02 +0100, Hartmut Henkel wrote:
And in ImageMagick there is a -compose operator with Difference and Multiply option.
This is exactly what's needed. I tried this (compare sample2e with sample2e & lm):

- render the pdfs with gs to png; the resulting pngs are not large (ca. 150k/page for sample2e) and can be tested with cmp/md5sum for identity
- generate diff images with "composite -compose difference -negate"; this eats memory (1G at 600dpi), but produces what's needed.

A pointer: http://jpdfunit.sourceforge.net/ "JpdfUnit is a framework for testing a generated pdf document with the JUnit test framework so JPdfUnit is a high level api."

Best
Martin
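To make the two-step recipe above concrete, here is one possible sketch; the resolution, the png16m device and the file naming are illustrative choices, and the composite/convert calls just follow the "-compose difference -negate" idea rather than any exact command from the thread.

#!/bin/bash
# Sketch: render reference and candidate PDFs to PNGs, skip pixel-identical
# pages cheaply, and build negated difference images only for pages that
# actually differ.
set -e

gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 \
   -sOutputFile=ref-%03d.png reference.pdf
gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 \
   -sOutputFile=new-%03d.png candidate.pdf

for ref in ref-*.png; do
    new=new-${ref#ref-}
    if cmp -s "$ref" "$new"; then
        continue                       # identical page, nothing to inspect
    fi
    diffimg=diff-${ref#ref-}
    composite -compose difference "$ref" "$new" "$diffimg"
    convert "$diffimg" -negate "$diffimg"   # mostly white, changed pixels dark
    echo "page differs: $ref (see $diffimg)"
done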
"Hartmut" == Hartmut Henkel
writes:
On Thu, 16 Mar 2006, Frank Küster wrote:
Reinhard Kotucha wrote:
<pagenumber> <md5sum of the bitmap file>
The bitmap files can be removed by the script when it is finished and standard UNIX tools can be used to examine the output files.
Particularly, diff(1) can be used efficiently. It will tell you the numbers of the pages which are different.
That's a very good suggestion, thanks!
this looks pretty fragile to me. Characters will end up in bitmaps with interpolated gray pixels, and so it depends not only on pdftex but also on any subtlety of the rendering engine.
No, not every ghostscript output device does antialiasing. Usually antialiasing is done for screen rendering only. And even there you can use -sDEVICE=x11 instead of x11alpha. You can try faxg3 or pcxmono or something like that.
And if the md5sum doesn't match, you know nothing without the original file. Maybe some crosscorrelation between images with some given tolerance limit would be safer.
...and you don't know anything either. The question was whether files are identical, not similar. This can be achieved as I described, given that the bitmaps are produced at a reasonably high resolution (and it does not matter whether antialiasing is turned on). Of course, you always have to use the same version of the program which produces the bitmaps. If you want to see the differences, you need a program which displays all pixels which are different in two bitmap files. But I suppose that you want to check whether two bitmaps are different before you use such a tool.
pnmpsnr: Y color component: 59.18 dB
Well, it just tells you that the files are different; the actual value does not provide any useful information. I think two tools are needed: one which tells you which pages are different, and one which makes the changes visible.

Regards, Reinhard
there are a few places where pdftex takes the current time into account to generate certain tags. Implementing a primitive to leave out date-related things is not difficult. Can you please explain further what it would be useful for?

Also note that even when such a primitive is available, the chance that you get the same output for a given input file is rather low. You need to ensure that *all* relevant files (like font-related stuff, macros, figures etc.) are also the same.

Thanh

On Wed, Mar 15, 2006 at 02:07:16PM +0100, Frank Küster wrote:
Hi,
we (i.e. a couple of Debian developers, taking up old ideas of each individual and the tests in http://tug.org/svn/texlive/trunk/Master/support/tests/) are trying to implement some automated testing of pdf and dvi creation by pdftex. Possible applications are regression tests for the binary, for distributions (changed font setup etc.), or for package authors ("does the new version still cooperate with hyperref?").
For this, it would be great if it were possible to create identical pdf files in subsequent runs of pdfTeX. With
\pdfinfo{/CreationDate (1980-09-09)} \year=1980 \month=9 \day=9 \time=10 \pdfcompresslevel0
I get pdf files that can be compared, but only after some grepping:
egrep -a -v '/BaseFont|/FontName|/UniqueID|/ID|/CreationDate' $< > $@
With dvipdfmx (and perhaps with newer pdfTeX versions than the one in teTeX) /CMap or something like this needs to be excluded, too.
Is it possible to achieve identical pdf files directly, by adding the proper commands, or would it be possible to add this feature?
TIA, Frank
On Thu, 16 Mar 2006, The Thanh Han wrote:
You need to ensure that *all* relevant files (like font-related stuff, macro, figure etc.) are also the same.
...which also means, in the case of embedded PDFs, that they come from the same path; see e.g.:

/PTEX.FileName (./example.pdf)

Regards, Hartmut
The Thanh Han
there are a few places where pdftex takes the current time into account to generate certain tags. Implementing a primitive to leave out date-related things is not difficult. Can you please explain further what it would be useful for?
1. Regression tests for package writers. I have already set up a simple version of this in one or two LaTeX packages of mine, and it was very helpful when I decided to make big changes to the implementation without changing the interface. I always wanted to generalize this, but never got around to it.

2. Regression testing for TeX distributions or, e.g., rpm or deb packages of texlive or teTeX for some Linux distribution. When internals change (like rearranging TEXMF trees, putting fonts in other packages, changing dependencies, etc.), the documents should not change. This is actually the reason why we are now talking about this: there are a couple of Debian developers who invest a lot of effort in automated package tests, and one of them asked whether such a thing was possible in teTeX. We started discussing, and now I'm here.

3. Regression testing in pdfTeX: A change that only aims at e.g. performance enhancements should not alter the document. Many other changes of course will change it.
Also note that even when such a primitive is available, the chance you get the same output for a given input file is rather low. You need to ensure that *all* relevant files (like font-related stuff, macro, figure etc.) are also the same.
That means that such tests would have to be done in a "controlled environment". That's easy for application 2, and probably not too hard for 1. For 3, it's not trivial - I don't think you can run the tests after compiling, as other programs (e.g. perl) do, and hope they will succeed on any system. But if you recreate the known-good data on *your* system just before applying a problematic patch, and test afterwards, that should work.

Regards, Frank
3. Regression testing in pdfTeX: A change that only aims at e.g. performance enhancements should not alter the document. Many other changes of course will change it.
i think that you must define different levels of similarity:

- pure text: in that case a bitmap, as already discussed, is needed
- functionality (annotations and such): that needs to take place at the pdf level, i.e. filtering resources and descriptions and comparing them (e.g. annotation names, rectangles, etc.)
- font resources

i can imagine that your test of text similarity is run with disabled interactive features

the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)

Hans
On 2006-03-16 22:39:10 +0100, Hans Hagen wrote:
the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)
I still think that this information can be found with \tracingoutput etc.

Best
Martin
On Thu, Mar 16, 2006 at 11:06:57PM +0100, Martin Schröder wrote:
On 2006-03-16 22:39:10 +0100, Hans Hagen wrote:
the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)
I still think that this information can be found with \tracingoutput etc.
TeX's \tracing output is a very bad source for automatic comparisons:
TeX just prints the formatted result as text:
* Funny linebreaks.
* Sometimes the output is limited and marked by "etc.".
* Information is lost: the .log file doesn't tell the catcode of character tokens; you cannot even see what a token is:
\def\msg#{\immediate\write16}
\def\testA{\foobar}
\msg{\string\testA: [\meaning\testA]}
\edef\testB{\string\foobar\space}
\msg{\string\testB: [\meaning\testB]}
\msg{%
\noexpand\testA and \noexpand\testB
\ifx\testA\testB
are equal%
\else
differ%
\fi
.%
}
\end
\testA expands to one command token; the expansion of \testB consists of seven other tokens and one space token. But the output in the .log file is equal in both cases.
* Semantics is lost. The .log file must be analyzed to get some of the semantics back: what is a box listing, what is a macro expansion, ...
Thus it would be nice to have a debug/tracing file where the output is written as XML with semantics. Then macro definitions, box contents, ... could be compared more easily, and the differences could be detected and marked more easily.

For debugging and analyzing packages it would be useful to have a command that generates a kind of snapshot, e.g. all macros with their meanings or the status of registers. Then two stages, e.g. before and after package loading or some action, could be compared: which macros or registers are different, ... This could be used to show the wanted effect and to detect unwanted side effects, ...
To retain compatibility, the tracing/debug output could go into a separate debug file, driven by additional commands: counterparts of the \tracing commands that write into the new debug file, plus additional commands such as the snapshot commands, e.g.
\xtracingoutputfile{\jobname-debug.xml}
\xtracingmacros=1
\xtracing...
\xsnapshot{count}% all count registers <> 0
\xsnapshot{macro}% all defined command tokens with meaning
...
The result would be written to \jobname-debug.xml and could then be processed further: filtering, statistics, displaying comparisons, ...
Yours sincerely
Heiko
On 3/16/06, Hans Hagen wrote:
3. Regression testing in pdfTeX: A change that only aims at e.g. performance enhancements should not alter the document. Many other changes of course will change it.
i think that you must define different levels of similarity:
- pure text: in that case a bitmap, as already discussed, is needed
- functionality (annotations and such): that needs to take place at the pdf level, i.e. filtering resources and descriptions and comparing them (e.g. annotation names, rectangles, etc.)
- font resources
If we focus on differences between pdf documents and forget for the moment that we are using pdftex, these levels apply to a much wider class of documents. I suspect it is too early to know what sort of differences will occur, so the first step is to find tools to analyze differences between PDF documents. Many people who have never heard of pdftex would like to be able to compare PDFs and analyze the contents, so if you create a good initial framework there will be lots of help.

I've never used the commercial preflight tools, but based on the 4th-hand reports I get (printer operator --> cust. service --> editor --> me) I gather that the commercial tools can provide pretty detailed reports (e.g., a figure on page N has yellow in the "black" annotation text).

I'm not sure that rasterizing is the best approach. What about a ghostscript device that generates various debugging summaries? Some things that would be useful:

1. a list of objects with md5sums and the page numbers where each object is used -- this would be used to focus in on areas where documents differ. If files differ on a small number of pages then comparing bitmaps might be useful and economical of resources.

2. a human-readable summary of a specified object

We already have tools to pull out specified pages and extract text.
i can imagine that your test of text similarity is run with disabled interactive features
the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)
It would be better to focus on things that can be used with other pdf-generating applications. Many pdftex documents have pdfs from sources other than pdftex embedded. It would be worth thinking about ways to add metadata to pdfs to help the analyzers, but if we rely on the pdftex logs then we have to develop all the tools ourselves, and if the difference is in an input pdf our tools may not work. If we produce a good framework, I suspect the larger community of pdf users will become engaged and accomplish much more than the pdftex community can afford.
--
George N. White III
participants (8)

- Frank Küster
- George N. White III
- Hans Hagen
- Hartmut Henkel
- Heiko Oberdiek
- Martin Schröder
- Reinhard Kotucha
- The Thanh Han