Re: [NTG-pdftex] Creating identical PDF files with different pdfTeX runs

20 Mar 2006

On 3/16/06, Hans Hagen wrote:
...
...
3. Regression testing in pdfTeX: A change that only aims at
   e.g. performance enhancements should not alter the document.  Many
   other changes of course will change it.
i think that you must define different levels of similarity:
- pure text: in that case a bitmap as already discussed is needed
- functionality (annotations and such): that needs to take place at the
pdf level, i.e. filtering resources and descriptions and comparing them
(e.g. annotation names, rectangles, etc.)
- font resources
If we focus on differences between PDF documents and forget for the
moment that we are using pdftex, these levels apply to a much wider
class of documents.  I suspect it is too early to know what sort of
differences will occur, so the first step is to find tools to analyze
differences between PDF documents.  Many people who have never heard
of pdftex would like to be able to compare PDFs and analyze their
contents, so if you create a good initial framework there will be lots
of help.  I've never used the commercial preflight tools, but based on
the fourth-hand reports I get (printer operator --> cust. service -->
editor --> me) I gather that the commercial tools can provide pretty
detailed reports (e.g., figure on page N has yellow in the "black"
annotation text).  I'm not sure that rasterizing is the best approach.
What about a Ghostscript device that generates various debugging
summaries?  Some things that would be useful:

1. list of objects with md5sums and page numbers where the object is
used -- this would be used to focus in on areas where documents
differ.   If files differ on a small number of pages then comparing
bitmaps might be useful and economical of resources.

2.  human readable summary of a specified object
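
As a rough illustration of item 1, here is a minimal Python sketch that
digests each object and lists the ones that differ between two files.
It is not a Ghostscript device and not an existing tool: the function
names are made up for this example, it assumes objects are stored
directly in the file body (no object streams), and it does not track
which pages use an object.  A naive scan like this can also be fooled
by binary stream data that happens to contain "endobj".

```python
import hashlib
import re

# Naive scan for "N G obj ... endobj"; assumes objects are not packed
# into object streams and that stream data does not contain "endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def object_digests(pdf_bytes):
    """Map (object number, generation) to the MD5 hex digest of its body."""
    digests = {}
    for match in OBJ_RE.finditer(pdf_bytes):
        key = (int(match.group(1)), int(match.group(2)))
        # A later definition of the same object replaces an earlier one.
        digests[key] = hashlib.md5(match.group(3).strip()).hexdigest()
    return digests


def differing_objects(pdf_a, pdf_b):
    """Return the sorted object keys whose digests differ or exist in one file only."""
    a = object_digests(pdf_a)
    b = object_digests(pdf_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

The list returned by differing_objects could then be used to decide
which objects (and, with more work, which pages) deserve a closer look
or a bitmap comparison.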

We already have tools to pull out specified pages and extract text.
...
i can imagine that your test of text similarity is run with disabled
interactive features
the second one could be an add-on for pdftex: a special log mode, where
pdftex writes a file with all annotations (name, page, rectangle, maybe
also the whole dict) and a second one which lists all the used fonts,
encoding files, map lines and glyphs (encoding subset)
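
Whatever ends up producing such an annotation list (a pdftex log mode or
an external extractor), the comparison step might look roughly like the
Python sketch below.  The record layout (name, page, rectangle) follows
the fields mentioned above; the Annotation type, the function name and
the tolerance value are assumptions made for this example and do not
correspond to any existing pdftex output format.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical record: one annotation as it might appear in a log."""
    name: str
    page: int
    rect: tuple  # (llx, lly, urx, ury)


def compare_annotations(old, new, tol=0.01):
    """Return human-readable differences between two annotation lists."""
    old_by_name = {a.name: a for a in old}
    new_by_name = {a.name: a for a in new}
    report = []
    for name in sorted(old_by_name.keys() | new_by_name.keys()):
        a = old_by_name.get(name)
        b = new_by_name.get(name)
        if a is None:
            report.append(f"{name}: only in new")
        elif b is None:
            report.append(f"{name}: only in old")
        elif a.page != b.page:
            report.append(f"{name}: page {a.page} -> {b.page}")
        elif any(abs(x - y) > tol for x, y in zip(a.rect, b.rect)):
            # Coordinates may differ by rounding noise; ignore tiny shifts.
            report.append(f"{name}: rectangle moved")
    return report
```

The tolerance is there so that rounding differences between runs are
not reported as changes; whether 0.01 is a sensible value would have to
be decided from real output.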
It would be better to focus on things that can be used with other PDF
generating applications.  Many pdftex documents embed PDFs from
sources other than pdftex.  It would be worth thinking about
ways to add metadata to PDFs to help the analyzers, but if we rely on
the pdftex logs then we have to develop all the tools ourselves, and if
the difference is in an input PDF our tools may not work.  If we
produce a good framework I suspect the larger community of PDF users
will become engaged and accomplish much more than the pdftex
community can afford.

--
George N. White III 
Head of St. Margarets Bay, Nova Scotia