On 3/16/06, Hans Hagen
3. Regression testing in pdfTeX: A change that only aims at e.g. performance enhancements should not alter the document. Many other changes of course will change it.
i think that you must define different levels of similarity:
- pure text: in that case a bitmap, as already discussed, is needed
- functionality (annotations and such): that needs to take place at the pdf level, i.e. filtering resources and descriptions and comparing them (e.g. annotation names, rectangles, etc.)
- font resources
If we focus on differences between PDF documents and forget for the moment that we are using pdftex, these levels apply to a much wider class of documents. I suspect it is too early to know what sort of differences will occur, so the first step is to find tools to analyze differences between PDF documents. Many people who have never heard of pdftex would like to be able to compare PDFs and analyze their contents, so if you create a good initial framework there will be lots of help.

I've never used the commercial preflight tools, but based on the 4th hand reports I get (printer operator --> cust. service --> editor --> me) I gather that the commercial tools can provide pretty detailed reports (e.g., figure on page N has yellow in the "black" annotation text).

I'm not sure that rasterizing is the best approach. What about a Ghostscript device that generates various debugging summaries? Some things that would be useful:

1. a list of objects with md5sums and the page numbers where each object is used -- this would be used to focus in on areas where documents differ. If files differ on a small number of pages, then comparing bitmaps of just those pages might be useful and economical of resources.
2. a human-readable summary of a specified object

We already have tools to pull out specified pages and extract text.
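To make item 1 concrete, here is a rough sketch in Python of what such an inventory and comparison might look like. It assumes some other tool has already pulled each object out of the PDF as raw bytes together with the pages that use it; the function names `inventory` and `pages_that_differ` and the input layout are made up for illustration and do not correspond to any existing tool.

```python
import hashlib
from collections import defaultdict


def inventory(objects):
    """Build a map: md5 digest -> sorted list of pages using that object.

    `objects` is an iterable of (data, pages) pairs, where `data` is the
    object's bytes and `pages` is an iterable of page numbers.
    (Hypothetical input format; extraction is assumed to happen elsewhere.)
    """
    inv = defaultdict(set)
    for data, pages in objects:
        inv[hashlib.md5(data).hexdigest()].update(pages)
    return {digest: sorted(pages) for digest, pages in inv.items()}


def pages_that_differ(inv_a, inv_b):
    """Return the sorted page numbers worth a closer look.

    A page is flagged if it uses an object whose digest appears in only
    one document, or if a shared object is used on different pages.
    """
    pages = set()
    for digest in inv_a.keys() | inv_b.keys():
        used_a = set(inv_a.get(digest, ()))
        used_b = set(inv_b.get(digest, ()))
        # Symmetric difference: pages where only one document uses it.
        pages.update(used_a ^ used_b)
    return sorted(pages)


# Two toy "documents": only the image on page 2 has changed.
doc_a = [(b"font", [1, 2]), (b"image-v1", [2]), (b"text-3", [3])]
doc_b = [(b"font", [1, 2]), (b"image-v2", [2]), (b"text-3", [3])]
print(pages_that_differ(inventory(doc_a), inventory(doc_b)))
```

Only the flagged pages would then need the more expensive bitmap comparison. Real PDF objects would probably need normalizing first (object numbers, timestamps, compression) before their digests are comparable, which this sketch ignores.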
i can imagine that your test of text similarity is run with disabled interactive features
the second one could be an add-on for pdftex: a special log mode, where pdftex writes a file with all annotations (name, page, rectangle, maybe also the whole dict) and a second one which lists all the used fonts, encoding files, map lines and glyphs (encoding subset)
It would be better to focus on things that can be used with other
PDF-generating applications. Many pdftex documents embed PDFs from
sources other than pdftex. It would be worth thinking about ways to
add metadata to PDFs to help the analyzers, but if we rely on the
pdftex logs then we have to develop all the tools ourselves, and if
the difference is in an input PDF our tools may not work. If we
produce a good framework I suspect the larger community of PDF users
will become engaged and accomplish much more than the pdftex
community can afford.
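
As a rough illustration of a generator-independent check at the "functionality" level, the Python sketch below compares two sets of annotation records (name, page, rectangle). It assumes the records have already been extracted from the PDFs by some tool; the function name `compare_annotations`, the record layout, and the tolerance value are all invented for this example.

```python
def compare_annotations(old, new, tol=0.01):
    """Compare annotation records from two PDFs.

    `old` and `new` map an annotation name to (page, rect), where rect is
    (llx, lly, urx, ury) in PDF points. Hypothetical record layout.
    Returns a list of (name, description) pairs, sorted by name.
    """
    report = []
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            report.append((name, "missing in new"))
        elif name not in old:
            report.append((name, "added in new"))
        else:
            (page_a, rect_a), (page_b, rect_b) = old[name], new[name]
            if page_a != page_b:
                report.append((name, "page changed"))
            elif any(abs(x - y) > tol for x, y in zip(rect_a, rect_b)):
                # Ignore differences below `tol` (rounding noise).
                report.append((name, "rectangle moved"))
    return report


old = {
    "toc": (1, (72.0, 700.0, 200.0, 712.0)),
    "ref1": (3, (72.0, 500.0, 150.0, 512.0)),
}
new = {
    "toc": (1, (72.0, 700.004, 200.0, 712.0)),  # within tolerance
    "ref1": (4, (72.0, 500.0, 150.0, 512.0)),   # moved to another page
    "ref2": (4, (72.0, 480.0, 150.0, 492.0)),   # new annotation
}
for name, what in compare_annotations(old, new):
    print(name, what)
```

Because the input is just a table of records, it would not matter whether they came from a pdftex log, a Ghostscript device, or any other extractor.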
--
George N. White III