On 1/27/2019 10:00 PM, Ross Moore wrote:
PDFs are now editable, at least in Acrobat Pro.
weren't they always, given fonts being available?
So knowing which characters are available lets software determine easily whether an edit that changes or adds characters in a text block can be performed with the embedded font subset alone, or whether a font substitution is needed for that specific edit.
Of course it is preferable not to have to substitute, as this can change the metrics and hence make a noticeable difference to the visual appearance of that text block.
If you have ever tried to edit a PDF made by someone else (with TeX or Word or …) then you should have experienced how things can move around significantly within the same paragraph.
i never edit pdf documents (ok, i remember that once i had to strip stuff in order to get a logo, but not adding something) ... imo editing a pdf makes no sense (and reflow even less) ... also, with respect to fonts, editing assumes all glyphs being present, and with open type fonts one also enters a feature mess, as gsub/gpos are not embedded ... also, editing contradicts archiving
At what level?
primitive (one can omit cidsets and charsets) and i added a setter at the lua end (there was already one for cidsets)
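for reference, a minimal sketch of that control, assuming the new charset setter mirrors the existing cidset names:

    % omit the CIDSet / CharSet entries from the FontDescriptors
    \pdfvariable omitcidset  = 1
    \pdfvariable omitcharset = 1  % the newly added one

    % or from the lua end:
    \directlua{
      pdf.setomitcidset(1)   % was already there
      pdf.setomitcharset(1)  % new setter, name assumed to match
    }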
Can it be done on a font-by-font basis? That would be ideal.
hm, in principle that can be implemented but i don't think that will happen (also, when one uses so-called wide fonts there are no charsets, because the type 1 font becomes a sort of simple opentype)
If it is just a command-line option when calling lualatex, then that is workable. Essentially it would require a user to have done a preflight check and found that one of the fonts has a CharSet problem. Then they rerun with the option set, to get a valid PDF/A-2 (or A-3) document.
same control as pdftex: primitives
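so, hedging on the exact name, the workflow sketched above would come down to one line in the preamble once a preflight check has flagged a CharSet problem (the primitive is assumed to be \pdfomitcharset, by analogy with the luatex variable):

    % drop the (possibly incomplete) CharSet entries for the
    % whole run, trading later editability for PDF/A validity
    \pdfomitcharset=1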
It would affect all the Type 1 fonts, not just one of them. The ability (described above) to later edit the PDF would be lost pretty much entirely.
how many people will keep using type 1 fonts ... (i only use a few that i only have in type 1, like optima nova, but even that one is used as a wide font)
My understanding of the code in writefont.c is that the Font Descriptor dictionary is constructed (and written) as a complete object before the font subset itself is constructed. For the CharSet, the entries in gl_tree are used, based upon a list of the characters explicitly used from that font. This does *not* include implicitly used glyphs, such as /grave (and perhaps /a) pulled in by /agrave.
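To make that concrete, here is a hypothetical FontDescriptor of the kind writefont.c writes, for a subset where only /agrave was typeset explicitly; object numbers and names are made up for illustration:

    12 0 obj
    << /Type /FontDescriptor
       /FontName /ABCDEF+SomeFont   % made-up subset tag
       /Flags 34
       /FontFile 13 0 R             % the Type 1 subset itself
       % (other required keys omitted)
       % only explicitly used characters get recorded:
       /CharSet (/agrave)
       % yet the subsetted charstring for /agrave pulls in the
       % base glyphs, so a validator expects:
       %   /CharSet (/a/agrave/grave)
    >>
    endobj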
which is why you use tounicode -)
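e.g. the usual pair in pdftex-based runs, so that extraction does not have to guess from glyph names:

    % attach /ToUnicode CMaps to the embedded fonts
    \input glyphtounicode
    \pdfgentounicode=1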
It was such a circumstance that initiated this conversation roughly a year ago. I looked at solutions like writing the accent characters in white, outside the page boundaries, as an /Artifact say. But this begets a range of difficulties, and could potentially affect the pagination or typesetting, and can fail other accessibility checks. I want to develop reliable means to construct documents simultaneously for both Archivability and Accessibility.
in luatex (and probably also in pdftex) the font id (instance) is often the font in the text stream, and as it has specific widths it gets a dictionary with a few properties, referring to a parent font that is shared; the question is what pdftex does when more than 255 glyphs are referred to from one type 1 font, but i guess that this doesn't happen often in pdftex usage (one can try to include the full ec, texnansi, qx, some vietnamese pagella fonts and see what happens)
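a sketch of that experiment (plain pdftex; the tfm names follow the tex-gyre convention, adjust to what is installed):

    % reference all 256 slots of one Type 1 family (TeX Gyre
    % Pagella) through four encodings, so well over 255 distinct
    % glyphs of the same font file end up being used
    \font\pagec   = ec-qplr        % ec (T1)
    \font\pagetxn = texnansi-qplr  % texnansi (LY1)
    \font\pageqx  = qx-qplr        % QX
    \font\pagevn  = t5-qplr        % vietnamese (T5)

    \newcount\slot
    \def\sampler#1{\begingroup #1\slot=0
      \loop \char\slot \advance\slot by 1 \ifnum\slot<256 \repeat
      \endgroup\par}

    \sampler\pagec \sampler\pagetxn \sampler\pageqx \sampler\pagevn
    \bye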
From the veraPDF link that Reinhard provided, it seems that presence is checked with PDF/A-1, but not accuracy.
my impression is that fonts are never validated (there are all kinds of properties that one needs to keep with font objects, so that is a special kind of check) ... viewers of course can complain, but even then, i had cases where acrobat complained and showed nothing while mupdf did, and vice versa (private dict stuff and so on)
But for PDF/A-2 and 3, there is a more detailed check for accuracy.
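For example, the veraPDF command-line tool can be pointed at a specific profile (an illustrative invocation, not a tested one):

    verapdf --flavour 2b document.pdf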
Perhaps true for viewers; but PDFs are becoming about *more* than just the visual view. We want to be providing the structures required for accurate text extraction and editing. TeX was never designed with this in mind, but because of its programmability this is something that should be achievable.
sure, and it has been ... but even with that, embedding a structured source for processing makes more sense to me (which of course is not what publishers want) (more accurate would be: this is not what macro packages and the tex way of entering content are designed for, nor what users have in mind; tex itself can do pretty much anything)

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------