New subject: \pdfmatch (was: Patch EscapeAndOther)

25 Jun 2005

      Hello,

I have uploaded the patch EscapeAndOther, #375.
The patch is based on pdftex-1.30.0-rc1.
It supersedes patch EscapeDivers.tar.bz2, #371.

Inherited from patch EscapeDivers:
  Expandable: \pdfstrcmp, \pdfescapestring, \pdfescapename
  Added: \pdfescapehex, \pdfunescapehex
  Fix: \pdfescapename: '%' is also a delimiter that
       needs escaping.

New in patch EscapeAndOther:
  * \pdfcreationdate
  * \pdffilemoddate
  * \pdffilesize
  * \pdfmdfivesum
  * \pdffiledump
  * \pdfshellescape
  * \pdfmatch and \pdflastmatch
  * Start of a bug fix: quotes are legal in file names on
    operating systems other than Windows.

Syntax:

  %% expandable commands:
  \pdfstrcmp <general text> <general text>
  \pdfescapestring <general text>
  \pdfescapename <general text>
  \pdfescapehex <general text>
  \pdfunescapehex <general text>
  \pdfcreationdate
  \pdffilemoddate <general text>
  \pdffilesize <general text>
  \pdfmdfivesum <file spec> | <general text>
  \pdfmatch <match options> <general text> <general text>
  <match options> := [icase] [subcount <number>]
  \pdflastmatch <number>

  %% read-only integers
  \pdfshellescape

Common for the following primitives:
* The result string is given by characters with
  catcode "other" (12), only the space has
  catcode "space" (10). The commands follow
  the tradition of \meaning, \string, ...
* The argument <general text> is expanded before use.
  It follows the tradition of \special, \message,
  \pdfobj, ...

Description:

\pdfstrcmp{<a>}{<b>}
  Compares two strings and returns the strings
    "0" if <a> equals <b>
    "-1" if <a> is less than <b>
    "1" if <a> is greater than <b>
  Use example:
    \ifcase\pdfstrcmp{abc}{def}\relax
      \message{abc = def}
    \or
      \message{abc > def}
    \else
      \message{abc < def}
    \fi
  Alternative:
   Implementing it as a read-only integer would make
   its use in \ifcase, \ifnum, ... safer:
       \ifcase\pdfstrcmp{xyz}{abc}1\or 2\else 3\fi
   expands to 3; as a read-only integer it would expand
   to the expected 2.
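
  The difference between the two designs can be modelled
  outside TeX. The following is only a sketch in Python, not
  part of the patch; strcmp3 and ifcase are hypothetical
  helper names, and the comparison is assumed to be bytewise.

```python
def strcmp3(a: str, b: str) -> int:
    # Three-way comparison: -1, 0 or 1 (hypothetical model of \pdfstrcmp).
    return (a > b) - (a < b)


def ifcase(number: int, cases: list, otherwise: str) -> str:
    # Model of: \ifcase <number> case0 \or case1 ... \else otherwise \fi
    return cases[number] if 0 <= number < len(cases) else otherwise


result = strcmp3("xyz", "abc")  # 1

# Expandable variant: the digit produced by the primitive runs into the
# following "1" of the first case, so TeX reads the number 11 and the
# first case is left empty.
as_string = ifcase(int(str(result) + "1"), ["", "2"], "3")

# Read-only integer variant: the number ends at the primitive itself.
as_integer = ifcase(result, ["1", "2"], "3")

print(as_string, as_integer)
```

  With the expandable variant a terminating \relax, as in the
  use example above, avoids the problem.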

\pdfescapestring{<a>}
  Escapes the string <a> so that it can be used as a PDF string.
  '(', ')', '\' are escaped along with the
  control and 8-bit characters.
  Use example:
    \pdfinfo{/Title (\pdfescapestring{...})}
    \special{ps: [ /Title (\pdfescapestring{...}) /DOCINFO pdfmark}
  Alternative:
    Perhaps 8-bit characters don't need escaping.
    Whitespace, especially newlines should be escaped, because
    of the use for latex/dvips to avoid recoding problems
    (<LF> -> <CR><LF>).

\pdfescapename{<a>}
  Escapes the string <a> so that it can be used as a PDF name.
  Whitespace, delimiters, '#' are escaped along with
  the control and 8-bit characters, as recommended by the spec.
  Use example:
    \pdfobj stream attr{/Type/EmbeddedFile%
      /Subtype/\pdfescapename{text/plain; charset=iso-8859-1}%
      ...
    } file {...}
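
  The escaping rule can be sketched in Python. This is only a
  model based on the name syntax of the PDF specification, not
  the code of the patch; escape_name is a hypothetical name
  and the uppercase hex digits are an assumption.

```python
# Delimiter characters of the PDF syntax; '%' is included, see the fix above.
DELIMITERS = b"()<>[]{}/%"


def escape_name(text: str) -> str:
    # Replace every byte that is unsafe in a PDF name by '#' plus two hex digits.
    out = []
    for byte in text.encode("latin-1"):
        unsafe = (
            byte < 0x21            # control characters and whitespace
            or byte > 0x7E         # DEL and 8-bit characters
            or byte in DELIMITERS
            or byte == ord("#")    # the escape character itself
        )
        out.append(f"#{byte:02X}" if unsafe else chr(byte))
    return "".join(out)


print(escape_name("text/plain; charset=iso-8859-1"))
```

  For the subtype of the use example this gives
  text#2Fplain;#20charset=iso-8859-1.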

\pdfescapehex{<a>}, \pdfunescapehex{<b>}
  String <a> is converted to uppercase hexadecimal
  representation, <b> is converted back.
  Use example:
    \pdfescapehex{Hello} is converted to 48656C6C6F
    \pdfinfo{/Title <\pdfescapehex{Hello}>}
  It can also be used to write strings to auxiliary
  files and reread them later without worrying about
  catcodes or unmatched curly braces.
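
  The round trip can be modelled in Python. This is a sketch
  only; escape_hex and unescape_hex are hypothetical names,
  and the input is assumed to be a string of 8-bit characters.

```python
def escape_hex(text: str) -> str:
    # Uppercase hexadecimal representation, two digits per byte.
    return text.encode("latin-1").hex().upper()


def unescape_hex(hexstr: str) -> str:
    # Inverse operation: two hex digits become one character again.
    return bytes.fromhex(hexstr).decode("latin-1")


# A string that would be awkward to write to an auxiliary file directly:
# a percent sign, an unmatched brace and a backslash.
tricky = "50% {unbalanced \\macro"
stored = escape_hex(tricky)

print(escape_hex("Hello"))
print(unescape_hex(stored) == tricky)
```

  The stored form consists only of the characters 0-9 and A-F,
  which is why catcodes do not matter when it is read back.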

\pdfcreationdate
  It expands to the date string that pdfTeX uses in
  the info dict as default.
  Rationale:
  * It provides seconds and especially the time zone.
  * Setting of /M date for annotations.
  * Because of the complicated change file structure
    of the sources it is not easy to synchronize
    the creation date with \year, \month, \day
    and \time. Thus \pdfcreationdate can be
    used to set these registers to the same
    values.
  Example:
    \pdfcreationdate expands to D:20050625015605+02'00'
    \pdfannot ...{/Subtype /FileAttachment
      /M (\pdfcreationdate) ...}
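
  How the date string maps to the values of \year, \month,
  \day and \time can be sketched in Python. The function name
  parse_pdf_date is hypothetical, and only the full format of
  the example (with time zone offset) is handled.

```python
import re

# D:YYYYMMDDHHmmSS followed by the time zone offset, e.g. +02'00'
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-]\d{2}'\d{2}')"
)


def parse_pdf_date(date: str) -> dict:
    # Split a PDF date string into the values of the TeX date registers.
    m = PDF_DATE.fullmatch(date)
    if m is None:
        raise ValueError("unsupported date string: " + date)
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    return {
        "year": year,
        "month": month,
        "day": day,
        "time": hour * 60 + minute,  # \time counts minutes since midnight
        "second": second,            # not available in any TeX register
        "zone": m.group(7),
    }


print(parse_pdf_date("D:20050625015605+02'00'"))
```

  For the example string this yields year 2005, month 6,
  day 25 and time 116 (01:56).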

\pdffilemoddate{<file>}
  It expands to the modification date of <file> in the
  same format as \pdfcreationdate (PDF date format).
  On error it returns the empty string.
  Rationale:
  * File embedding: the date is shown in the attachment tab.
  * "Make feature": the dates of two files can be compared
    to check which file is newer.
    Example: pdfTeX does not support EPS files; epstopdf.sty
    converts them to PDF either always or only if the PDF
    variants do not exist. Neither way is satisfactory: either
    the time penalty is large or PDF files are embedded that
    are out of date. \pdffilemoddate solves this problem.
  Error handling, see \pdffilesize
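
  The "make feature" decision can be sketched in Python. This
  is an illustration only; needs_rebuild is a hypothetical
  name. The sketch assumes that both date strings carry the
  same time zone offset, so that the string order equals the
  chronological order, and that a missing file is reported by
  the empty string as described above.

```python
def needs_rebuild(source_date: str, target_date: str) -> bool:
    # Decide whether the target must be regenerated from the source.
    if source_date == "":
        return False  # no source file, nothing to convert
    if target_date == "":
        return True   # target does not exist yet
    # Same fixed-width format and same offset: compare as strings.
    return source_date > target_date


eps = "D:20050625015605+02'00'"
pdf = "D:20050624120000+02'00'"
print(needs_rebuild(eps, pdf))
```

  In TeX the same string comparison could be done with
  \pdfstrcmp on the two results of \pdffilemoddate.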

I have not implemented a primitive for the file creation date
because of a portability issue: the "ctime" field of struct
"stat" is interpreted differently among operating systems:
* creation date, e.g. Windows
* inode change time, e.g. Unix

\pdffilesize{<file>}
  It expands to the size of <file> as string. On error it
  returns the empty string.
  Rationale:
  * File embedding: the size is shown in the attachment tab.
  * Sometimes it is useful to know if a file has size "0"
    (failed conversions, ...).
  Error handling:
    Currently the empty string is silently returned.
  Alternatives:
  * Stop with error message. But what can the user do?
    Error recovery is easy: no information available,
    thus return nothing.
  * Warning message. But the primitives could be used
    for checks on file existence.
  * Return status in \pdfretval. I doubt a little whether
    this is really necessary. It is very easy to implement.
    But the documentation would grow by a large list of
    error codes with its own problems: much to explain to
    the users. How are they assigned?
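
  The current error handling (silently returning the empty
  string) can be modelled in Python. This is a sketch, not the
  C code of the patch; file_size and exists_and_nonempty are
  hypothetical names.

```python
import os


def file_size(path: str) -> str:
    # Size of the file as a decimal string, empty string on any error.
    try:
        return str(os.stat(path).st_size)
    except OSError:
        return ""


def exists_and_nonempty(path: str) -> bool:
    # Check built on top of the silent error handling.
    return file_size(path) not in ("", "0")


print(repr(file_size("no-such-file-for-this-sketch.xyz")))
```

  Because no error or warning is raised, the same call serves
  as a test for file existence and for failed conversions that
  left a file of size "0".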

\pdfmdfivesum{<abc>} or \pdfmdfivesum file {<file>}
  It calculates the md5 sum and converts it to
  uppercase hexadecimal format (same as \pdfescapehex).
  The syntax is a simplified \pdfobj: Either the
  data is given directly or in a file.
  Rationale:
  * File embedding: providing /CheckSum.
  * Also the md5 sums of auxiliary files could be stored
    and compared in order to display a rerun warning.
    (Of course, it can be possible that different files
    have the same checksum, but the same file does not
    have different checksums.)
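
  The rerun check can be sketched in Python with the standard
  hashlib module. The names md5_hex, md5_file and needs_rerun
  are hypothetical; only the uppercase hexadecimal output
  format is taken from the description above.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    # MD5 sum in uppercase hexadecimal, the same format as \pdfescapehex.
    return hashlib.md5(data).hexdigest().upper()


def md5_file(path: str) -> str:
    # MD5 sum of the contents of a file.
    with open(path, "rb") as fh:
        return md5_hex(fh.read())


def needs_rerun(old_sum: str, path: str) -> bool:
    # True if the auxiliary file changed since its sum was stored.
    return md5_file(path) != old_sum


print(md5_hex(b"abc"))
```

  Storing the sum of an auxiliary file at the start of a run
  and comparing it at the end is enough to decide whether a
  rerun warning should be shown.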

\pdfshellescape
  It is a read-only integer that is 1 if \write18 is
  enabled, 0 otherwise.
  Rationale:
  * I thought that \ifeof18 together with \pdftexversion
    could be used to implement a safe test for the
    \write18 feature. But I had to learn that I was wrong,
    see the thread in comp.text.tex: "Confused about
    pstricks and pdftricks":
    MiKTeX's pdfTeX does not implement \ifeof18.
    For implementing the test, only an obscure way
    via \pdftexbanner remains, but this way is not
    very reliable: the contents of \pdftexbanner
    are not well defined, it could be anything.

Quotes in file names:
* File name handling is quite chaotic. Quotes are removed
  by the scanner for \input, \open*. This behaviour is
  contradictory: to solve a problem with spaces, quotes
  are now forbidden.
* Inconsistencies:
  The syntax of \pdfobj and \pdfximage would allow
  any file name, but \pdfobj removes quotes; only
  \pdfximage uses the more intelligent way and removes
  quotes on Windows only.
Thus the patch fixes:
* append_to_name (used in pack_file_name): quotes are removed
  on Windows only. This fixes \pdfobj.
* utils.c: new function "makecfilename",
  used by \pdfximage (readimage), \pdffilesize, ...

\pdfmatch [icase] [subcount <number>] {<pattern>}{<string>}
  Implements pattern matching using the POSIX regex
  library (a standard library at least on my Linux system).
  It returns the same values as \pdfstrcmp, but
  with the following semantics:
    -1: error case (invalid pattern, ...)
     0: no match
     1: match found
  Options:
  * icase: case insensitive matching
  * subcount: it sets the table size for found subpatterns.
    A number "-1" resets the table size to the start default.

  See the manual page regex.3 and regex.7.

  The implementation shows a possible interface to
  pattern matching in TeX. Therefore only the basics
  are implemented.
  Flags:
  * REG_EXTENDED is set in the implementation.
  * REG_ICASE: can be set by user.
  * other: not implemented.

\pdflastmatch <number>
  The result of \pdfmatch is stored in an array.
  The entry "0" contains the match, the following
  entries submatches. The positions of the matches
  are also available. They are encoded in the following
  manner to avoid another primitive:
    <position> "->" <match string>
  "->" is used as separator in the tradition of \meaning.
  There exist macros for parsing the output of \meaning
  (e.g. in LaTeX: \strip@prefix).
  The position "-1" with an empty string indicates that
  this entry is not set.
  Example:
    \def\msg#{\immediate\write16 }
    \msg{\pdfmatch{(l+)o (W(o))}{Hello World}}
    \msg{\pdflastmatch0}
    \msg{\pdflastmatch1}
    \msg{\pdflastmatch2}
    \msg{\pdflastmatch3}
    \msg{\pdflastmatch4}
  Result:
    1
    2->llo Wo
    2->ll
    6->Wo
    7->o
    -1->

Alternative:
  PCRE (Perl-compatible regular expressions) is far more
  powerful: more options, named subpatterns, ...
  The license for version 4 was GPL compatible, since version 5
  it is BSD, the current version is 6.

  The TeX interface could be changed in the following way:
  * Addition: \pdflastmatchbyname <general text>
    It extracts matches for named subpatterns.
  * Options can be given by the same name as in the
    PCRE description:
    \pdfmatch anchored caseless ... {}{}
    For easier/faster scanning the options could be
    restricted to be given in sorted order.
  * Or options can be given by letters in any order
    in an additional argument:
      \pdfmatch{<pattern>}{<options>}{<string>}
      \pdfmatch{l+}{ai}{Hello World}
    The implementation could then use strchr to check,
    whether an option is set.
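
  The letter variant can be sketched in Python, where the "in"
  operator plays the role of strchr. The function name
  match_with_letters is hypothetical, and the meaning of the
  letters ("a" for anchored, "i" for caseless) is an
  assumption derived from the option names above.

```python
import re


def match_with_letters(pattern: str, options: str, string: str) -> bool:
    # Match with single-letter options given in any order.
    flags = re.IGNORECASE if "i" in options else 0
    regex = re.compile(pattern, flags)
    if "a" in options:
        # anchored: the match must start at the beginning of the string
        return regex.match(string) is not None
    return regex.search(string) is not None


print(match_with_letters("l+", "ai", "Hello World"))
print(match_with_letters("l+", "i", "Hello World"))
```

  The options need no sorting in this scheme, and an unknown
  letter is simply ignored by the sketch.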

Patch instructions for testing are given in the
patch description at sarovar.

Have fun
  Heiko 