Patch EscapeAndOther
Hello,
I have uploaded the patch EscapeAndOther, #375.
The patch is based on pdftex-1.30.0-rc1.
It supersedes patch EscapeDivers.tar.bz2, #371.
Inherited from patch EscapeDivers:
Expandable: \pdfstrcmp, \pdfescapestring, \pdfescapename
Added: \pdfescapehex, \pdfunescapehex
Fix: \pdfescapename: '%' is also a delimiter that
needs escaping.
New in patch EscapeAndOther:
* \pdfcreationdate
* \pdffilemoddate
* \pdffilesize
* \pdfmdfivesum
* \pdffiledump
* \pdfshellescape
* \pdfmatch and \pdflastmatch
* Start of a bug fix: quotes are legal in file names on
operating systems other than Windows.
Syntax:
%% expandable commands:
\pdfstrcmp <general text> <general text>
\pdfescapestring <general text>
\pdfescapename <general text>
\pdfescapehex <general text>
\pdfunescapehex <general text>
\pdfcreationdate
\pdffilemoddate <general text>
\pdffilesize <general text>
\pdfmdfivesum <file spec> | <general text>
\pdfmatch <match options> <general text> <general text>
<match options> := [icase] [subcount <number>]
\pdflastmatch <number>
%% read-only integers
\pdfshellescape
Common for the following primitives:
* The result string is given by characters with
catcode "other" (12), only the space has
catcode "space" (10). The commands follow
the tradition of \meaning, \string, ...
* The argument <general text> is expanded before use.
It follows the tradition of \special, \message,
\pdfobj, ...
Description:
\pdfstrcmp{<a>}{<b>}
Compares two strings and returns the strings
"0" if <a> equals <b>
"-1" if <a> is less than <b>
"1" if <a> is greater than <b>
Use example:
\ifcase\pdfstrcmp{abc}{def}\relax
\message{abc = def}
\or
\message{abc > def}
\else
\message{abc < def}
\fi
Alternative:
Implement it as a read-only integer; its use in
\ifcase, \ifnum, ... is then safer:
\ifcase\pdfstrcmp{xyz}{abc}1\or 2\else 3\fi
expands to 3 with the expandable version (the result "1" and
the following "1" are read as the number 11); as a read-only
integer it expands to the expected 2.
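The difference between the two variants can be modelled outside TeX. The following Python sketch only illustrates the described return values; the function name and the byte-wise comparison are assumptions and are not taken from the patch.

```python
def pdfstrcmp(a: str, b: str) -> str:
    """Return "-1", "0" or "1" as a string, like the expandable primitive."""
    x, y = a.encode("utf-8"), b.encode("utf-8")  # assumption: byte-wise comparison
    return str((x > y) - (x < y))

# Expandable (string) variant: the digit that follows the primitive is glued
# onto the result, so the number read after \ifcase would be 11.
as_string = int(pdfstrcmp("xyz", "abc") + "1")

# Read-only integer variant: the number is already complete, the following
# "1" is ordinary text and case 1 is selected.
as_integer = int(pdfstrcmp("xyz", "abc"))

print(as_string, as_integer)
```

In the TeX example above the same effect is avoided by terminating the number explicitly with \relax.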
\pdfescapestring{<a>}
Escapes the string <a> so that it can be used as a PDF string.
'(', ')', '\' are escaped along with the
control and 8-bit characters.
Use example:
\pdfinfo{/Title (\pdfescapestring{...})}
\special{ps: [ /Title (\pdfescapestring{...}) /DOCINFO pdfmark}
Alternative:
Perhaps 8-bit characters don't need escaping.
Whitespace, especially newlines should be escaped, because
of the use for latex/dvips to avoid recoding problems
(<LF> -> <CR><LF>).
\pdfescapename{<a>}
Escapes the string <a> so that it can be used as a PDF name.
Whitespace, delimiters and '#' are escaped along with
the control and 8-bit characters, as recommended by the spec.
Use example:
\pdfobj stream attr{/Type/EmbeddedFile%
/Subtype/\pdfescapename{text/plain; charset=iso-8859-1}%
...
} file {...}
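For readers who want to see what such escaping does to the MIME type above, here is a small Python model. It assumes the #XX notation that the PDF specification defines for name objects and an 8-bit (latin-1) input; the function name is hypothetical and the patch itself may differ in details.

```python
DELIMITERS = b"()<>[]{}/%"  # PDF delimiter characters, including '%'

def pdf_escape_name(text: str) -> str:
    """Escape text for use as a PDF name, using #XX for special bytes."""
    out = []
    for byte in text.encode("latin-1"):  # assumption: 8-bit input encoding
        # whitespace/control (< 33), DEL and 8-bit (> 126), '#', delimiters
        if byte < 33 or byte > 126 or byte == ord("#") or byte in DELIMITERS:
            out.append(f"#{byte:02X}")
        else:
            out.append(chr(byte))
    return "".join(out)

print(pdf_escape_name("text/plain; charset=iso-8859-1"))
# text#2Fplain;#20charset=iso-8859-1
```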
\pdfescapehex{<a>}, \pdfunescapehex{<b>}
String <a> is converted to uppercase hexadecimal
representation, <b> is converted back.
Use example:
\pdfescapehex{Hello} is converted to 48656C6C6F
\pdfinfo{/Title <\pdfescapehex{Hello}>}
It can also be used to write strings to auxiliary
files and reread them later without worrying about
catcodes or unmatched curly braces.
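The round trip can be modelled in a few lines of Python. This is only an illustration of the described conversion; the function names are hypothetical.

```python
def escape_hex(data: bytes) -> str:
    """Uppercase hexadecimal representation, two digits per byte."""
    return data.hex().upper()

def unescape_hex(text: str) -> bytes:
    """Inverse conversion; bytes.fromhex accepts upper and lower case."""
    return bytes.fromhex(text)

# A string that would be awkward to write verbatim into an auxiliary file.
tricky = b"unmatched { brace, % percent, \\ backslash"

print(escape_hex(b"Hello"))                         # 48656C6C6F
print(unescape_hex(escape_hex(tricky)) == tricky)   # True
```

The hex form contains only the characters 0-9 and A-F, which is why catcodes and brace balancing stop being a concern.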
\pdfcreationdate
It expands to the date string that pdfTeX uses in
the info dict as default.
Rationale:
* It provides seconds and especially the time zone.
* Setting of /M date for annotations.
* Because of the complicated change file structure
of the sources it is not easy to synchronize
the creation date with \year, \month, \day
and \time. Thus \pdfcreationdate can be
used to set these registers to the same
values.
Example:
\pdfcreationdate expands to D:20050625015605+02'00'
\pdfannot ...{/Subtype /FileAttachment
/M (\pdfcreationdate) ...}
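The date format shown above (D:YYYYMMDDHHmmSS followed by the time zone offset) can be reproduced with a short Python sketch. The function name is hypothetical, and the code only models the string format; it says nothing about how pdfTeX obtains the time.

```python
from datetime import datetime, timedelta, timezone

def pdf_date(moment: datetime) -> str:
    """Format an aware datetime as a PDF date string with time zone."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("an aware datetime (with tzinfo) is required")
    minutes = int(offset.total_seconds()) // 60
    sign = "+" if minutes >= 0 else "-"
    hours, mins = divmod(abs(minutes), 60)
    return f"D:{moment:%Y%m%d%H%M%S}{sign}{hours:02d}'{mins:02d}'"

# The moment from the example above, in a zone two hours east of UTC.
stamp = datetime(2005, 6, 25, 1, 56, 5, tzinfo=timezone(timedelta(hours=2)))
print(pdf_date(stamp))  # D:20050625015605+02'00'
```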
\pdffilemoddate{<file>}
It expands to the modification date of <file> in the
same format as \pdfcreationdate (PDF date format).
On error it returns the empty string.
Rationale:
* File embedding: the date is shown in the attachment tab.
* "Make feature": files can be compared, it can be checked,
which file is newer.
Example: pdfTeX does not support EPS files; epstopdf.sty
converts them to PDF either always or only if the PDF variants
do not exist. Neither way is satisfactory: either the
time penalty is large or PDF files are embedded that
are out of date. \pdffilemoddate solves this problem.
Error handling: see \pdffilesize.
I have not implemented a primitive for the file creation date
because of a portability issue: the "ctime" field of struct "stat" is
interpreted differently among operating systems:
* creation date, e.g. Windows
* inode change time, e.g. Unix
\pdffilesize{<file>}
It expands to the size of <file> as string. On error it
returns the empty string.
Rationale:
* File embedding: the size is shown in the attachment tab.
* Sometimes it is useful to know if a file has size "0"
(failed conversions, ...).
Error handling:
Currently the empty string is silently returned.
Alternatives:
* Stop with error message. But what can the user do?
Error recovery is easy: no information available,
thus return nothing.
* Warning message. But the primitives could be used
for checks on file existence.
* Return status in \pdfretval. I doubt a little whether
this is really necessary. It is very easy to implement,
but the documentation grows by a large list of
error codes with its own problems: there is much to explain
to the users, and how are the codes assigned?
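How the silent empty result and the "make feature" fit together can be sketched in Python. The function names are hypothetical, and the sketch compares raw modification times instead of PDF date strings; it is a model of the idea, not of the patch.

```python
import os

def file_size(path: str) -> str:
    """Size as a decimal string, or an empty string if stat fails."""
    try:
        return str(os.stat(path).st_size)
    except OSError:
        return ""  # silent error recovery, as described above

def needs_conversion(source: str, target: str) -> bool:
    """True if target is missing, empty, or older than source."""
    if file_size(target) in ("", "0"):  # missing file or failed conversion
        return True
    return os.stat(target).st_mtime < os.stat(source).st_mtime
```

A package in the style of epstopdf.sty could then convert only when such a check reports that the PDF is missing, empty, or older than the EPS file.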
\pdfmdfivesum{<abc>} or \pdfmdfivesum file {<file>}
It calculates the md5 sum and converts it to
uppercase hexadecimal format (same as \pdfescapehex).
The syntax is a simplified \pdfobj: Either the
data is given directly or in a file.
Rationale:
* File embedding: providing /CheckSum.
* Also the md5 sums of auxiliary files could be stored
and compared in order to display a rerun warning.
(Of course, it can be possible that different files
have the same checksum, but the same file does not
have different checksums.)
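The rerun check can be modelled as follows. This Python sketch uses hashlib and hypothetical function names; it only illustrates the uppercase hexadecimal result and the comparison of an old and a new checksum.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """MD5 sum in uppercase hexadecimal, 32 characters."""
    return hashlib.md5(data).hexdigest().upper()

def needs_rerun(old_sum: str, aux_bytes: bytes) -> bool:
    """True if the auxiliary data no longer matches the stored checksum."""
    return md5_hex(aux_bytes) != old_sum

stored = md5_hex(b"\\newlabel{sec:intro}{{1}{1}}\n")
print(md5_hex(b"abc"))  # 900150983CD24FB0D6963F7D28E17F72
print(needs_rerun(stored, b"\\newlabel{sec:intro}{{1}{2}}\n"))  # True
```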
\pdfshellescape
It is a read-only integer that is 1 if \write18 is
enabled, 0 otherwise.
Rationale:
* I thought that \ifeof18 together with \pdftexversion
could be used to implement a safe test for the \write18 feature.
But I had to learn that I was wrong, see the thread
in comp.text.tex: "Confused about pstricks and
pdftricks":
MiKTeX's pdfTeX does not implement \ifeof18.
For implementing the test, only an obscure way
via \pdftexbanner remains, but this way is not
very reliable: the contents of \pdftexbanner
are not well defined, they could be anything.
Quotes in file names:
* File name handling is quite chaotic. Quotes are removed
by the scanner for \input, \open*. This behaviour is
contradictory: to solve a problem with spaces, quotes
are now forbidden.
* Inconsistencies:
The syntax of \pdfobj and \pdfximage would allow
any file name, but \pdfobj removes quotes, only
\pdfximage uses the more intelligent way, it removes
quotes for windows only.
Thus the patch fixes:
* append_to_name (used in pack_file_name): quotes are removed
for windows only. This fixes \pdfobj.
* utils.c: new function "makecfilename",
used by \pdfximage (readimage), \pdffilesize, ...
\pdfmatch [icase] [subcount <number>] {<pattern>}{<string>}
Implements pattern matching using the POSIX regex
(a standard library at least in my linux).
It returns the same values as \pdfstrcmp, but
with the following semantics:
-1: error case (invalid pattern, ...)
0: no match
1: match found
Options:
* icase: case insensitive matching
* subcount: it sets the table size for found subpatterns.
A number "-1" resets the table size to the start default.
See the manual page regex.3 and regex.7.
The implementation shows a possible interface to
pattern matching in TeX. Therefore only the basics
are implemented.
Flags:
* REG_EXTENDED is set in the implementation.
* REG_ICASE: can be set by user.
* other: not implemented.
\pdflastmatch <number>
The result of \pdfmatch is stored in an array.
The entry "0" contains the match, the following
entries submatches. The positions of the matches
are also available. They are encoded in the following
manner to avoid another primitive:
<position> "->" <match string>
"->" is used as separator in the tradition of \meaning.
There exist macros for parsing the output of \meaning
(e.g. in LaTeX: \strip@prefix).
The position "-1" with an empty string indicates that
this entry is not set.
Example:
\def\msg#{\immediate\write16 }
\msg{\pdfmatch{(l+)o (W(o))}{Hello World}}
\msg{\pdflastmatch0}
\msg{\pdflastmatch1}
\msg{\pdflastmatch2}
\msg{\pdflastmatch3}
\msg{\pdflastmatch4}
Result:
1
2->llo Wo
2->ll
6->Wo
7->o
-1->
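The table layout can be reproduced with a small Python model. It uses Python's re module in place of POSIX regex (the two agree for this pattern), and the function name and the default table size of 10 are assumptions made for the illustration.

```python
import re

def pdfmatch(pattern, subject, icase=False, subcount=10):
    """Return (status, table) in the style described for the two primitives.

    status: -1 invalid pattern, 0 no match, 1 match found.
    table:  entries of the form "<position>-><match string>"; unset
            entries are "-1->".
    """
    table = ["-1->"] * subcount  # assumption: default table size of 10
    try:
        rx = re.compile(pattern, re.IGNORECASE if icase else 0)
    except re.error:
        return -1, table
    m = rx.search(subject)
    if m is None:
        return 0, table
    for i in range(min(subcount, rx.groups + 1)):
        if m.start(i) >= 0:  # group took part in the match
            table[i] = f"{m.start(i)}->{m.group(i)}"
    return 1, table

status, table = pdfmatch("(l+)o (W(o))", "Hello World")
print(status)
for entry in table[:5]:
    print(entry)
```

Entry 4 stays unset because the pattern has only three subpatterns. With this encoding a caller can also find the number of set entries by reading the table until the first "-1->".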
Alternative:
PCRE (Perl-compatible regular expressions) is far more
powerful. More options, named subpattern, ...
License for 4.x was GPL compatible, since 5.0 it is BSD;
the current version is 6.x.
The TeX interface could be changed in the following way:
* Addition: \pdflastmatchbyname <general text>
It extracts matches for named subpattern.
* Options can be given by the same name as in the
PCRE description:
\pdfmatch anchored caseless ... {}{}
For easier/faster scanning the options could be
restricted to be given in sorted order.
* Or options can be given by letters in any order
in an additional argument:
\pdfmatch{<pattern>}{<options>}{<string>}
\pdfmatch{l+}{ai}{Hello World}
The implementation could then use strchr to check,
whether an option is set.
Patch instructions for testing are given in the
patch description at sarovar.
Have fun
Heiko
Heiko, i'm playing with new features provided by your last patch. Impressive!
\pdfmatch [icase] [subcount <number>] {<pattern>}{<string>}
Impressive indeed!!
Implements pattern matching using the POSIX regex (a standard library at least in my linux).
Didn't test deeply yet, but works on Windows as well.
Regards,
--
Pawe/l Jackowski P.Jackowski@gust.org.pl
On 2005-06-25 03:52:01 +0200, Heiko Oberdiek wrote:
\pdfmatch [icase] [subcount <number>] {<pattern>}{<string>}
[...]
While this is a VERY nice feature, I'm reluctant to include this
into 1.30.0 because
- we are (in theory at least) in feature-freeze, and this is
definitely a new feature :-)
- it may need more testing
- I doubt that regex.h is portable; we should keep Windows in mind.
Comments?
Best regards
Martin
--
http://www.tm.oneiros.de
Hi Martin,
\pdfmatch [icase] [subcount <number>] {<pattern>}{<string>}
[...]
While this is a VERY nice feature, I'm reluctant to include this
into 1.30.0 because
- we are (in theory at least) in feature-freeze, and this is
definitely a new feature :-)
- it may need more testing
- I doubt that regex.h is portable; we should keep Windows in mind.
I've tried a couple of examples on Windows with \pdfmatch. They work
fine, but I didn't test deeply. Will try to do that ASAP.
Best regards
--
Pawe/l Jackowski P.Jackowski@gust.org.pl
Pawel Jackowski wrote:
Hi Martin,
\pdfmatch [icase] [subcount <number>] {<pattern>}{<string>}
[...]
While this is a VERY nice feature, I'm reluctant to include this
into 1.30.0 because
- we are (in theory at least) in feature-freeze, and this is
definitely a new feature :-)
- it may need more testing
- I doubt that regex.h is portable; we should keep Windows in mind.
Is there any reason why some non-OS-dependent feature would not compile
on Windows? I expect more problems with old Unix architectures -)
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
| www.pragma-pod.nl
-----------------------------------------------------------------
Martin Schröder wrote:
While this is a VERY nice feature, I'm reluctant to include this
into 1.30.0 because
- we are (in theory at least) in feature-freeze, and this is
definitely a new feature :-)
we can qualify all these new string handling features as experimental;
the advantage of having it in 1.30 is that we can experiment; we can
wait with adding them to the manual till 1.40
Hans
Hans:
we can qualify all these new string handling features as experimental; the advantage of having it in 1.30 is that we can experiment; we can wait with adding them to the manual till 1.40
I've already added some notes (except \pdfmatch), but before release we
can comment them out.
--
Pawe/l Jackowski P.Jackowski@gust.org.pl
On Thu, Jun 30, 2005 at 12:06:32PM +0200, Martin Schröder wrote:
On 2005-06-25 03:52:01 +0200, Heiko Oberdiek wrote:
\pdfmatch [icase] [subcount <number>] {<pattern>}{<string>}
\pdflastmatch <number>
While this is a VERY nice feature, I'm reluctant to include this
This is still experimental. Currently I am experimenting with "pcre",
which is more powerful. Result so far:
* Changed syntax for \pdflastmatch:
  \pdflastmatch <general text>
  Thus keywords can also be used, saving new primitives:
  * \pdflastmatch{<number>} same semantics as before
  * \pdflastmatch{match} = \pdflastmatch{0} (perl: $&)
  * \pdflastmatch{subject} => whole string
  * \pdflastmatch{prematch} => part before "match" (perl: $`)
  * \pdflastmatch{postmatch} => part after "match" (perl: $')
  * \pdflastmatch{last_paren_match} => last substring (perl: $+)
* Unclear: options, how they are specified:
  * \pdfmatch anchored caseless ... {}{}
    * any order?
    * alphabetical order for easier/faster scanner?
  * Or options in a separate argument, similar to keyval?
    \pdfmatch{anchored,caseless,...}{<pattern>}{<subject>}
    or
    \pdfmatch options{...} {<pattern>}{<subject>}
* Possibly a replace feature.
- I doubt that regex.h is portable; we should keep Windows in mind.
regex: at least it is POSIX. Thus there is a chance that this is
implemented in Windows compilers.
pcre: they say it is possible to compile it under Windows.
Yours sincerely
Heiko
On 2005-06-25 03:52:01 +0200, Heiko Oberdiek wrote:
\pdflastmatch <number>
The result of \pdfmatch is stored in an array. The entry "0" contains
the match, the following entries submatches.
[...]
The position "-1" with an empty string indicates that this entry is
not set.
[...]
How can one inquire the number of matches found, i.e. the size of the
array? This would be sub_match_count in utils.c. Or does one have to
parse increasing \pdflastmatch'es till one gets a -1?
Best
Martin
--
http://www.tm.oneiros.de
On 2005-06-25 03:52:01 +0200, Heiko Oberdiek wrote:
New in patch EscapeAndOther:
* \pdfcreationdate
* \pdffilemoddate
* \pdffilesize
* \pdfmdfivesum
* \pdffiledump
* \pdfshellescape
* \pdfmatch and \pdflastmatch
* Start with bug fix: quotes are legal in other operating
systems than windows.
And \pdffiledump. What does it do?
Best
Martin
--
http://www.tm.oneiros.de
On Sat, Jul 02, 2005 at 09:43:34PM +0200, Martin Schröder wrote:
And \pdffiledump. What does it do?
Sorry, I have forgotten the description:
\pdffiledump [offset <int>] [length <int>] <general text>
It returns a hex dump of file, given in <general text>,
starting at given offset or 0 with given length.
\pdffiledump length 20 {abc.bmp}
Reads the first twenty bytes of abc.bmp in binary mode and
returns them as hex string.
Uses: many, e.g. inspecting the file header to get
size information of bitmap files (e.g. for latex/dvips).
Or it could be used to read UCS-2/UCS-4 files, recode them
and read them (\scantokens, \pdfunescapehex) as UTF-8.
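The bitmap use case can be sketched in Python. The function names are hypothetical; the uppercase hex output and the empty result on error are assumptions borrowed from \pdfescapehex and \pdffilesize, and the header offsets assume the common BMP layout with a BITMAPINFOHEADER (width and height as 32-bit little-endian integers at bytes 18 and 22).

```python
import struct

def file_dump(path: str, length: int, offset: int = 0) -> str:
    """Hex dump of `length` bytes of `path`, starting at byte `offset`."""
    try:
        with open(path, "rb") as handle:  # binary mode, as described
            handle.seek(offset)
            return handle.read(length).hex().upper()
    except OSError:
        return ""  # assumption: empty result on error

def bmp_size(path: str):
    """Return (width, height) read from a BMP header, or None on failure."""
    raw = bytes.fromhex(file_dump(path, length=8, offset=18))
    if len(raw) != 8:
        return None
    return struct.unpack("<ii", raw)
```

Only eight bytes of the file are read here, regardless of how large the bitmap is.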
Yours sincerely
Heiko
Heiko Oberdiek wrote:
\pdffiledump length 20 {abc.bmp}
Reads the first twenty bytes of abc.bmp in binary mode and returns them as hex string.
Uses: a lot, eg. inspecting the file header to get size information of bitmap files (eg. for latex/dvips). Or it could be used to read ucs2/4 files, recode them and reading (\scantokens, \pdfunescapehex) them as utf-8.
often one needs more trickery to get a size due to the fact that many
bitmap formats are linked lists of tagged records; i suppose that
reading in a 400K file has some mem penalty -)
Hans
On Mon, Jul 04, 2005 at 03:48:47PM +0200, Hans Hagen wrote:
Heiko Oberdiek wrote:
\pdffiledump length 20 {abc.bmp}
Reads the first twenty bytes of abc.bmp in binary mode and returns them as hex string.
Uses: a lot, eg. inspecting the file header to get size information of bitmap files (eg. for latex/dvips). Or it could be used to read ucs2/4 files, recode them and reading (\scantokens, \pdfunescapehex) them as utf-8.
often one needs more trickery to get a size due to the fact that many bitmap formats are linked lists of tagged records; i suppose that reading in a 400K file has some mem penalty -)
BMP formats are supported by dvips and you need some header
bytes only.
400K files don't need to be read as a whole. An intelligent
implementation would parse only the necessary parts. With the
information from \pdffilesize, files can also be read from the
end, for example to parse PDF or DVI files. But it is easy to
add another option for the read direction, corresponding to the
parameter "whence" of fseek (SEEK_SET, SEEK_CUR, SEEK_END).
Currently the file is not remembered, thus "current" has no meaning.
Keyword "offset" can be given an option "end":
\pdffiledump [offset [end] <int>] [length <int>] <general text>
"offset 10 length 2" means byte 10 and 11,
"offset end 10 length 2" means byte 10 and 11 before end of file.
Yours sincerely
Heiko
Heiko Oberdiek wrote:
400K files don't need to be read as a whole. An intelligent implementation would parse the necessary parts. With the information of \pdffilesize files can also read from behind and parse pdf or dvi files, for example. But it is easy to add another option for the read direction for parameter "whence" of fseek (SEEK_SET, SEEK_CUR, SEEK_END).
that's indeed the point i wanted to make: that kind of parsing only makes sense if one can also seek; better make a full implementation now than wait for new features later.
Currently the file is not remembered, thus "current" has no meaning. Keyword "offset" can be given an option "end":
\pdffiledump [offset [end] <int>] [length <int>] <general text>
"offset 10 length 2" means byte 10 and 11, "offset end 10 length 2" means byte 10 and 11 before end of file.
ah, i see, interesting
Hans
participants (4)
- Hans Hagen
- Heiko Oberdiek
- Martin Schröder
- Pawel Jackowski