On Sat, 5 Aug 2006, Mojca Miklavec wrote:
I would like to ask how difficult it would be to count the number of words in a TeX/ConTeXt document. If it's too complex, please ignore the rest of the message.
Most recipes for LaTeX say that it's best to run something like "pdftotext" and then use "wc" to count the words in the resulting text file, but Windows users don't have "wc", and sometimes you only need to know the length of the abstract or so ...
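For readers without "wc", the counting step of that recipe can be approximated in a few lines of Python. This is only a minimal sketch: it assumes the text has already been extracted to a plain-text file (for example with "pdftotext"), and the function name count_words and the script name count_words.py are made up for illustration. It counts whitespace-separated tokens, which is roughly what "wc -w" reports.

```python
import sys


def count_words(text: str) -> int:
    """Count whitespace-separated tokens, roughly what `wc -w` reports."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading/trailing whitespace.
    return len(text.split())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_words.py abstract.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(count_words(handle.read()))
```

Like "wc", this knows nothing about equations, headers or footers, so the result is only an approximation of the real word count.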
Some time ago Hans mentioned that he counts the number of appearances of single characters, but I don't know how difficult it would be to extend that to count the number of words.
The problem is not that well defined (how to handle equations; some would probably want to exclude headers, footers, buttons, ...), but it only needs to be an approximation. "Backward compatibility" (in the sense that the counter would have to give the same number after some years) is not needed at all, since algorithms might improve with time and the resulting document doesn't really depend on that number; it would only be written to the log file.
My idea for the interface would be something like
\startwordcount[abstract]
\startframedtext
Bla bla.
\stopframedtext
\stopwordcount
which would write something like "abstract: 2 words" to the log file
or
\startstatistics[abstract][words]
\startframedtext
Bla bla.
\stopframedtext
\stopstatistics
But this is really a low priority. I'm currently using Acrobat to copy the text, then I paste it into Office and take a look at statistics there when I need to obey some limitations.
So, if there's a simple solution, I would be glad to use it, but if it takes too much time to implement it, it's probably not worth the effort.
A very crude approach: there is a program called detex
(http://ctan.org/tex-archive/support/detex/). I have not used it, but I
think that it strips every command \something from the TeX file.
Then you can filter the file through wc to get a rough estimate of
the number of words. One approach that should work is
\startstatistics[filename][words|letters|lines]
maps to
\startbuffer[\jobname-statistics-filename]
and
\stopstatistics maps to
\stopbuffer
\getbuffer[\jobname-statistics-filename]
\executesystemcommand{detex \jobname-statistics-filename.tmp | wc