On 8/5/06, Mojca Miklavec wrote:
Hello,
I would like to ask how difficult it would be to count the number of words in a TeX/ConTeXt document. If it's too complex, please ignore the rest of the message.
It wasn't too complex for Michael Downes using LaTeX:

\ProvidesFile{wordcount.tex}[2000/09/27 v1.5 Michael Downes]
% Copyright 2000 Michael John Downes
% This file has no restrictions on its use, distribution, or sale.
%
% If you run LaTeX on wordcount.tex it will prompt you for the name of a
% document to be counted. For most people, however, it will be more
% convenient to run the shell script wordcount.sh, giving the document
% name as the first argument. The comments in wordcount.sh
% give further information about the usage and limitations of this tool.

% The fundamental idea is to mark each character and interword space
% with a unique tag that will show up in TeX "showbox" output. Then
% arrange to make the output routine trigger a TeX overfull vbox message
% for the page box so that everything gets reported in the TeX log.
% Then run grep -c (or an equivalent text search utility, e.g., perl) on
% the log file to count the occurrences.
% [....]
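The last step (grep -c on the log) is easy to reproduce where grep is not installed. Here is a minimal Python sketch of it, not anything taken from wordcount.sh. The function name count_tag_lines and the tag WORDTAG are placeholders I made up; the excerpt above does not show the tag that wordcount.tex really writes, and the sample text is an artificial stand-in, not real TeX log output.

```python
import re


def count_tag_lines(log_text: str, tag_pattern: str) -> int:
    """Return the number of lines containing a match, like `grep -c`."""
    tag = re.compile(tag_pattern)
    return sum(1 for line in log_text.splitlines() if tag.search(line))


# Artificial stand-in for a log; this is NOT real TeX output, and
# WORDTAG is a hypothetical marker chosen only for illustration.
sample_log = "\n".join([
    "some line WORDTAG",
    "some other line",
    "WORDTAG and WORDTAG on one line",
])

print(count_tag_lines(sample_log, r"WORDTAG"))  # prints 2
```

Like grep -c, this counts matching lines rather than individual matches, so it only gives the right total if each tagged item is reported on its own line.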
Most recipes for LaTeX say that it's best to run something like "pdftotext" and then use "wc" to count the words in the resulting text file, but Windows users don't have "wc", and sometimes you only need to know the length of the abstract or so ...
Many GNU utilities have been ported to Windows (GnuWin32), or can be implemented in Perl/Ruby, which ConTeXt uses anyway.
Some time ago Hans mentioned that he counts the number of occurrences of single characters, but I don't know how difficult it would be to extend that to count the number of words.
The problem is not that well defined (how to handle equations; some would probably want to exclude headers, footers, buttons, ...), but it only needs to be an approximation. "Backward compatibility" (in the sense that the counter would have to give the same number some years from now) is not needed at all, since algorithms might improve with time and the resulting document doesn't really depend on that number; it would only be written to the log file.
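To show how little is needed to replace "wc" for this purpose, here is a minimal sketch in Python (chosen only for illustration; Perl or Ruby would be just as short). It assumes the text has already been extracted, for example with "pdftotext", and it counts whitespace-separated tokens the way "wc -w" does. The names count_words and count_words_in_file are my own.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens, roughly what `wc -w` reports."""
    return len(text.split())


def count_words_in_file(path: str, encoding: str = "utf-8") -> int:
    """Count the words in a plain-text file, e.g. one written by pdftotext."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return count_words(handle.read())


print(count_words("Bla bla."))  # prints 2
```

Punctuation stays attached to the neighbouring word, so "Bla bla." counts as two words. Equations, headers and footers that end up in the extracted text are counted as well, so the result is only an approximation.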
My idea for the interface would be something like
\startwordcount[abstract]
\startframedtext
Bla bla.
\stopframedtext
\stopwordcount
which would write something like "abstract: 2 words" to the log file
or
\startstatistics[abstract][words]
\startframedtext
Bla bla.
\stopframedtext
\stopstatistics
But this is really a low priority. I'm currently using Acrobat to copy the text, then I paste it into Office and look at the statistics there when I need to stay within some length limit.
So, if there's a simple solution, I would be glad to use it, but if it takes too much time to implement it, it's probably not worth the effort.
ConTeXt already analyzes the "scratch" files with perl or ruby, so if
you can adapt MD's idea it shouldn't be a big deal to have texexec
print the result.
--
George N. White III