counting the words in a TeX document

Mojca Miklavec

5 Aug 2006

6:45 p.m.

Hello,

I would like to ask how difficult it would be to count the number of words in a TeX/ConTeXt document. If it's too complex, please ignore the rest of the message.

Most recipes for LaTeX say that it's best to run something like "pdftotext" and then "wc" to count the words in the resulting text file, but Windows users don't have "wc", and sometimes you only need to know the length of the abstract or so ...

Some time ago Hans mentioned that he counts the number of occurrences of single characters, but I don't know how difficult it would be to extend that to count the number of words.

The problem is not that well defined (how to handle equations; some would probably want to exclude headers, footers, buttons, ...), but it only needs to be an approximation. "Backward compatibility" (in the sense that the counter would have to give the same number after some years) is not needed at all, since algorithms might improve with time and the resulting document doesn't really depend on that number; it would only be written to the log file.

My idea for the interface would be something like

    \startwordcount[abstract]
    \startframedtext
    Bla bla.
    \stopframedtext
    \stopwordcount

which would write something like "abstract: 2 words" to the log file, or

    \startstatistics[abstract][words]
    \startframedtext
    Bla bla.
    \stopframedtext
    \stopstatistics

But this is really a low priority. I'm currently using Acrobat to copy the text, then I paste it into Office and take a look at the statistics there when I need to obey some limitations.

So, if there's a simple solution, I would be glad to use it, but if it takes too much time to implement it, it's probably not worth the effort.

Thanks a lot,
Mojca


Aditya Mahajan

5 Aug

7:02 p.m.

A very crude approach: there is a program called detex

http://ctan.org/tex-archive/support/detex/

I have not used it, but I think that it strips off every command \something from the tex file. Then you can filter the file through wc to get a rough estimate of the number of words.

One approach that will work is

    \startstatistics[filename][words|letters|lines]

maps to

    \startbuffer[\jobname-statistics-filename]

and

    \stopstatistics

maps to

    \stopbuffer
    \getbuffer[\jobname-statistics-filename]
    \executesystemcommand{detex \jobname-statistics-filename.tmp | wc }

and possibly prettify the output so that it is more clearly visible in the log.

Another approach could be to write a vim script so that you can count the number of words in a visually highlighted area.

Aditya
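For readers without "wc", the counting step at the end of the pipeline above could be done by a few lines of Ruby. This is only a minimal sketch: the name text_statistics is made up for illustration, and treating "letters" as alphabetic characters is an assumption about what the [words|letters|lines] option should mean.

```ruby
# A tiny stand-in for `wc`, for systems that lack it.
# Returns the number of lines, words and letters in a string.
# (text_statistics is a hypothetical helper name, not part of ConTeXt.)
def text_statistics(text)
  {
    lines:   text.lines.size,               # newline-terminated chunks
    words:   text.split.size,               # whitespace-separated tokens
    letters: text.scan(/[[:alpha:]]/).size  # alphabetic characters only (an assumption)
  }
end

# When run as a script, report the statistics of the file named on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  text_statistics(File.read(ARGV[0])).each do |name, value|
    puts "#{name}: #{value}"
  end
end
```

It could be fed the output of detex (or any other plain-text file) as its first argument.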

gnwiii＠gmail.com

7:52 p.m.

On 8/5/06, Mojca Miklavec wrote:

...

Hello,

I would like to ask how difficult it would be to count the number of words in a TeX/ConTeXt document. If it's too complex, please ignore the rest of the message.

It wasn't too complex for Michael Downes using LaTeX:

    \ProvidesFile{wordcount.tex}[2000/09/27 v1.5 Michael Downes]
    % Copyright 2000 Michael John Downes
    % This file has no restrictions on its use, distribution, or sale.
    %
    % If you run LaTeX on wordcount.tex it will prompt you for the name of a
    % document to be counted. For most people, however, it will be more
    % convenient to run the shell script wordcount.sh, giving the document
    % name as the first argument. The comments in wordcount.sh
    % give further information about the usage and limitations of this tool.
    % The fundamental idea is to mark each character and interword space
    % with a unique tag that will show up in TeX "showbox" output. Then
    % arrange to make the output routine trigger a TeX overfull vbox message
    % for the page box so that everything gets reported in the TeX log.
    % Then run grep -c (or an equivalent text search utility, e.g., perl) on
    % the log file to count the occurrences.
    % [....]

...

Most recipes for LaTeX say that it's best to do something like "pdftotext" and then issue "wc" to count the words in the resulting text file, but windows users don't have "wc" and sometimes you only need to know the length of the abstract or so ...

Many GNU utilities have been ported to Windows (GnuWin32), or can be implemented in Perl or Ruby, which ConTeXt uses anyway.

ConTeXt already analyzes the "scratch" files with Perl or Ruby, so if you can adapt MD's idea it shouldn't be a big deal to have texexec print the result.

--
George N. White III
Head of St. Margarets Bay, Nova Scotia
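The "grep -c on the log file" step of Michael Downes's approach is easy to reproduce in Ruby. The sketch below only illustrates the counting; the marker string is a placeholder, since the actual tag depends on how the TeX side labels interword spaces in its showbox output, and count_marked_lines is a made-up name.

```ruby
# Count the log lines that contain a marker, like `grep -c MARKER file.log`.
# MARKER is a placeholder (an assumption); the real tag is whatever the TeX
# macros write into the showbox output for each interword space.
MARKER = "WORDSPACE-TAG"

def count_marked_lines(log_text, marker = MARKER)
  log_text.each_line.count { |line| line.include?(marker) }
end

# When run as a script, count the markers in the log file given as argument.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  puts count_marked_lines(File.read(ARGV[0]))
end
```

Like grep -c, this counts matching lines rather than individual occurrences, so it assumes each tag ends up on a line of its own in the log.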

Hans Hagen

10:07 p.m.

Mojca Miklavec wrote:

...

Hello,

I would like to ask how difficult it would be to count the number of words in a TeX/ConTeXt document. If it's too complex, please ignore the rest of the message.

the way i do such things (and worse trickery) is using pdftotext

you can of course use tex, but then there can be generated words and so on, and it is insane to use tex (or adapt a tex style) for that; it may help to run with (nondestructive)

    \setupalign[nothyphenated]

anyhow, here is a script (i could not locate my normal one)

=== wordcount.rb ===

    if (file = ARGV[0]) && file && FileTest.file?(file) then
        begin
            system("pdftotext #{ARGV[0]} wc.log")
            data = IO.read("wc.log")
            data.gsub!(/\d[\.\:]*\w+/o) do ' ' end # remove suffixes
            data.gsub!(/\d/o) do ' ' end # remove numbers
            data.gsub!(/\-\s+/mo) do ' ' end # remove hyphenation
            data.gsub!(/\-/mo) do ' ' end # split compound words
            # remove punctuation
            data.gsub!(/[\.\,\<\>\/\?\\\|\'\"\;\:\]\{\}\{\+\=\-\_\)\(\*\&\^\%\$\#\@\!\~\`]/mo) do ' ' end
            words = data.split(/\s+/)
            count = Hash.new
            words.each do |w|
                count[w] = (count[w] || 0) + 1
            end
        rescue
            puts("some error #{$!}")
        else
            puts("words : #{words.size}")
            puts("unique : #{count.size}")
        end
        if ARGV[1] =~ /list/ then
            puts("\n")
            count.sort.each do |k,v|
                puts("#{k} : #{v}")
            end
        end
    end

usage: wc filename.pdf [list]

if this kind of stuff is useful, we can add it to one of the scripts that come with context

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74
www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------

Mojca Miklavec

6 Aug

2:31 a.m.

Thanks a lot! I guess that's *it*! I always forget about the most powerful feature of ConTeXt in comparison to LaTeX - scripting can be added to almost any place (and the user doesn't need to install any additional executables, such as "detex" mentioned by Aditya).

Here's some of my feedback:
- pdftotext is far from being useful for pdf to text conversion (doesn't handle any accents), but is perfectly suitable for wordcount
- \[ is missing in the last gsub (only the right bracket is deleted)
- something strange (but not critical) happens to en-dashes

But everything else looks like a perfect functionality for ctxtools --wordcount.

On 8/5/06, Aditya Mahajan wrote:

...

A very crude approach. There is a program called detex http://ctan.org/tex-archive/support/detex/ I have not used it, but I think that it strips off every command \something from the tex file. Then you can filter the file through wc to get a rough estimate of the number of words. One approach that will work is

\startstatistics[filename][words|letters|lines]

maps to

\startbuffer[\jobname-statistics-filename]

and

\stopstatistics maps to

\stopbuffer \getbuffer[\jobname-statistics-filename] \executesystemcommand{detex \jobname-statistics-filename.tmp | wc }

I took a look, but it merely looks like a parser for hardcoded (La)TeX (someone should correct me if I'm wrong). However, abstracts for which one might need a word count usually don't involve much trickery (they're usually almost pure plain text), so doing the same with a simple ruby script, instead of compiling/installing some external LaTeX-aware C program, might already lead to satisfactory results.

...

It wasn't too complex for Michael Downes using LaTeX:

    \ProvidesFile{wordcount.tex}[2000/09/27 v1.5 Michael Downes]
    % Copyright 2000 Michael John Downes
    % This file has no restrictions on its use, distribution, or sale.
    %
    % If you run LaTeX on wordcount.tex it will prompt you for the name of a
    % document to be counted. For most people, however, it will be more

This solution is likely to produce better results (it just involves slightly more work). It actually runs (La)TeX and merely redefines a few commands beforehand, so that counting the words becomes straightforward parsing of the log file, based on the number of certain boxes.

Based on those three answers I got a clearer idea of two (different, but complementary) methods that might be sensible:

a) ctxtools --wordcount filename[tex|pdf]
   to do the wordcount for the whole document using pdftotext + ruby regexp

b) \usemodule[wordcount]

   whatever

   \startstatistics[name][words|letters|lines]
   some more-or-less plain text
   \stopstatistics

   whatever

   and according to Aditya's idea, run a (ruby) regular expression (instead of detex) on it which would write the nicely formatted desired number to the output/log file. (I don't know if it's possible to use the first approach for the second problem, but it doesn't make sense to complicate things too much.)

As long as the command names are carefully chosen (and extensible if the need for more complex behaviour arises in the future), that should be about everything, and it doesn't seem so difficult to implement after all. (But I would write in the documentation that the resulting numbers might change slightly in the future if the algorithm for counting the words is improved.)

Any thoughts?

Thanks a lot,
Mojca
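Two of the points raised in this thread about the clean-up step (the missing left bracket and the odd handling of en-dashes) can be sidestepped by replacing everything that is neither a letter nor whitespace, instead of listing punctuation characters one by one. The following is a minimal sketch under that assumption; it is not the script that was posted, count_words is a made-up name, and it operates on text that has already been extracted (for example by pdftotext).

```ruby
# Count total and unique words in plain text.
# Assumption: anything that is not a letter or whitespace separates words,
# which covers both square brackets and Unicode dashes without listing them.
def count_words(text)
  cleaned = text
    .gsub(/\d[.:]*\w+/, " ")        # numbers with suffixes, e.g. "3rd"
    .gsub(/[^[:alpha:]\s]+/, " ")   # digits, punctuation, brackets, dashes
  words = cleaned.split
  counts = Hash.new(0)
  words.each { |w| counts[w] += 1 }
  [words.size, counts.size]
end

# When run as a script, report the counts for the text file given as argument.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  total, unique = count_words(File.read(ARGV[0], encoding: "UTF-8"))
  puts "words  : #{total}"
  puts "unique : #{unique}"
end
```

As in the posted script, apostrophes and hyphens split words, so "don't" counts as two words; whether that is acceptable depends on how exact the count needs to be.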

Hans Hagen

5 p.m.

Mojca Miklavec wrote:

...

Based on those three answers I got a clearer idea of two (different, but complementary) methods that might be sensible:

a) ctxtools --wordcount filename[tex|pdf] to do the wordcount for the whole document using pdftotext + ruby regexp

counting words in tex docs is not that hard: it needs in addition:

- delete all the words starting with \
- delete everything (nested) between [ ]

Hans
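The two extra steps listed above could look roughly like this in Ruby. It is a sketch only: strip_tex and count_tex_words are made-up names, dropping the braces is an additional assumption that is not part of the list, and input that uses \[ ... \] for display math would confuse the bracket removal.

```ruby
# Reduce TeX source to (approximately) its plain words.
def strip_tex(source)
  text = source.dup
  # Delete everything between [ ], innermost groups first, so that nested
  # brackets disappear after repeated passes (gsub! returns nil when done).
  loop do
    break unless text.gsub!(/\[[^\[\]]*\]/, " ")
  end
  # Delete all the "words" starting with a backslash: control words such as
  # \startframedtext and control symbols such as \-.
  text.gsub!(/\\(?:[A-Za-z]+|.)/, " ")
  # Braces only group their content; remove them (an extra step, an assumption).
  text.gsub!(/[{}]/, "")
  text
end

def count_tex_words(source)
  strip_tex(source).split.size
end

# When run as a script, count the words of the TeX file given as argument.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  puts "words : #{count_tex_words(File.read(ARGV[0]))}"
end
```

The bracket step runs before the backslash step so that commands inside optional arguments, such as \textwidth in [width=\textwidth], are discarded together with the argument.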

Aditya Mahajan

7:27 p.m.

If you have a script that counts words in a ConTeXt document, the second approach is straightforward. Write everything to a buffer and run the script on the buffer. However, such a mechanism will never be perfect (or close to perfect) in the sense of parsing arbitrary input.

ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex

But of course, you will not write anything like this in an abstract :-)

Aditya

Mojca Miklavec

7 Aug

10:24 a.m.

On 8/6/06, Aditya Mahajan wrote:

If you have a script that counts words in a ConTeXt document, the second approach is straightforward. Write everything to a buffer and run the script on the buffer. However, such a mechanism will never be perfect (or close to perfect) in the sense of parsing arbitrary input.

The most naive solution that I could think of (using a slightly modified version of Hans's ruby script):

    \unprotect

    \def\startstatistics
      {\dodoubleempty\dostartstatistics}

    \def\dostartstatistics[#1][#2]#3\stopstatistics
      {\setbuffer[#1]#3\endbuffer
       \executesystemcommand{ruby wordcount.rb \jobname-#1.tmp}%
       \getbuffer[#1]}

    \protect

    \doifnotmode{demo}{\endinput}

... but a friend who asked me for a favour actually wants to use abbreviations and a bibliography as well, so only the first method (creating the PDF first) would work. He currently keeps copy-pasting the resulting PDF into Word and uses Word's statistics to count the words and/or characters for him. But I guess that his wishes will have to wait for some more time in this case.

...

ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex

But of course, you will not write anything like this in an abstract :-)

Nevertheless, I love the story (and esp. the document which creates it)!

All the best,
Mojca

Hans Hagen

11:22 a.m.

Mojca Miklavec wrote:

...

...
ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex

yeah, a famous tex masterpiece!

...
But of course, you will not write anything like this in an abstract :-)

hm, let me provide a word counter for that one before you get the idea to ask for it -)

    \starttext

    \setbox0\vbox\bgroup
    % \tracingall -)
    \forgetall \nohyphens \hsize1mm
    \let\bye\egroup
    \bgroup
    \let~\catcode~`76~`A13~`F1~`j00~`P2jdefA71F~`7113jdefPALLF
    PA''FwPA;;FPAZZFLaLPA//71F71iPAHHFLPAzzFenPASSFthP;A$$FevP
    A@@FfPARR717273F737271P;ADDFRgniPAWW71FPATTFvePA**FstRsamP
    AGGFRruoPAqq71.72.F717271PAYY7172F727171PA??Fi*LmPA&&71jfi
    Fjfi71PAVVFjbigskipRPWGAUU71727374 75,76Fjpar71727375Djifx
    :76jelse&U76jfiPLAKK7172F71l7271PAXX71FVLnOSeL71SLRyadR@oL
    RrhC?yLRurtKFeLPFovPgaTLtReRomL;PABB71 72,73:Fjif.73.jelse
    B73:jfiXF71PU71 72,73:PWs;AMM71F71diPAJJFRdriPAQQFRsreLPAI
    I71Fo71dPA!!FRgiePBt'el@ lTLqdrYmu.Q.,Ke;vz vzLqpip.Q.,tz;
    ;Lql.IrsZ.eap,qn.i. i.eLlMaesLdRcna,;!;h htLqm.MRasZ.ilk,%
    s$;z zLqs'.ansZ.Ymi,/sx ;LYegseZRyal,@i;@ TLRlogdLrDsW,@;G
    LcYlaDLbJsW,SWXJW ree @rzchLhzsW,;WERcesInW qt.'oL.Rtrul;e
    doTsW,Wk;Rri@stW aHAHHFndZPpqar.tridgeLinZpe.LtYer.W,:jbye
    \egroup

    \newcounter\NOfLines
    \beginshapebox \unvcopy0 \endshapebox
    \reshapebox{\doglobal\increment\NOfLines}

    \getnoflines{\ht0}

    lines: \the\noflines words: \NOfLines\par

    % \unvbox0

    \stoptext

Mojca Miklavec

8:54 p.m.

On 8/7/06, Hans Hagen wrote:

...

Mojca Miklavec wrote:

...
...
ftp://tug.ctan.org/pub/tex-archive/macros/plain/contrib/misc/xii.tex

yeah, a famous tex master piece!

...
But of course, you will not write anything like this in an abstract :-)

hm, let me provide a word counter for that one before you get the idea to ask for it -)

Whom did you have in mind? I would never have thought about asking such a question. ;)

(I'll spare you the fun with sections for some other time,) but since you reminded me that I might have some questions left, here you have another one: how do I replace hyphens, en-dashes and em-dashes with "spaces/line breaks"?

    \catcode`~=13\let~=\space

does what I want, but none of the following works:

    \def\-{\space}
    \def-{\space}
    \let\-=\space

Thanks to the magicians,
Mojca

Hans Hagen

10:55 p.m.

Mojca Miklavec wrote:

(I'll spare you the fun with sections for some other time,) but since you reminded me that I might have some questions left, here you have another one: how do I replace hyphens, en-dashes and em-dashes with "spaces/line breaks"?

    \catcode`~=13\let~=\space

does what I want, but none of the following works:

    \def\-{\space}
    \def-{\space}
    \let\-=\space

    \catcode`-=\active \def-{ }

Hans

Mojca Miklavec

11:31 p.m.

On 8/7/06, Hans Hagen wrote:

\catcode`-=\active \def-{ }

I tried that one already, but it didn't work. Now I figured out that it was because of nesting the definitions (perhaps even some interference with negative numbers?), not because the definition itself was wrong.

I'm sorry.

Mojca

(But my fear is that the whole problem is too complex anyway (tables, ...) to be solved elegantly.)

    \long\def\startstatistics#1\stopstatistics
      {\setbox0\vbox\bgroup
       % \tracingall -)
       \forgetall \nohyphens \hsize1mm
       % treat non-breakable space as a normal one
       \catcode`~=13\let~=\space
       % treat en-dash as a normal one
       \catcode`-=\active \def-{ } % ERROR
       \bgroup#1\egroup\egroup
       \newcounter\NOfLines
       \beginshapebox \unvcopy0 \endshapebox
       \reshapebox{\doglobal\increment\NOfLines}
       #1\crlf\unvbox0\crlf
       words: \NOfLines\crlf}

    \starttext

    \startstatistics
    abc~def ghi-jkl -- mno --- prs
    \stopstatistics

    % works OK
    %abc-def -- ghi --- jkl
    % \catcode`-=\active \def-{ }
    %abc-def -- ghi --- jkl

    \stoptext

Aditya Mahajan

8 Aug

2:49 a.m.

On Mon, 7 Aug 2006, Mojca Miklavec wrote:

(But my fear is that the whole problem is too complex anyway (tables, ...) to be solved elegantly.)

You should not be writing tables in abstracts!

Here is my attempt. It seems to work correctly for simple text, references, simple markup etc. Try anything too fancy and you are in trouble. I changed the name to start/stop stats, as I kept mistyping startstatistics :-).

    \starttext

    \bgroup
    \catcode`~=\active
    \catcode`-=\active
    \gdef\ignorestats%
      {% treat non-breakable space as a normal one
       \catcode`~=\active \let~=\space
       % treat endash, emdash and - as normal space
       \catcode`-=\active \def-{ }
       %\setupframed[align=normal]%Frames do not work correctly
      }

    \gdef\startdostats%
      {\bgroup
       \setbox0\vbox\bgroup
       % \tracingall -)
       \forgetall \nohyphens \hsize1mm}

    \gdef\stopdostats%
      {\egroup
       \newcounter\NOfLines
       \dontcomplain %Why do I still get overfull \hbox warnings
       \beginshapebox \unvcopy0 \endshapebox
       \reshapebox{\doglobal\increment\NOfLines}
       \getnoflines{\ht0}
       \unvbox0 %Uncomment for debug
       \par lines: \the\noflines\space words: \NOfLines\par\egroup}

    \long\gdef\startstats#1\stopstats%
      {\bgroup\ignorestats
       \startdostats\scantokens{#1}\stopdostats\egroup}
    \egroup

    \def\ShowStats#1{\hairline#1\par\startstats#1\stopstats}

    \ShowStats{abc~def ghi-jkl -- mno --- prs}
    \ShowStats{abc-def -- ghi --- jkl}
    \ShowStats{a, b}

    \section[a]{one}
    \ShowStats{We do some great things in \in{section}[a]}
    % I do not know the internals, but section 1 seems unbreakable

    \ShowStats{$a=b$}
    %What did you expect? It may be possible to treat
    %each math token as mathord and allow it to break
    %but that will not give any better results.

    \startbuffer
    This is a test
    \stopbuffer

    \ShowStats{\getbuffer}
    \ShowStats{\startformula a = b + c \stopformula}
    \ShowStats{\framed{This is a test}}
    \ShowStats{\starthiding Another test \stophiding Does this work?}
    % Buffers do not work and fail silently.
    \ShowStats{This is {\bf Bold} and {\it Italic}}
    \ShowStats{\input tufte}

    \stoptext

Aditya

Hans Hagen

9:54 a.m.

Mojca Miklavec wrote:

(But my fear is that the whole problem is too complex anyway (tables, ....) to be solved elegantly.)

i'm nearly 100% sure that you will never manage to make that work; if you want to count words, you need to intercept them at the input stage and/or interpret the output

Hans
