[NTG-context] PDF document statistics (character count incl. spaces)?

Alan BRASLAU alan.braslau at cea.fr
Mon Feb 2 16:39:47 CET 2015


On Mon, 2 Feb 2015 10:20:15 +0100
Keith Schultz <keithjschultz at icloud.com> wrote:

> Hello All,
> 
> As a linguist, I can say that not counting words that are shorter is
> an absolute NO-GO for an accurate word count and thereby character
> count!
> 
> See below for a non-representative proof!
> 
> > Am 01.02.2015 um 22:12 schrieb Wolfgang Schuster
> > <schuster.wolfgang at gmail.com>:
> > 
> [snip, snip]
> 
> > ConTeXt has an option to count the words (you find the result in
> > <jobname>.words) in a document but words words shorter than four
> > letters aren’t taken into account.
> word length under 4 characters  :   10
> word length >= 4 characters     :   20
> 
> Here you are missing a third of the words! That is about 33%.
> 
> regards
> 	Keith
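
The counts quoted above can be checked with a minimal Python sketch. This is not how ConTeXt counts words: the helper name count_by_length and the tokenization rule (runs of ASCII letters, optionally joined by one apostrophe) are assumptions made for illustration only.

```python
import re

# Hypothetical helper for illustration; this is not ConTeXt's word counter.
def count_by_length(text, min_len=4):
    """Return (short, long): words with fewer than min_len characters, and the rest."""
    # Assumed tokenization: runs of ASCII letters, optionally joined by one apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    short = sum(1 for w in words if len(w) < min_len)
    return short, len(words) - short

# The sentence quoted above, with its doubled "words" left in place.
sample = ("ConTeXt has an option to count the words (you find the result in "
          "<jobname>.words) in a document but words words shorter than four "
          "letters aren't taken into account.")

short, long_ = count_by_length(sample)
print(f"under 4 characters  : {short}")
print(f"4 or more characters: {long_}")
print(f"share dropped       : {short / (short + long_):.0%}")
```

The exact totals depend on how tokens such as <jobname>.words are split, so they need not match the quoted figures to the word.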



See also:
Zipf, G. K. (1949), "Human Behavior and the Principle of Least Effort",
Cambridge, MA: Addison-Wesley.

in particular, Chapter 2: On the Economy of Words.


As well as:
Shannon, C. E. (1951), "The redundancy of English", Cybernetics,
248-272.

The redundancy is 54% for English, so we can afford to be sloppy (wch s
wy txt compr qte ll).


Alan

