PDF document statistics (character count incl. spaces)?

Does the character count that “wc --chars <textfile>” returns include blank spaces or not? (That is important to me.) “man wc” doesn’t say.

I had hoped there was a better way than editing the output of “pdftotext” in my text editor or in LibreOffice Writer (deleting unnecessary carriage returns and spaces with regular-expression searches), both of which can produce the count I need. In fact I had hoped that ConTeXt could count the characters and spaces it renders to PDF (is that theoretically possible?) …

Greetings
Jörg

On 01.02.2015 20:11, Aditya Mahajan wrote:

...

On Sun, 1 Feb 2015, Jörg Weger wrote:

...
Is there a way to report the “character count including spaces” of the resulting PDF in ConTeXt?

Given that these counts are never accurate, how about

pdftotext filename

followed by

wc filename

Aditya
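On what `wc` actually counts: `wc --chars` counts every character in the file, including spaces, tabs and line breaks, while `wc --bytes` counts bytes. The following is a minimal Python sketch of the same three counts, assuming the pdftotext output is UTF-8 text. The function names are illustrative and not part of any tool mentioned in this thread.

```python
def count_chars(text: str) -> int:
    """Characters including spaces and line breaks
    (roughly what `wc --chars` reports in a UTF-8 locale)."""
    return len(text)


def count_bytes(text: str) -> int:
    """Bytes of the UTF-8 encoding (what `wc --bytes` reports for a UTF-8 file)."""
    return len(text.encode("utf-8"))


def count_non_blank(text: str) -> int:
    """Characters that are not whitespace (no spaces, tabs or line breaks)."""
    return sum(1 for ch in text if not ch.isspace())


if __name__ == "__main__":
    sample = "J\u00f6rg Weger\n"  # "Jörg Weger" followed by a line break
    print(count_chars(sample), count_bytes(sample), count_non_blank(sample))
```

The character and byte counts differ as soon as the text contains non-ASCII letters, which is why the character option is the one to use for this purpose.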

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________

Wolfgang Schuster

10:12 p.m.

...

On 01.02.2015 at 22:06, Jörg Weger wrote:

Does the character count that “wc --chars <textfile>” returns include blank spaces or not? (That is important to me.) “man wc” doesn’t say.

I had hoped there was a better way than to edit the result of “pdftotext” in my text editor or in libreoffice writer (deleting unnecessary carriage returns and spaces by searching for regular expressions) which are able to do the count I need. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …

ConTeXt has an option to count the words in a document (you find the result in <jobname>.words), but words shorter than four letters aren’t taken into account.

\setupspellchecking[state=start,method=2]
\starttext
\input knuth
\stoptext

Wolfgang

Idris Samawi Hamid ادريس سماوي حامد

10:32 p.m.

On Sun, 01 Feb 2015 14:12:54 -0700, Wolfgang Schuster wrote:

...

\setupspellchecking[state=start,method=2]
\starttext
\input knuth
\stoptext

Slightly off-topic: Just as Wolfgang's reply came in I was setting up a new version of http://tinyspell.com/ Editor-based spell-checkers are usually not very useful (although some LaTeX-centric editors are pretty good at it.) I never knew about \setupspellchecking before now. Perhaps it could evolve into something very useful. Part of spell-checking involves getting uppercase vs lowercase right. I see that the .words output of \setupspellchecking ignores case, and treats '-' (the simple dash) as a word separator. I'd like to see this evolve into something more precise.

...

words shorter than four letters aren’t taken into account.

I get *some* words shorter than four letters in the output, so there must be some other logic going on...

Thanks for pointing out this utility, Wolfgang, and

Best wishes
Idris

--
Idris Samawi Hamid
Professor of Philosophy
Colorado State University
Fort Collins, CO 80523

Wolfgang Schuster

11:11 p.m.

...

On 01.02.2015 at 22:32, Idris Samawi Hamid ادريس سماوي حامد wrote:

...
words shorter than four letters aren’t taken into account.

I get *some* words shorter than four letters in the output, so there must be some other logic going on…

Do you have a few examples? Wolfgang

Idris Samawi Hamid ادريس سماوي حامد

11:27 p.m.

On Sun, 01 Feb 2015 15:11:48 -0700, Wolfgang Schuster wrote:

...

...
On 01.02.2015 at 22:32, Idris Samawi Hamid ادريس سماوي حامد wrote:

...
words shorter than four letters aren’t taken into account.

I get *some* words shorter than four letters in the output, so there must be some other logic going on…

Do you have a few examples?

A quick one:

=======
\setupspellchecking[state=start,method=2]
\starttext
Dār is the Arabic word for home.
\stoptext
=======

--
Idris Samawi Hamid
Professor of Philosophy
Colorado State University
Fort Collins, CO 80523

Keith Schultz

2 Feb

10:20 a.m.

Hello All,

As a linguist, I can say that not counting shorter words is an absolute NO-GO for an accurate word count and thereby character count!

See below for a non-representative proof!

...

On 01.02.2015 at 22:12, Wolfgang Schuster wrote:

[snip, snip]

...

ConTeXt has an option to count the words in a document (you find the result in <jobname>.words), but words shorter than four letters aren’t taken into account.

word length under 4 characters : 10
word length >= 4 characters    : 20

Here you are missing a third of the words! That is about 33%.

regards
Keith
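With 10 short words against 20 longer ones, the short words make up 10 of 30, which is one third. A count like this can be reproduced with a small Python sketch. The tokenization (runs of letters) and the threshold of four are assumptions made for illustration; this is not the logic ConTeXt uses to build its .words file.

```python
import re


def split_by_length(text: str, threshold: int = 4):
    """Split the words of `text` into those shorter than `threshold`
    letters and those with at least `threshold` letters.

    A "word" is taken to be a run of letters (an assumption; digits,
    punctuation and underscores separate words).
    """
    words = re.findall(r"[^\W\d_]+", text)
    short = [w for w in words if len(w) < threshold]
    longer = [w for w in words if len(w) >= threshold]
    return short, longer


def short_share(text: str, threshold: int = 4) -> float:
    """Fraction of words that would be dropped by a length filter."""
    short, longer = split_by_length(text, threshold)
    total = len(short) + len(longer)
    return len(short) / total if total else 0.0


if __name__ == "__main__":
    sentence = "here you are missing a third of the words"
    short, longer = split_by_length(sentence)
    print(len(short), len(longer), round(short_share(sentence), 2))
```

How apostrophes and markup such as <jobname> are tokenized changes the totals slightly, so counts from different tools will not match exactly.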

Alan BRASLAU

4:39 p.m.

On Mon, 2 Feb 2015 10:20:15 +0100 Keith Schultz wrote:

...

Hello All,

As a linguist, I can say that not counting shorter words is an absolute NO-GO for an accurate word count and thereby character count!

See below for a non-representative proof!

...
On 01.02.2015 at 22:12, Wolfgang Schuster wrote:

[snip, snip]

...
ConTeXt has an option to count the words in a document (you find the result in <jobname>.words), but words shorter than four letters aren’t taken into account.

word length under 4 characters : 10
word length >= 4 characters    : 20

Here you are missing a third of the words! That is about 33%.

regards Keith

See also:
Zipf, G. K. (1949), "Human Behavior and the Principle of Least Effort", Cambridge, MA: Addison-Wesley; in particular, Chapter 2: On the Economy of Words.

As well as:
Shannon, C. E. (1951), "The redundancy of English", Cybernetics, 248-272.

54% for English, so we can afford to be sloppy (wch s wy txt compr qte ll).

Alan

Hans Hagen

5:55 p.m.

On 2/2/2015 4:39 PM, Alan BRASLAU wrote:

...

...
...
ConTeXt has an option to count the words in a document (you find the result in <jobname>.words), but words shorter than four letters aren’t taken into account.

word length under 4 characters : 10
word length >= 4 characters    : 20

Here you are missing a third of the words! That is about 33%.

this feature relates to (simple) spell checking and collecting words for dedicated spell check lists, and 4 chars is nearly always a valid word, which is why we discard them

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------

Alan BRASLAU

3 Feb

4:19 a.m.

On Mon, 2 Feb 2015 17:55:35 +0100 Hans Hagen wrote:

...

this feature relates to (simple) spell checking and collecting words for dedicated spell check lists, and 4 chars is nearly always a valid word, which is why we discard them

English is rich in "four-letter words"! Alan ;-)

Hans Hagen

2 Feb

12:56 a.m.

On 2/1/2015 10:06 PM, Jörg Weger wrote:

...

Does the character count that “wc --chars <textfile>” returns include blank spaces or not? (That is important to me.) “man wc” doesn’t say.

I had hoped there was a better way than to edit the result of “pdftotext” in my text editor or in libreoffice writer (deleting unnecessary carriage returns and spaces by searching for regular expressions) which are able to do the count I need. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …

it's not too hard so maybe when i'm bored or see a good reason ..

Hans

Jörg Weger

11:39 p.m.

So I hope you might get bored once in a while before I have to write my bachelor thesis :)

Greetings
Jörg

On 02.02.2015 00:56, Hans Hagen wrote:

...

On 2/1/2015 10:06 PM, Jörg Weger wrote:

...
Does the character count that “wc --chars <textfile>” returns include blank spaces or not? (That is important to me.) “man wc” doesn’t say.

I had hoped there was a better way than to edit the result of “pdftotext” in my text editor or in libreoffice writer (deleting unnecessary carriage returns and spaces by searching for regular expressions) which are able to do the count I need. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …

it's not too hard so maybe when i'm bored or see a good reason ..

Hans


Marcin Borkowski

9:45 p.m.

On 2015-02-01, at 22:06, Jörg Weger wrote:

...

Does the character count that “wc --chars <textfile>” returns include blank spaces or not? (That is important to me.) “man wc” doesn’t say.

I had hoped there was a better way than to edit the result of “pdftotext” in my text editor or in libreoffice writer (deleting unnecessary carriage returns and spaces by searching for regular expressions) which are able to do the count I need. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …

I am pretty sure that you can make sed filter out blank characters. So then you can just chain pdftotext, sed and wc.

OTOH, here's a relevant question (and a simple answer) on SO. (It seems to count newlines, though.)

JFF, I've just coded this in Emacs Lisp:

--8<---------------cut here---------------start------------->8---
;; Count non-blank characters in a buffer
(defun how-many-visible-chars ()
  "Count visible (i.e., other than spaces, tabs and newlines) characters in the buffer."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        (forward-char)))
    (message "%d visible characters" count)))
--8<---------------cut here---------------end--------------->8---

It's terribly unoptimized, but I ran it on a 300+ kB file on my low-end netbook and it ran in something like 2 seconds, so it's not that bad in practice. Also, it's not well-coded: it should e.g. return the number instead of displaying the message when called non-interactively, it might take active region into account etc. - but as a proof-of-concept, it works surprisingly well (i.e., fast).
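For the count originally asked for (characters including spaces, but without the extra line breaks and repeated blanks that pdftotext emits), the chain of extraction, filtering and counting could look like the Python sketch below. It assumes that pdftotext is installed and that passing "-" as the output file makes it write to standard output. The function names are illustrative, and collapsing each run of whitespace into a single space is one possible reading of "unnecessary" blanks.

```python
import re
import subprocess


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace (spaces, tabs, line breaks,
    form feeds) into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def count_with_spaces(text: str) -> int:
    """Character count including spaces, after whitespace normalization."""
    return len(normalize_whitespace(text))


def pdf_text(pdf_path: str) -> str:
    """Extract the text of a PDF by running `pdftotext <file> -`."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return result.stdout


if __name__ == "__main__":
    import sys

    # Usage: python count_pdf_chars.py document.pdf
    print(count_with_spaces(pdf_text(sys.argv[1])))
```

Words hyphenated across line breaks, running headers and page numbers still end up in the extracted text, so the result remains an approximation of what was typeset.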

...

Greetings Jörg

Best,

--
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University
