PDF document statistics (character count incl. spaces)?
Is there a way to report the “character count including spaces” of the resulting PDF in ConTeXt?
Greetings
Jörg
On Sun, 1 Feb 2015, Jörg Weger wrote:
Is there a way to report the “character count including spaces” of the resulting PDF in ConTeXt?
Given that these counts are never accurate, how about
pdftotext filename
followed by
wc filename
Aditya
Does the character count that “wc --char <textfile>” returns include blank spaces or not? (That is important for me.) “man wc” doesn’t talk about that.
I had hoped there was a better way than editing the result of “pdftotext” in my text editor or in LibreOffice Writer (deleting unnecessary carriage returns and spaces by searching for regular expressions), both of which are able to do the count I need. In fact, I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …
Greetings
Jörg
On 01.02.2015 20:11, Aditya Mahajan wrote:
On Sun, 1 Feb 2015, Jörg Weger wrote:
Is there a way to report the “character count including spaces” of the resulting PDF in ConTeXt?
Given that these counts are never accurate, how about
pdftotext filename
followed by
wc filename
Aditya
___________________________________________________________________________________ If your question is of interest to others as well, please add an entry to the Wiki!
maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context webpage : http://www.pragma-ade.nl / http://tex.aanhet.net archive : http://foundry.supelec.fr/projects/contextrev/ wiki : http://contextgarden.net ___________________________________________________________________________________
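On the question of whether wc counts blanks: as far as I know, “wc -m” (long form “--chars”) counts every character it reads, so spaces, tabs and newlines are all included. A minimal shell sketch to check this on a known string follows. The helper name count_chars is made up for illustration, and the trailing tr only strips the padding that some wc implementations print around the number.

```shell
# Hypothetical helper: count all characters on stdin, blanks and newlines included.
# Note: wc -m counts characters according to the current locale.
count_chars() {
  wc -m | tr -d ' '
}

printf 'a b\n' | count_chars   # prints 4: "a", the space, "b" and the newline
printf 'ab\n'  | count_chars   # prints 3: "a", "b" and the newline
```

So the line breaks that pdftotext writes are counted as well, which is one reason the raw number differs from what a word processor reports.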
On 01.02.2015 at 22:06, Jörg Weger wrote:
Is the character count “wc --char <textfile>” returns with or without blank spaces? (Which is important for me.) “man wc” doesn’t talk about that.
I had hoped there was a better way than to edit the result of “pdftotext” in my text editor or in libreoffice writer (deleting unnecessary carriage returns and spaces by searching for regular expressions) which are able to do the count I need. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …
ConTeXt has an option to count the words in a document (you find the result in <jobname>.words), but words shorter than four letters aren’t taken into account.
\setupspellchecking[state=start,method=2]
\starttext
\input knuth
\stoptext
Wolfgang
On Sun, 01 Feb 2015 14:12:54 -0700, Wolfgang Schuster wrote:
\setupspellchecking[state=start,method=2]
\starttext
\input knuth
\stoptext
Slightly off-topic: Just as Wolfgang's reply came in I was setting up a new version of http://tinyspell.com/ Editor-based spell-checkers are usually not very useful (although some LaTeX-centric editors are pretty good at it.) I never knew about \setupspellchecking before now. Perhaps it could evolve into something very useful. Part of spell-checking involves getting uppercase vs lowercase right. I see that the .words output of \setupspellchecking ignores case, and treats '-' (the simple dash) as a word separator. I'd like to see this evolve into something more precise.
words shorter than four letters aren’t taken into account.
I get *some* words shorter than four letters in the output, so there must be some other logic going on...
Thanks for pointing out this utility, Wolfgang, and
Best wishes
Idris
--
Idris Samawi Hamid
Professor of Philosophy
Colorado State University
Fort Collins, CO 80523
On Sun, 01 Feb 2015 15:11:48 -0700, Wolfgang Schuster wrote:
On 01.02.2015 at 22:32, Idris Samawi Hamid ادريس سماوي حامد wrote:
words shorter than four letters aren’t taken into account.
I get *some* words shorter than four letters in the output, so there must be some other logic going on…
Do you have a few examples?
A quick one:
=======
\setupspellchecking[state=start,method=2]
\starttext
Dār is the Arabic word for home.
\stoptext
=======
--
Idris Samawi Hamid
Professor of Philosophy
Colorado State University
Fort Collins, CO 80523
Hello All,
As a linguist, I can say that not counting the shorter words is an absolute no-go for an accurate word count, and thereby for a character count! See below for a non-representative proof!
On 01.02.2015 at 22:12, Wolfgang Schuster wrote:
[snip, snip]
ConTeXt has an option to count the words (you find the result in <jobname>.words) in a document but words words shorter than four letters aren’t taken into account.
word length under 4 characters: 10
word length >= 4 characters: 20
Here you are missing a third of the words! That is about 33%.
regards
Keith
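To reproduce a tally like the one above on any text, something along these lines might do. It is a rough sketch under assumptions: the function name is invented, punctuation is simply deleted before splitting on whitespace, and the length is taken by awk, which may count bytes rather than characters for non-ASCII text depending on the awk implementation and locale.

```shell
# Hypothetical helper: count words shorter than four characters versus the rest.
# Step 1: delete punctuation, so "bench." counts as 5 letters.
# Step 2: turn every run of whitespace into a newline (one word per line).
# Step 3: let awk sort the non-empty lines into the two buckets.
word_length_split() {
  tr -d '[:punct:]' |
    tr -s '[:space:]' '\n' |
    awk 'length($0) > 0 { if (length($0) < 4) n_short++; else n_long++ }
         END { printf "under 4: %d, 4 or more: %d\n", n_short, n_long }'
}

printf 'the cat sat on a long bench.\n' | word_length_split
# prints: under 4: 5, 4 or more: 2
```

Even in this tiny sample the short words are the majority, which is the point being made: dropping them gives a very different word count.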
On Mon, 2 Feb 2015 10:20:15 +0100, Keith Schultz wrote:
Hello All,
As a linguist, I can say that not counting words that are shorter is an absolute NO-GO for an accurate word count and thereby character count!
See below, for a non representative proof !
On 01.02.2015 at 22:12, Wolfgang Schuster wrote:
[snip, snip]
ConTeXt has an option to count the words (you find the result in <jobname>.words) in a document but words words shorter than four letters aren’t taken into account.
word length under 4 characters: 10
word length >= 4 characters: 20
Here you are missing a third of the words! That is about 33%.
regards Keith
See also: Zipf, G. K. (1949), "Human Behavior and the Principle of Least Effort", Cambridge, MA: Addison-Wesley. in particular, Chapter 2: On the Economy of Words. As well as: Shannon, C. E. (1951), "The redundancy of English", Cybernetics, 248-272. 54% for English, so we can afford to be sloppy (wch s wy txt compr qte ll). Alan
On 2/2/2015 4:39 PM, Alan BRASLAU wrote:
ConTeXt has an option to count the words (you find the result in <jobname>.words) in a document but words words shorter than four letters aren’t taken into account.
word length under 4 characters: 10
word length >= 4 characters: 20
Here you are missing a third of the words! That is about 33%.
this feature relates to (simple) spell checking and collecting words for dedicated spell check lists, and 4 chars is nearly always a valid word, which is why we discard them
Hans
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
On Mon, 2 Feb 2015 17:55:35 +0100, Hans Hagen wrote:
this feature relates to (simple) spell checking and collecting words for dedicated spell check lists, and 4 chars is nearly always a valid word, which is why we discard them
English is rich in "four-letter words"! Alan ;-)
On 2/1/2015 10:06 PM, Jörg Weger wrote:
Is the character count “wc --char <textfile>” returns with or without blank spaces? (Which is important for me.) “man wc” doesn’t talk about that.
I had hoped there was a better way than to edit the result of “pdftotext” in my text editor or in libreoffice writer (deleting unnecessary carriage returns and spaces by searching for regular expressions) which are able to do the count I need. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …
it's not too hard so maybe when i'm bored or see a good reason ..
Hans
-----------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
So I hope you might get bored once in a while before I have to write my bachelor thesis :)
Greetings
Jörg
On 02.02.2015 00:56, Hans Hagen wrote:
On 2/1/2015 10:06 PM, Jörg Weger wrote:
Is the character count “wc --char <textfile>” returns with or without blank spaces? (Which is important for me.) “man wc” doesn’t talk about that.
I had hoped there was a better way than to edit the result of “pdftotext” in my text editor or in libreoffice writer (deleting unnecessary carriage returns and spaces by searching for regular expressions) which are able to do the count I need. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …
it's not too hard so maybe when i'm bored or see a good reason ..
Hans
On 2015-02-01, at 22:06, Jörg Weger wrote:
Is the character count “wc --char <textfile>” returns with or without blank spaces? (Which is important for me.) “man wc” doesn’t talk about that.
I had hoped there was a better way than to edit the result of “pdftotext” in my text editor or in libreoffice writer (deleting unnecessary carriage returns and spaces by searching for regular expressions) which are able to do the count I need. In fact I had hoped that ConTeXt was able to count the characters and spaces it renders to PDF (is that theoretically possible?) …
I am pretty sure that you can make sed filter out blank characters. So then you can just chain pdftotext, sed and wc.
OTOH, here's a relevant question (and a simple answer) on SO. (It seems to count newlines, though.)
JFF, I've just coded this in Emacs Lisp:
--8<---------------cut here---------------start------------->8---
;; Count non-blank characters in a buffer
(defun how-many-visible-chars ()
  "Count visible (i.e., other than spaces, tabs and newlines) characters in the buffer."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        (forward-char)))
    (message "%d visible characters" count)))
--8<---------------cut here---------------end--------------->8---
It's terribly unoptimized, but I ran it on a 300+ kB file on my low-end netbook and it ran in something like 2 seconds, so it's not that bad in practice. Also, it's not well-coded: it should e.g. return the number instead of displaying the message when called non-interactively, it might take active region into account etc. - but as a proof-of-concept, it works surprisingly well (i.e., fast).
Greetings Jörg
Best,
--
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University
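The chain suggested above (pdftotext, then a filter, then a count) could be sketched as follows. This is only a sketch under assumptions: the function names and the file name thesis.pdf are placeholders, tr and awk are used here in place of sed, and every run of whitespace (including the line breaks and page breaks that pdftotext emits) is collapsed to a single space. The result is still only as accurate as the text extraction itself, and with a byte-oriented awk or a non-UTF-8 locale, non-ASCII characters may be counted as more than one.

```shell
# Hypothetical helpers; both read plain text on stdin.

# Characters including spaces: collapse every run of whitespace to one space,
# drop a single leading/trailing space, then print the remaining length.
chars_with_spaces() {
  tr -s '[:space:]' ' ' |
    awk '{ sub(/^ /, ""); sub(/ $/, ""); n += length($0) } END { print n + 0 }'
}

# Characters without spaces: delete all whitespace, then count what is left.
chars_without_spaces() {
  tr -d '[:space:]' | wc -m | tr -d ' '
}

# Usage with a PDF (placeholder file name); "-" makes pdftotext write to stdout:
#   pdftotext thesis.pdf - | chars_with_spaces
#   pdftotext thesis.pdf - | chars_without_spaces

printf 'Hello  world\nfoo\n' | chars_with_spaces      # prints 15
printf 'Hello  world\nfoo\n' | chars_without_spaces   # prints 13
```

Treating a line break as one space is a design choice: pdftotext breaks lines where the PDF does, so deleting the newlines outright would glue the last word of one line to the first word of the next and undercount the spaces.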
participants (8)
- Aditya Mahajan
- Alan BRASLAU
- Hans Hagen
- Idris Samawi Hamid ادريس سماوي حامد
- Jörg Weger
- Keith Schultz
- Marcin Borkowski
- Wolfgang Schuster