[NTG-context] PDF document statistics (character count incl. spaces)?

Marcin Borkowski mbork at wmi.amu.edu.pl
Mon Feb 2 21:45:25 CET 2015

On 2015-02-01, at 22:06, Jörg Weger <joerg73.muc at googlemail.com> wrote:

> Is the character count “wc --char <textfile>” returns with or without 
> blank spaces? (Which is important for me.) “man wc” doesn’t talk about that.
> I had hoped there was a better way than to edit the result of 
> “pdftotext” in my text editor or in libreoffice writer (deleting 
> unnecessary carriage returns and spaces by searching for regular 
> expressions) which are able to do the count I need. In fact I had hoped 
> that ConTeXt was able to count the characters and spaces it renders to 
> PDF (is that theoretically possible?) …

I am pretty sure that you can make sed filter out blank characters.  So
then you can just chain pdftotext, sed and wc.
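
A rough sketch of such a pipeline follows.  It is untested against real
ConTeXt output, and the function names and document.pdf are just
placeholders.  It uses tr instead of sed, because sed works line by line
and so is awkward for deleting the newlines themselves.  It assumes that
pdftotext writes to stdout when given "-" as the output file and that it
separates pages with form feeds.

```shell
# Count characters read from stdin, keeping spaces but dropping
# line breaks and form feeds (the page separators from pdftotext).
count_with_spaces() {
    tr -d '\n\r\f' | wc -m
}

# Count only visible characters: delete all whitespace first.
count_visible() {
    tr -d '[:space:]' | wc -m
}

# Usage (document.pdf is a placeholder file name):
#   pdftotext document.pdf - | count_with_spaces
#   pdftotext document.pdf - | count_visible
```

Note that wc -m counts characters according to the current locale, and
that deleting a line break also drops the space it may stand for, so the
count with spaces is only an approximation.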

OTOH, here's a relevant question (and a simple answer) on SO.  (It seems
to count newlines, though.)

JFF, I've just coded this in Emacs Lisp:

--8<---------------cut here---------------start------------->8---
;; Count non-blank characters in a buffer

(defun how-many-visible-chars ()
  "Count visible (i.e., other than spaces, tabs and newlines)
characters in the buffer."
  (interactive)
  (let ((count 0))
    ;; Restore point after scanning the buffer.
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        ;; Advance point; without this the loop never terminates.
        (forward-char 1)))
    (message "%d visible characters" count)))
--8<---------------cut here---------------end--------------->8---

It's terribly unoptimized, but I ran it on a 300+ kB file on my low-end
netbook and it ran in something like 2 seconds, so it's not that bad in
practice.  Also, it's not well-coded: e.g., it should return the number
instead of displaying the message when called non-interactively, it
could take the active region into account, etc.  But as a proof of
concept, it works surprisingly well (i.e., fast).

> Greetings Jörg


Marcin Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University
