[Dev-luatex] Memory leak in string.explode()?

Reinhard Kotucha reinhard.kotucha at web.de
Sat Nov 10 00:02:27 CET 2012


On 2012-11-08 at 11:36:37 +0100, Hans Hagen wrote:

 > On 11/8/2012 2:05 AM, Reinhard Kotucha wrote:
 > 
 > > Thank you, Hans.  Here it's faster than reading the file at once
 > > but still slower than reading 8k blocks.  It also consumes as
 > > much memory as reading the file at once (and memory consumption
 > > grows exponentially), but I could reduce memory consumption
 > > significantly by replacing
 > >
 > >    return table.concat(data)
 > >
 > > with
 > >
 > >    return data
 > >
 > > table.concat() keeps the file twice in memory, once as a table
 > > and once as a string.
 > 
 > but if you want to compare the *all with blockwise loading you need
 > to do the concat because otherwise you compare different things;
 > it's the concat that is costly (more than twice as much as the
 > loading)

Yes, I removed it in order to confirm that it's responsible for the
memory consumption.
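
Just so we're talking about the same thing, here's a minimal sketch of
the kind of blockwise reader I mean (the function name is mine; 2^13
is just the block size that worked best here):

------------------------------------------------
function readall_blockwise (filename)
   local fh = assert(io.open(filename, 'rb'))
   local data = {}
   while true do
      local block = fh:read(2^13)
      if not block then break end
      data[#data+1] = block
   end
   fh:close()
   -- table.concat(data) builds one big string, so for a moment the
   -- file content exists twice in memory; returning data instead
   -- avoids that, at the price of returning a table of blocks.
   return table.concat(data)
end
------------------------------------------------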
 
 > > Yes, memory consumption is a problem on my machine at work.  I'm
 > > running Linux in a virtual machine under 32-bit Windows.  Windows
 > > can only use 3GB of memory and uses 800MB itself.  Though I can
 > > assign more than 3GB to the VM, I suppose that I actually have
 > > less than 2.2GB and the rest is provided by a swap file.
 > > Furthermore, multi tasking/multi user systems can only work if no
 > > program assumes that it's the only one which is running.
 > 
 > ah, but using a vm makes comparisons problematic because in many
 > cases a vm's file handling can be faster than on bare metal (tex
 > uses one core only but in a vm the second core kicks in for some
 > management tasks)

Sorry, forgot to mention that I did all the comparisons on my 64-bit
Linux box with 4GB RAM at home.  Another problem at work is that I
failed to compile xosview under CentOS.  So I don't see when the
system is swapping, which might happen frequently on the VM.

 > > Speed is important in many cases.  And I think that if you're
 > > writing a function you want to use in various scripts, it's
 > > worthwhile to evaluate the parameters carefully.
 > 
 > sure, i do lots of speed/efficiency tests

I know.  However, I just installed Subversion and compiled the latest
SVN version of LuaTeX on my Raspberry Pi.  If you or anybody else is
interested in benchmarks, just send me your test files.

 > > The idea I had was to write a function which allows one to read a
 > > text file efficiently.  It should also be flexible and easy to
 > > use.
 > 
 > yes, but keep in mind that there are many parameters that
 > influence it, like caching (an initial make format - fresh machine
 > startup - can for instance take 5 times more time than a successive
 > one and the same is true for this kind of test)

As far as caching is concerned, I usually clear the cache first and
then run the script several times.  I also watch xosview in order to
make sure that no other processes interfere.  I think that an empty
cache is what you have after a fresh startup.  And the most important
thing is that no web browser is running when doing benchmarks.

 > > In Lua it's convenient to read a file either line-by-line or at
 > > once.  Neither is efficient.  The former is extremely slow when
 > > lines are short and the latter consumes a lot of memory.  And in
 > > many cases you don't even need the content of the whole file.
 > 
 > line based reading needs to parse lines; it's faster to read the
 > whole file with "rb" and loop over lines with
 > 
 > for s in string.gmatch(data, "(.-)\n") do
 > 
 > or something similar
 
Hmm, something similar is Taco's string.explode() function.  It's much
faster than Lua patterns, so I prefer it.  What I haven't considered
yet is that the separator can only be a single character, \r or \n, so
I have to know in advance which line endings the file uses.  But I
have some ideas about how to solve the problem.
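
For what it's worth, a quick way to compare the two is something like
the following sketch (the file name is made up and the numbers of
course depend on the machine and the test file):

------------------------------------------------
-- rough timing sketch: string.explode() vs. string.gmatch()
local fh = assert(io.open('testfile', 'rb'))
local data = fh:read('*all')
fh:close()

local t0 = os.clock()
local lines = data:explode('\n')       -- LuaTeX's C implementation
print('explode:', os.clock() - t0, #lines)

local t1 = os.clock()
local n = 0
for line in data:gmatch("(.-)\n") do   -- Lua pattern matching
   n = n + 1
end
print('gmatch: ', os.clock() - t1, n)
------------------------------------------------

The two counts can differ by one at the end of the file (a last line
without a trailing newline), but that doesn't affect the timing.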

 > > What I have so far is a function which reads a block and [the
 > > rest of] a line within an endless loop.  Each chunk is split into
 > > lines.  It takes two arguments, the file name and a function.
 > > For each chunk, the function is run on each line.  Thus I'm able
 > > to filter the data and not everything has to be stored in memory.
 > >
 > > ------------------------------------------------
 > > #! /usr/bin/env texlua
 > > --*- Lua -*-
 > >
 > > function readfile (filename, fun)
 > >    local lineno=1
 > >    local fh=assert(io.open(filename, 'r'))
 > >    while true do
 > >      local line, rest = fh:read(2^13, '*line')
 > >      if not line then break end
 > >      if rest then line = line..rest end
 > >      local tab = line:explode('\n')
 > >      for i, v in ipairs(tab) do
 > >        fun(v, lineno)
 > >        lineno=lineno+1
 > >      end
 > >    end
 > >    fh:close()
 > > end
 > >
 > > function process_line (line, n)
 > >    print(n, line)
 > > end
 > >
 > > readfile ('testfile', process_line)
 > 
 > you still store the exploded tab
 > 
 > > ------------------------------------------------
 > >
 > > Memory consumption is either 8kB or the length of the longest line
 > > unless you store lines in a string or table.  Almost no extra memory
 > 
 > you do store them as the explode splits a max 2^13 chunk into lines

Sure.  But as far as I can see it doesn't hurt.  The table is
overwritten whenever a new chunk is processed.  Thus, things don't
accumulate.  Strictly speaking, each chunk creates a new table; the
previous one simply becomes unreachable and is left to the garbage
collector.  And the garbage collector does a pretty good job here: the
function is very fast and memory consumption isn't even visible in
xosview.
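
One can actually watch this from within the script.  The following is
just a sketch (same 'testfile' as above); it prints Lua's own memory
usage after each chunk, which should stay in the same range if nothing
accumulates:

------------------------------------------------
-- sketch: report Lua's memory usage while reading blockwise
local fh = assert(io.open('testfile', 'r'))
local chunks = 0
while true do
   local line, rest = fh:read(2^13, '*line')
   if not line then break end
   if rest then line = line..rest end
   local tab = line:explode('\n')   -- previous table becomes garbage
   chunks = chunks + 1
   print(chunks, string.format('%.0f kB', collectgarbage('count')))
end
fh:close()
------------------------------------------------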

BTW, the fh:read(BUFFER, '*line') approach can be less efficient if
lines are extremely long, because '*line' then has to slurp the whole
rest of the line into memory at once...

 > > is needed if you manipulate each line somehow and write the result to
 > > another file.  The only files I encountered which are really large are
 > > CSV-like files which contain rows and columns of numbers, but the
 > > function process_line() allows me to select only the rows and columns
 > > I want to pass to pgfplots, for example.
 > >
 > >   > at my end 2^24 is the most efficient (in time) block size
 > >
 > > I found out that 2^13 is most efficient.  But I suppose that the most
 > > important thing is that it's an integer multiple of a filesystem data
 > > block.  Since Taco provided os.type() and os.name(), it's possible
 > > to make the chunk size system dependent.  But I fear that the actual
 > > hardware (SSD vs. magnetic disk) has a bigger impact than the OS.
 > 
 > it's not os dependent but filesystem dependent and often disk sector 
 > dependent
 > 
 > here's one that does not need the split
 
Well, it splits the file though:

  string.gmatch(buffer,"([^\n\r]-)(\r?\n)")

I suppose that the most promising approach is to use a pattern match
on the first chunk in order to determine the linebreak style, abort,
and then read the file again using Taco's function.
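
Something along these lines is what I have in mind; it's only a sketch
and the function name is made up:

------------------------------------------------
-- sketch: detect the linebreak style from the first chunk
function detect_linebreak (filename)
   local fh = assert(io.open(filename, 'rb'))
   local chunk = fh:read(2^13)
   fh:close()
   if not chunk then
      return '\n'                       -- empty file, pick a default
   elseif chunk:find('\r\n', 1, true) then
      return '\r\n'
   elseif chunk:find('\r', 1, true) then
      return '\r'
   else
      return '\n'
   end
end
------------------------------------------------

For '\r\n' files one would probably still explode on '\n' and strip
the trailing '\r' from each line, since as far as I can see explode
takes a single separator character; whether that costs noticeably more
is something I'd have to measure.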

Anyway, our discussion is obviously off-topic here.  Hans, I'll inform
you about the results by private mail.  If anybody else is interested
in the results, just drop me a line.

Regards,
  Reinhard

-- 
----------------------------------------------------------------------------
Reinhard Kotucha                                      Phone: +49-511-3373112
Marschnerstr. 25
D-30167 Hannover                              mailto:reinhard.kotucha at web.de
----------------------------------------------------------------------------
Microsoft isn't the answer. Microsoft is the question, and the answer is NO.
----------------------------------------------------------------------------

