On 11/8/2012 2:05 AM, Reinhard Kotucha wrote:
Thank you, Hans. Here it's faster than reading the file at once but still slower than reading 8k blocks. It also consumes as much memory as reading the file at once (and memory consumption grows exponentially), but I could reduce memory consumption significantly by replacing
return table.concat(data)
with
return data
table.concat() keeps the file twice in memory, once as a table and once as a string.
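For illustration, here is a minimal sketch of the kind of blockwise reader being discussed; the name readall and the 2^13 chunk size are placeholders, not the actual code from the thread:

-- Sketch only: read a file in 8 KB chunks and collect them in a table.
-- Returning the table (instead of table.concat(data)) avoids holding the
-- content twice, but callers then get a list of chunks, not one string.
local function readall (filename)
    local fh = assert(io.open(filename, 'rb'))
    local data = { }
    while true do
        local chunk = fh:read(2^13)
        if not chunk then break end
        data[#data+1] = chunk
    end
    fh:close()
    return data            -- or: return table.concat(data)
end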
but if you want to compare the *all read with blockwise loading you need to do the concat, because otherwise you compare different things; it's the concat that is costly (more than twice as much as the loading itself)
Yes, memory consumption is a problem on my machine at work. I'm running Linux in a virtual machine under 32-bit Windows. Windows can only use 3 GB of memory and uses 800 MB itself. Though I can assign more than 3 GB to the VM, I suppose that I actually have less than 2.2 GB and the rest is provided by a swap file. Furthermore, multitasking/multiuser systems can only work if no program assumes that it's the only one running.
ah, but using a vm makes comparison problematic because in many cases a vm's file handling can be faster than on bare metal (tex uses one core only, but in a vm the second core kicks in for some management tasks)
Speed is important in many cases. And I think that if you're writing a function you want to use in various scripts, it's worthwhile to evaluate the parameters carefully.
sure, i do lots of speed/efficiency tests
The idea I had was to write a function which allows one to read a text file efficiently. It should also be flexible and easy to use.
yes, but keep in mind that there are many parameters that influence it, like caching (an initial make format after a fresh machine startup can for instance take 5 times longer than a successive one, and the same is true for this kind of test)
In Lua it's convenient to read a file either line by line or at once. Neither is efficient: the former is extremely slow when lines are short and the latter consumes a lot of memory. And in many cases you don't even need the content of the whole file.
line-based reading needs to parse lines; it's faster to read the whole file with "rb" and loop over lines with for s in string.gmatch(data, "(.-)\n") do or something similar
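Something along these lines, as a sketch (the file name is a placeholder, and note that a final line without a trailing newline is not matched by this pattern):

-- Sketch: read the whole file at once and iterate over lines without
-- using io.lines().
local fh = assert(io.open("somefile.txt", "rb"))
local data = fh:read("*all")
fh:close()
local n = 0
for s in string.gmatch(data, "(.-)\n") do
    n = n + 1
    -- process line s here
end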
What I have so far is a function which reads a block and [the rest of] a line within an endless loop. Each chunk is split into lines. It takes two arguments, the file name and a function. For each chunk, the function is run on each line. Thus I'm able to filter the data and not everything has to be stored in memory.
------------------------------------------------
#! /usr/bin/env texlua
--*- Lua -*-

function readfile (filename, fun)
    local lineno = 1
    local fh = assert(io.open(filename, 'r'))
    while true do
        -- read an 8 KB block plus the rest of the line it ends in
        local line, rest = fh:read(2^13, '*line')
        if not line then break end
        if rest then line = line..rest end
        -- split the chunk into lines and hand each one to the callback
        local tab = line:explode('\n')
        for i, v in ipairs(tab) do
            fun(v, lineno)
            lineno = lineno + 1
        end
    end
    fh:close()
end

function process_line (line, n)
    print(n, line)
end

readfile ('testfile', process_line)
you still store the exploded tab
------------------------------------------------
Memory consumption is either 8kB or the length of the longest line unless you store lines in a string or table. Almost no extra memory
you do store them, as the explode splits a max 2^13 chunk into lines
is needed if you manipulate each line somehow and write the result to another file. The only files I encountered which are really large are CSV-like files which contain rows and columns of numbers, but the function process_line() allows me to select only the rows and columns I want to pass to pgfplots, for example.
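For example, a hypothetical callback that thins out such a file and keeps only two columns might look like this (the name select_columns, the file name, the column choice and the every-tenth-row rule are made up for illustration):

-- Hypothetical filter: keep columns 1 and 3 of every tenth row of a
-- whitespace-separated file, e.g. to reduce data before pgfplots sees it.
local selected = { }

local function select_columns (line, n)
    if n % 10 == 0 then
        local columns = { }
        for field in line:gmatch("%S+") do
            columns[#columns+1] = field
        end
        selected[#selected+1] = { columns[1], columns[3] }
    end
end

readfile ('data.csv', select_columns)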
at my end 2^24 is the most efficient (in time) block size
I found out that 2^13 is most efficient. But I suppose that the most important thing is that it's an integer multiple of a filesystem data block. Since Taco provided os.type() and os.name(), it's possible to make the chunk size system dependent. But I fear that the actual hardware (SSD vs. magnetic disk) has a bigger impact than the OS.
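A rough way to compare chunk sizes on a given machine is a loop like the following sketch (the candidate sizes and the test file name are arbitrary; results depend on filesystem, caching and hardware, as noted below):

-- Sketch: time a plain blockwise read for a few candidate chunk sizes.
-- Run it more than once so that filesystem caching affects all sizes alike.
for _, size in ipairs { 2^12, 2^13, 2^16, 2^20, 2^24 } do
    local t0 = os.clock()
    local fh = assert(io.open('testfile', 'rb'))
    while fh:read(size) do end
    fh:close()
    print(size, os.clock() - t0)
end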
it's not os dependent but filesystem dependent and often disk sector dependent

here's one that does not need the split:

local chunksize = 2^13 -- needs to be larger than last line !
local chunksize = 2^12 -- quite okay

function processlinebyline(filename,action)
    local filehandle = io.open(filename,'rb')
    if not filehandle then
        return
    end
    local linenumber = 0
    local cursor     = 0
    local lastcursor = nil
    while true do
        filehandle:seek("set",cursor)
        if lastcursor == cursor then
            -- we can also end up here when a line is too long to fit in the
            -- buffer
            local line = filehandle:read(chunksize)
            if line then
                linenumber = linenumber + 1
                action(line,linenumber)
            end
            filehandle:close()
            return
        else
            local buffer = filehandle:read(chunksize)
            if not buffer then
                filehandle:close()
                return
            end
            local grab = string.gmatch(buffer,"([^\n\r]-)(\r?\n)")
            local line, eoline = grab()
            lastcursor = cursor
            while line do
                local next, eonext = grab()
                if next then
                    linenumber = linenumber + 1
                    if action(line,linenumber) then
                        filehandle:close()
                        return
                    end
                    cursor = cursor + #line + #eoline
                    line   = next
                    eoline = eonext
                    lastcursor = nil
                else
                    break
                end
            end
        end
    end
end

function processline(line,n)
    if n > 100 and n < 200 then
        print(n,#line,line)
     -- return true -- quits the loop
    end
end

processlinebyline('somefile.txt',processline)

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------