[Dev-luatex] Memory leak in string.explode()?

Hans Hagen pragma at wxs.nl
Thu Nov 8 11:36:37 CET 2012


On 11/8/2012 2:05 AM, Reinhard Kotucha wrote:

> Thank you, Hans.  Here it's faster than reading the file at once but
> still slower than reading 8k blocks.  It also consumes as much memory
> as reading the file at once (and memory consumption grows
> exponentially), but I could reduce memory consumption significantly
> by replacing
>
>    return table.concat(data)
>
> with
>
>    return data
>
> table.concat() keeps the file twice in memory, once as a table and
> once as a string.

but if you want to compare the *all read with blockwise loading you need
to do the concat, because otherwise you compare different things; it's
the concat that is costly (more than twice as much as the loading)
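
for illustration, a minimal sketch of the two variants being compared
(the file name is just a placeholder):

local filename = 'somefile.txt' -- hypothetical test file

-- variant 1: read the whole file at once
local function loadall(name)
    local f = assert(io.open(name,'rb'))
    local data = f:read('*all')
    f:close()
    return data
end

-- variant 2: read 8k blocks and concatenate afterwards; the concat is
-- the costly step because the data then lives in memory twice, once as
-- a table of chunks and once as the resulting string
local function loadblockwise(name)
    local f = assert(io.open(name,'rb'))
    local t = { }
    while true do
        local chunk = f:read(2^13)
        if not chunk then break end
        t[#t+1] = chunk
    end
    f:close()
    return table.concat(t)
end

print(#loadall(filename) == #loadblockwise(filename)) -- same result, different cost

returning t instead of table.concat(t) avoids the second copy, but then
the two variants no longer return the same kind of value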

> Yes, memory consumption is a problem on my machine at work.  I'm
> running Linux in a virtual machine under 32-bit Windows.  Windows can
> only use 3GB of memory and uses 800MB itself.  Though I can assign
> more than 3GB to the VM, I suppose that I actually have less than
> 2.2GB and the rest is provided by a swap file.  Furthermore,
> multitasking/multi-user systems can only work if no program assumes that
> it's the only one which is running.

ah, but using a vm makes comparisons problematic because in many
cases a vm's file handling can be faster than on bare metal (tex uses
one core only, but in a vm a second core kicks in for some management
tasks)

> Speed is important in many cases.  And I think that if you're writing
> a function you want to use in various scripts, it's worthwhile to
> evaluate the parameters carefully.

sure, i do lots of speed/efficiency tests

> The idea I had was to write a function which allows one to read a text
> file efficiently.  It should also be flexible and easy to use.

yes, but keep in mind that there are many parameters that influence it,
like caching (an initial make format - fresh machine startup - can for
instance take 5 times longer than a successive one, and the same is
true for this kind of test)

> In Lua it's convenient to read a file either line-by-line or at once.
> Both are not efficient.  The first is extremely slow when lines are
> short and the latter consumes a lot of memory.  And in many cases you
> don't even need the content of the whole file.

line based reading needs to parse lines; it's faster to read the whole
file with "rb" and loop over the lines with

for s in string.gmatch(data, "(.-)\n") do

or something similar
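
spelled out as a complete sketch ('somefile.txt' is just an example
name):

local f = assert(io.open('somefile.txt','rb'))
local data = f:read('*all')
f:close()
for line in string.gmatch(data,'(.-)\n') do
    -- each line arrives without its newline; a final line lacking a
    -- trailing newline is not matched by this simple pattern
    print(line)
end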

> What I have so far is a function which reads a block and [the rest of]
> a line within an endless loop.  Each chunk is split into lines.  It
> takes two arguments, the file name and a function.  For each chunk,
> the function is run on each line.  Thus I'm able to filter the data
> and not everything has to be stored in memory.
>
> ------------------------------------------------
> #! /usr/bin/env texlua
> --*- Lua -*-
>
> function readfile (filename, fun)
>    local lineno=1
>    local fh = assert(io.open(filename, 'r'))
>    while true do
>      local line, rest = fh:read(2^13, '*line')
>      if not line then break end
>      if rest then line = line..rest end
>      local tab = line:explode('\n')
>      for i, v in ipairs(tab) do
>        fun(v, lineno)
>        lineno=lineno+1
>      end
>    end
>    fh:close()
> end
>
> function process_line (line, n)
>    print(n, line)
> end
>
> readfile ('testfile', process_line)

you still store the exploded tab

> ------------------------------------------------
>
> Memory consumption is either 8kB or the length of the longest line
> unless you store lines in a string or table.  Almost no extra memory

you do store them, as explode splits each chunk of at most 2^13 bytes
into a table of lines

> is needed if you manipulate each line somehow and write the result to
> another file.  The only files I encountered which are really large are
> CSV-like files which contain rows and columns of numbers, but the
> function process_line() allows me to select only the rows and columns
> I want to pass to pgfplots, for example.
>
>   > at my end 2^24 is the most efficient (in time) block size
>
> I found out that 2^13 is most efficient.  But I suppose that the most
> important thing is that it's an integer multiple of a filesystem data
> block.  Since Taco provided os.type() and os.name(), it's possible
> to make the chunk size system dependent.  But I fear that the actual
> hardware (SSD vs. magnetic disk) has a bigger impact than the OS.

it's not os dependent but filesystem dependent, and often disk sector
dependent
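
a rough way to measure this on a given machine is to time a plain
blockwise read for several chunk sizes (a sketch; os.clock measures cpu
time, and as said above the first, uncached run can take much longer
than a successive one):

local function timeread(name,chunksize)
    local t0 = os.clock()
    local f = assert(io.open(name,'rb'))
    while true do
        local chunk = f:read(chunksize)
        if not chunk then break end
    end
    f:close()
    return os.clock() - t0
end

for _, n in ipairs { 2^12, 2^13, 2^16, 2^20, 2^24 } do
    print(n, timeread('somefile.txt',n)) -- hypothetical test file
end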


here's one that does not need the split

local chunksize = 2^13 -- needs to be larger than the longest line !
local chunksize = 2^12 -- quite okay

function processlinebyline(filename,action)
     local filehandle = io.open(filename,'rb')
     if not filehandle then
         return
     end
     local linenumber = 0
     local cursor = 0
     local lastcursor = nil
     while true do
         filehandle:seek("set",cursor)
         if lastcursor == cursor then
             -- we can also end up here when a line is too long to fit
             -- in the buffer
             local line = filehandle:read(chunksize)
             if line then
                 linenumber = linenumber + 1
                 action(line,linenumber)
             end
             filehandle:close()
             return
         else
             local buffer = filehandle:read(chunksize)
             if not buffer then
                 filehandle:close()
                 return
             end
             local grab = string.gmatch(buffer,"([^\n\r]-)(\r?\n)")
             local line, eoline = grab()
             lastcursor = cursor
             while line do
             local nextline, eonext = grab() -- look ahead: only a line with a successor is surely complete
             if nextline then
                 linenumber = linenumber + 1
                 if action(line,linenumber) then
                     filehandle:close()
                     return
                 end
                 cursor = cursor + #line + #eoline
                 line = nextline
                 eoline = eonext
                     lastcursor = nil
                 else
                     break
                 end
             end
         end
     end
end

function processline(line,n)
     if n > 100 and n < 200 then
         print(n,#line,line)
         -- return true -- quits the loop
     end
end

processlinebyline('somefile.txt',processline)
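
as an aside, a filtering action like the csv case mentioned earlier
could look like this (the file name and column layout are made up):

-- pass on only columns 1 and 3 of every 10th row
function filtercsv(line,n)
    if n % 10 == 0 then
        local fields = { }
        for field in string.gmatch(line,'([^,]+)') do -- empty fields are skipped
            fields[#fields+1] = field
        end
        print(fields[1],fields[3])
    end
end

processlinebyline('data.csv',filtercsv)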


-- 

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
                                              | www.pragma-pod.nl
-----------------------------------------------------------------

