[Dev-luatex] Memory leak in string.explode()?
Hans Hagen
pragma at wxs.nl
Thu Nov 8 11:36:37 CET 2012
On 11/8/2012 2:05 AM, Reinhard Kotucha wrote:
> Thank you, Hans. Here it's faster than reading the file at once but
> still slower than reading 8k blocks. It also consumes as much memory
> as reading the file at once (and memory consumption grows
> exponentially), but I could reduce memory consumption significantly
> by replacing
>
> return table.concat(data)
>
> with
>
> return data
>
> table.concat() keeps the file twice in memory, once as a table and
> once as a string.
but if you want to compare the "*all" read with blockwise loading you
need to do the concat, because otherwise you compare different things;
it's the concat that is costly (more than twice as much as the loading)
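for reference, a minimal sketch of the two variants being compared might
look like this (the function name, file name and default block size are
just placeholders); returning the table avoids the extra copy, while the
concat produces one big string but briefly keeps the data twice:

local function loadblockwise(filename, blocksize)
    local fh = assert(io.open(filename, "rb"))
    local data = { }
    while true do
        local block = fh:read(blocksize or 2^13)
        if not block then
            break
        end
        data[#data+1] = block
    end
    fh:close()
    return table.concat(data) -- or: return data, the table of blocks
end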
> Yes, memory consumption is a problem on my machine at work. I'm
> running Linux in a virtual machine under 32-bit Windows. Windows can
> only use 3GB of memory and uses 800MB itself. Though I can assign
> more than 3GB to the VM, I suppose that I actually have less than
> 2.2GB and the rest is provided by a swap file. Furthermore, multi
> tasking/multi user systems can only work if no program assumes that
> it's the only one which is running.
ah, but using a vm makes comparisons problematic, because in many
cases a vm's file handling can be faster than on bare metal (tex uses
one core only, but in a vm the second core kicks in for some management
tasks)
> Speed is important in many cases. And I think that if you're writing
> a function you want to use in various scripts, it's worthwhile to
> evaluate the parameters carefully.
sure, i do lots of speed/efficiency tests
> The idea I had was to write a function that makes it possible to read
> a text file efficiently. It should also be flexible and easy to use.
yes, but keep in mind that there are many parameters that influence it,
like caching (an initial make format - fresh machine startup - can for
instance take 5 times longer than a subsequent one, and the same is
true for this kind of test)
> In Lua it's convenient to read a file either line-by-line or all at
> once. Neither is efficient: the former is extremely slow when lines
> are short and the latter consumes a lot of memory. And in many cases
> you don't even need the content of the whole file.
line based reading needs to parse lines; it's faster to read the whole
file with "rb" and loop over lines with
for s in string.gmatch(data,"(.-)\n") do
or something similar
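spelled out, such a loop might look like this (a sketch only, assuming
the whole file fits in memory; note that a last line without a trailing
newline is not caught by this pattern):

local fh = assert(io.open("somefile.txt", "rb"))
local data = fh:read("*all")
fh:close()
local linenumber = 0
for s in string.gmatch(data, "(.-)\n") do
    linenumber = linenumber + 1
    -- do something with line s here
end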
> What I have so far is a function which reads a block and [the rest of]
> a line within an endless loop. Each chunk is split into lines. It
> takes two arguments, the file name and a function. For each chunk,
> the function is run on each line. Thus I'm able to filter the data
> and not everything has to be stored in memory.
>
> ------------------------------------------------
> #! /usr/bin/env texlua
> --*- Lua -*-
>
> function readfile (filename, fun)
>     local lineno=1
>     local fh=assert(io.open(filename, 'r'))
>     while true do
>         local line, rest = fh:read(2^13, '*line')
>         if not line then break end
>         if rest then line = line..rest end
>         local tab = line:explode('\n')
>         for i, v in ipairs(tab) do
>             fun(v, lineno)
>             lineno=lineno+1
>         end
>     end
>     fh:close()
> end
>
> function process_line (line, n)
>     print(n, line)
> end
>
> readfile ('testfile', process_line)
you still store the exploded tab
> ------------------------------------------------
>
> Memory consumption is either 8kB or the length of the longest line
> unless you store lines in a string or table. Almost no extra memory
you do store them, as the explode splits each chunk of at most 2^13 bytes into lines
> is needed if you manipulate each line somehow and write the result to
> another file. The only files I encountered which are really large are
> CSV-like files which contain rows and columns of numbers, but the
> function process_line() allows me to select only the rows and columns
> I want to pass to pgfplots, for example.
>
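for illustration, a hypothetical process_line that keeps only two
columns of such a CSV-like file could look like this, using the
readfile above and luatex's string.explode (the column numbers,
separator and file name are made up):

function process_line (line, n)
    local columns = line:explode(',')
    if columns[1] and columns[3] then
        print(columns[1], columns[3]) -- e.g. pass these on to pgfplots instead
    end
end

readfile ('data.csv', process_line)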
> > at my end 2^24 is the most efficient (in time) block size
>
> I found out that 2^13 is most efficient. But I suppose that the most
> important thing is that it's an integer multiple of a filesystem data
> block. Since Taco provided os.type() and os.name(), it's possible
> to make the chunk size system dependent. But I fear that the actual
> hardware (SSD vs. magnetic disk) has a bigger impact than the OS.
it's not os dependent but filesystem dependent and often disk sector
dependent
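a quick way to check this on a given machine is to time a plain
blockwise read for a few candidate sizes (a sketch; the file name is a
placeholder and os.clock only gives a rough indication, caching effects
included):

local function timeread(filename, blocksize)
    local start = os.clock()
    local fh = assert(io.open(filename, "rb"))
    while true do
        local block = fh:read(blocksize)
        if not block then
            break
        end
    end
    fh:close()
    return os.clock() - start
end

for _, size in ipairs { 2^12, 2^13, 2^16, 2^20, 2^24 } do
    print(size, timeread("testfile", size))
end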
here's one that does not need the split
local chunksize = 2^13 -- needs to be larger than last line !
local chunksize = 2^12 -- quite okay

function processlinebyline(filename,action)
    local filehandle = io.open(filename,'rb')
    if not filehandle then
        return
    end
    local linenumber = 0
    local cursor = 0
    local lastcursor = nil
    while true do
        filehandle:seek("set",cursor)
        if lastcursor == cursor then
            -- we get here for the last line of the file; we can also end up
            -- here when a line is too long to fit in the buffer
            local line = filehandle:read(chunksize)
            if line then
                linenumber = linenumber + 1
                action(line,linenumber)
            end
            filehandle:close()
            return
        else
            local buffer = filehandle:read(chunksize)
            if not buffer then
                filehandle:close()
                return
            end
            local grab = string.gmatch(buffer,"([^\n\r]-)(\r?\n)")
            local line, eoline = grab()
            lastcursor = cursor
            while line do
                local next, eonext = grab()
                if next then
                    linenumber = linenumber + 1
                    if action(line,linenumber) then
                        filehandle:close()
                        return
                    end
                    -- move past the processed line so the next chunk starts there
                    cursor = cursor + #line + #eoline
                    line = next
                    eoline = eonext
                    lastcursor = nil
                else
                    break
                end
            end
        end
    end
end

function processline(line,n)
    if n > 100 and n < 200 then
        print(n,#line,line)
     -- return true -- quits the loop
    end
end

processlinebyline('somefile.txt',processline)
--
-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------