Hi,

While downloading URLs, ConTeXt sanitizes the filename but does not check the length of the URL. So one can end up with a filename that is too long for the operating system to handle. For example, the following fails on 32-bit Linux:

\enabletrackers[resolvers.schemes]
\startluacode
local report_webfilter = logs.new("thirddata.webfilter")
local url = "http://www.bing.com/search?q=Areallyreallylongstringjusttoseehowthingsworkor..."
local specification = resolvers.splitmethod(url)
local file = resolvers.finders['http'](specification) or ""
if file and file ~= "" then
  report_webfilter("saving file %s", file)
else
  report_webfilter("download failed")
end
\stopluacode
\normalend

Is there a robust way to avoid this problem? One possibility is that in data-sch.lua, instead of

local cleanname = gsub(original,"[^%a%d%.]+","-")

use

local cleanname = md5sum(original)

What do you think?

Aditya
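A minimal sketch of why the sanitized name can grow too long, written in plain Lua outside ConTeXt. The gsub pattern is the one quoted from data-sch.lua; the 255-byte limit, the example.com URL, and the helper names strip_cleanname and fits are assumptions for illustration only. The thread does not state which limit caused the failure.

```lua
-- Assumed per-filename limit; 255 bytes is common on Linux filesystems,
-- but the limit that actually triggered the failure is not given in the thread.
local NAME_MAX = 255

-- Same substitution as the line quoted from data-sch.lua:
-- every run of characters that is not a letter, digit or dot becomes "-".
local function strip_cleanname(original)
  return (string.gsub(original, "[^%a%d%.]+", "-"))
end

-- Hypothetical helper: does the name fit the assumed limit?
local function fits(name)
  return #name <= NAME_MAX
end

local short_url = "http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png"
-- Made-up long URL, only to force a long sanitized name.
local long_url  = "http://www.example.com/search?q=" .. string.rep("a", 300)

local short_name = strip_cleanname(short_url)
local long_name  = strip_cleanname(long_url)

print(short_name, fits(short_name))
print(#long_name, fits(long_name))
```

The sanitized name is never shorter than the alphanumeric part of the URL, so its length is unbounded unless something like a hash is used.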
On Sun, 16 Jan 2011, Aditya Mahajan wrote:
Is there a robust way to avoid this problem? One possibility is that in data-sch.lua instead of
local cleanname = gsub(original,"[^%a%d%.]+","-")
use
local cleanname = md5.HEX(original) -- gsub(original,"[^%a%d%.]+","-")

appears to work correctly in my tests. The drawback of this scheme is that instead of

\externalfigure[url ending with .png]

one would have to use

\externalfigure[url ending with .png][method=png]

But \input 'url ending with .tex' still works.

The other drawback is that the filenames in the cache will be gibberish. But on the plus side, you can use long URLs.

Do you think that the drawbacks outweigh the gains?

I need this for the webfilter module, where the URL can get pretty long. I can always write my own http_get function, but that would mostly be a repetition of data-sch.lua.

Aditya
On 21-1-2011 6:15, Aditya Mahajan wrote:
local cleanname = md5.HEX(original) -- gsub(original,"[^%a%d%.]+","-")
appears to work correctly in my tests. The drawback of this scheme is that instead of
\externalfigure[url ending with .png]
one would have to use
\externalfigure[url ending with .png][method=png]
But \input 'url ending with .tex' still works
The other drawback is the filenames in the cache will be gibberish. But on the plus side, you can use long urls.
Do you think that the drawbacks outweigh the gains?
What exactly do you mean with the suffix issue? We can probably normalize things a bit.

Concerning the gibberish ... we can put a file alongside with some info.

I need to think a bit about it but indeed it makes no sense to have redundant mechanisms.

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
On Fri, 21 Jan 2011, Hans Hagen wrote:
What exactly do you mean with the suffix issue?
Consider

\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png]

The current implementation downloads this file as

<path-to-current-cache>/http-contextgarden.files.wordpress.com-2008-08-logo-alt41.png

Then \externalfigure sees a file with a .png extension and correctly includes it. If you follow my suggestion, the file will be downloaded as

<path-to-current-cache>/667816068B899068327DA1EF013B3943

Then \externalfigure sees a file with no extension, assumes that the file is a PDF file, and the figure inclusion fails. To correct that, you need to add [method=png] to \externalfigure.
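The suffix issue can be sketched in plain Lua, outside ConTeXt. The two file names are the ones given above; suffix_of and with_suffix are hypothetical helpers, and this does not reproduce how \externalfigure actually detects the file type.

```lua
-- Hypothetical helper: return the trailing extension of a file name, or "".
local function suffix_of(name)
  return name:match("%.([%a%d]+)$") or ""
end

-- Hypothetical helper: append a suffix only when the name has none yet.
local function with_suffix(name, suffix)
  if suffix ~= "" and suffix_of(name) == "" then
    return name .. "." .. suffix
  end
  return name
end

-- The two cache names discussed in the message.
local stripped = "http-contextgarden.files.wordpress.com-2008-08-logo-alt41.png"
local hashed   = "667816068B899068327DA1EF013B3943"

print(suffix_of(stripped))        -- the sanitized name keeps its extension
print(suffix_of(hashed) == "")    -- the bare hash has none
print(with_suffix(hashed, "png")) -- re-attaching it restores type detection by suffix
```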
We can probably normalize things a bit.
Agreed. Perhaps the best option will be a file name like

http-contextgarden.files.wordpress.com-667816068B899068327DA1EF013B3943.png

(so normalized base URL + md5sum of the URL + extension). I am not sure whether extensions can be determined reliably from URLs. In particular, imagine something like

http://www.bing.com/search?q=check+.extension+long+url+so+that+os+filename+l.......

A simple algorithm will assume that everything following the dot is the extension, while that is certainly not the case here. We can definitely restrict the search for the extension to the last 10 or so characters of the URL, but there will be cases where such heuristics fail.
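One way to sketch the "base URL + hash + extension" naming in plain Lua is to look for the extension only in the path part of the URL, so that dots in the query string are ignored. This is an illustration under assumptions, not the ConTeXt implementation: standin_hash is a simple djb2 hash used in place of md5 so that the example runs without libraries, and split_url, suffix_of and composite_name are hypothetical helpers. The query URL below is a shortened, made-up variant of the one above.

```lua
-- Stand-in for md5: a djb2 hash rendered as 8 hex digits (NOT md5).
local function standin_hash(s)
  local h = 5381
  for i = 1, #s do
    h = (h * 33 + s:byte(i)) % 4294967296
  end
  return string.format("%08x", h)
end

-- Hypothetical helper: split a URL into scheme, host and path (query/fragment dropped).
local function split_url(url)
  local scheme, host, rest = url:match("^(%a[%w+.-]*)://([^/?#]*)(.*)$")
  if not scheme then
    return nil
  end
  return scheme, host, rest:match("^[^?#]*")
end

-- Hypothetical helper: extension of the last path segment, or "".
local function suffix_of(path)
  local last = path:match("([^/]*)$")
  return last:match("%.([%a%d]+)$") or ""
end

local function composite_name(url)
  local scheme, host, path = split_url(url)
  if not scheme then
    return standin_hash(url) -- fall back to the bare hash
  end
  local name = scheme .. "-" .. host .. "-" .. standin_hash(url)
  local suffix = suffix_of(path)
  if suffix ~= "" then
    name = name .. "." .. suffix
  end
  return name
end

local figure_name = composite_name("http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png")
local query_name  = composite_name("http://www.bing.com/search?q=check+.extension+long")

print(figure_name)
print(query_name)
```

Restricting the search to the path avoids the false ".extension" match, but it still cannot tell what a server returns for a URL without any suffix, so a heuristic of this kind remains incomplete.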
Concerning the gibberish ... we can put a file alongside with some info. I need to think a bit about it but indeed it makes no sense to have redundant mechanisms.
Thanks, Aditya
On 22-1-2011 1:20, Aditya Mahajan wrote:
A simple algorithm will assume that everything following the dot is the extension, while that is certainly not the case here. We can definitely restrict the search for the extension to the last 10 or so characters of the URL, but there will be cases where such heuristics fail.
it's not that complicated ... say that you patch this way:

function schemes.cleanname(specification)
    return (gsub(specification.original,"[^%a%d%.]+","-"))
end

local function fetch(specification)
    local original = specification.original
    local scheme = specification.scheme
    local cleanname = schemes.cleanname(specification)

that will be the current method. Now you can experiment with:

\startluacode
function resolvers.schemes.cleanname(specification)
    local name = specification.original
    local hash = file.addsuffix(md5.hex(name),file.suffix(specification.path))
    logs.simple("%s => %s",name,hash)
    return hash
end
\stopluacode

Just see how that works out

Hans
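The hook pattern described above can be tried in plain Lua without ConTeXt. In this sketch the schemes table, the simplified fetch, the path field of the specification, and standin_hash (a djb2 hash standing in for md5.hex) are assumptions for illustration; only the gsub pattern comes from the message.

```lua
-- Stand-in for md5.hex: a djb2 hash rendered as 8 hex digits (NOT md5).
local function standin_hash(s)
  local h = 5381
  for i = 1, #s do
    h = (h * 33 + s:byte(i)) % 4294967296
  end
  return string.format("%08x", h)
end

local schemes = {}

-- Default behaviour: the substitution quoted in the thread.
function schemes.cleanname(specification)
  return (string.gsub(specification.original, "[^%a%d%.]+", "-"))
end

-- Simplified stand-in for the real fetch: it only computes the cache name.
-- The lookup through the table happens at call time, so a later
-- redefinition of schemes.cleanname is picked up.
local function fetch(specification)
  return schemes.cleanname(specification)
end

-- Hypothetical specification; the real one comes from resolvers.splitmethod.
local spec = {
  original = "http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png",
  path     = "/2008/08/logo-alt41.png",
}

local before = fetch(spec)

-- Experimental override: hash of the URL plus the suffix taken from the path.
function schemes.cleanname(specification)
  local name = standin_hash(specification.original)
  local suffix = specification.path:match("%.([%a%d]+)$")
  return suffix and (name .. "." .. suffix) or name
end

local after = fetch(spec)

print(before)
print(after)
```

Because fetch reaches the function through the table rather than through a captured local, user code can swap the naming scheme without touching fetch itself.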
On Sun, 23 Jan 2011, Hans Hagen wrote:
it's not that complicated ... say that you patch this way:
function schemes.cleanname(specification)
    return (gsub(specification.original,"[^%a%d%.]+","-"))
end

local function fetch(specification)
    local original = specification.original
    local scheme = specification.scheme
    local cleanname = schemes.cleanname(specification)
that will be the current method. Now you can experiment with:
Can cleanname be passed as a parameter of the specification? Then we can have

local cleanname = specification.cleanname or schemes.cleanname(specification)

This way, I can change the cleanname only for the files that are downloaded by my module, without affecting the cleanname for any other command that might want to download a file.

Aditya
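A small plain-Lua sketch of the proposed fallback. The `or` expression is the one suggested above; the schemes table, resolve_cleanname, the example.com URLs and the name webfilter-0001.tex are hypothetical.

```lua
local schemes = {}

-- Global default: the substitution quoted earlier in the thread.
function schemes.cleanname(specification)
  return (string.gsub(specification.original, "[^%a%d%.]+", "-"))
end

-- Hypothetical wrapper around the proposed line: a name supplied in the
-- specification wins, otherwise the global default is used.
local function resolve_cleanname(specification)
  return specification.cleanname or schemes.cleanname(specification)
end

-- A caller that does not care about the cache name.
local plain = { original = "http://www.example.com/a/b.tex" }

-- A module that supplies its own (made-up) short name for its downloads.
local custom = {
  original  = "http://www.example.com/search?q=" .. string.rep("a", 300),
  cleanname = "webfilter-0001.tex",
}

local plain_name  = resolve_cleanname(plain)
local custom_name = resolve_cleanname(custom)

print(plain_name)
print(custom_name)
```

The override is scoped to a single specification, so other commands that download files keep the default naming.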
On 23-1-2011 9:34, Aditya Mahajan wrote:
Can cleanname be passed as a parameter of the specification? Then we can have
local cleanname = specification.cleanname or schemes.cleanname(specification)
This way, I can only change the cleanname of the files that are downloaded by my module without affecting the cleanname for any other command that might want to download a file.
I made this ... as this is rather specialized tuning (that might confuse users) it's a directive:

\starttext

\enabletrackers [resolvers.schemes]
\enabledirectives[schemes.cleanmethod=md5]

\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]

\stoptext

currently 'strip' is default but we can decide on md5

Hans
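How a setting such as schemes.cleanmethod could select between the two naming schemes can be sketched in plain Lua as a dispatch table. This is only an illustration: how ConTeXt registers and handles directives is not shown in the thread, so setcleanmethod is a hypothetical stand-in for the directive handler, standin_hash is a djb2 hash used in place of md5, and the path field of the specification is an assumption.

```lua
-- Stand-in for md5: a djb2 hash rendered as 8 hex digits (NOT md5).
local function standin_hash(s)
  local h = 5381
  for i = 1, #s do
    h = (h * 33 + s:byte(i)) % 4294967296
  end
  return string.format("%08x", h)
end

-- One cleaner per method name.
local cleaners = {}

function cleaners.strip(specification)
  return (string.gsub(specification.original, "[^%a%d%.]+", "-"))
end

function cleaners.md5(specification)
  local name = standin_hash(specification.original)
  local suffix = (specification.path or ""):match("%.([%a%d]+)$")
  return suffix and (name .. "." .. suffix) or name
end

-- 'strip' is the default, as stated in the message.
local cleanmethod = "strip"

-- Hypothetical stand-in for the directive handler; unknown names are rejected.
local function setcleanmethod(name)
  if cleaners[name] then
    cleanmethod = name
    return true
  end
  return false
end

local function cleanname(specification)
  return cleaners[cleanmethod](specification)
end

local spec = {
  original = "http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png",
  path     = "/2008/08/logo-alt41.png",
}

local stripped = cleanname(spec)          -- default method
local accepted = setcleanmethod("md5")    -- switch, as the directive would
local hashed   = cleanname(spec)
local rejected = setcleanmethod("sha999") -- unknown method, setting unchanged

print(stripped)
print(hashed)
```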
On Sun, 23 Jan 2011, Hans Hagen wrote:
I made this ... as this is rather specialized tuning (that might confuse users) it's a directive:
\starttext
\enabletrackers [resolvers.schemes]
\enabledirectives[schemes.cleanmethod=md5]

\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\externalfigure[http://contextgarden.files.wordpress.com/2008/08/logo-alt41.png][width=3cm]
\stoptext
currently 'strip' is default but we can decide on md5
Thanks. I'll test it with my module. Aditya
participants (2)
- Aditya Mahajan
- Hans Hagen