idea: Module to automatically extract and insert information from Wikipedia
Dear ConTeXt folks,

just now I thought of the following, and I am wondering whether a solution already exists.

When writing a text that mentions people, I want to add information about these people as footnotes. The first sentence of a Wikipedia article is good enough for that most of the time.

A macro `\infofromwikipedia{Donald Knuth}` would be nice: it would fetch the first sentence of the article and put an item into the bibliography.

There is even an API to access articles [2]. Besides coding that up, I see the following problems.

1. The output [3] needs to be converted to ConTeXt.
2. An Internet connection would be necessary. But that is just a note and not a problem.

Thanks,

Paul

[1] https://en.wikipedia.org/wiki/Donald_Knuth
[2] http://www.mediawiki.org/wiki/API
[3] http://www.mediawiki.org/wiki/API:Data_formats#Output
Hi Paul, On 2011-11-12 16:19, Paul Menzel wrote:
A macro `\infofromwikipedia{Donald Knuth}` would be nice which gets the first sentence of the article and puts an item into the bibliography.
There is even an API to access articles [2]. Besides coding that up I see the following problems.
1. The output [3] needs to be converted to ConTeXt. 2. An Internet connection would be necessary. But that is just a note and not a problem.
You could take this as a starting point:

https://bitbucket.org/phg/context-acceptor/

and implement a function that ignores everything but the first text paragraph. Autodownload should work for the English WP. (I’m sorry I have no time to do this myself atm.)

Btw., as “sentence” is not a markup category of wikitext, there is no sentence recognition built in ... ymmv.

(Beware that processing wikitext from WP is extremely complicated because WP uses special plugins (“templates” and stuff). So the only way to make sure that a parser accepts any well-formed WP page would be to include all those plugins, which would entail rewriting the PHP code in Lua for use as a ConTeXt script. And then you’d have to decide for every plugin what its output should look like in ConTeXt.[0] If you have the time ...)

Good luck,
Philipp

[0] Get an impression of how much work this can be at http://en.wikipedia.org/wiki/Wikipedia:List_of_templates
The more important ones are at http://en.wikipedia.org/wiki/Category:Infobox_templates
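Since wikitext has no notion of a sentence, any first-sentence extraction has to be a heuristic on plain text. The following is only a minimal Python sketch of such a heuristic, not part of context-acceptor; the function name `first_sentence` and the abbreviation list are assumptions made for illustration, and the rule (terminator followed by whitespace and a capital letter) will misfire on many real articles.

```python
import re

# Hypothetical, deliberately short list of abbreviations that should not
# be treated as the end of a sentence.
ABBREVIATIONS = ("e.g.", "i.e.", "Dr.", "Prof.", "St.", "Jr.", "Sr.")


def first_sentence(paragraph: str) -> str:
    """Return a best-effort first sentence of a plain-text paragraph."""
    # Collapse line breaks and repeated spaces.
    text = " ".join(paragraph.split())
    # Candidate boundary: '.', '!' or '?' followed by whitespace and a capital.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        candidate = text[: match.end()]
        if candidate.endswith(ABBREVIATIONS):
            continue  # e.g. "Dr. Smith"
        if re.search(r"\b[A-Z]\.$", candidate):
            continue  # a single initial, as in "Donald E. Knuth"
        return candidate
    # No boundary found: return the whole paragraph.
    return text
```

For example, `first_sentence("Donald E. Knuth is an American computer scientist. He wrote TAOCP.")` stops after "scientist." rather than after the initial "E.".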
On Sat, Nov 12, 2011 at 05:31:23PM +0100, Philipp Gesang wrote:
(Beware that processing wikitext from WP is extremely complicated because WP uses special plugins (“templates” and stuff). So the only way to make sure that a parser accepts any well-formed WP page would be to include all those plugins, which would entail rewriting the PHP code in Lua for use as a ConTeXt script. And then you’d have to decide for every plugin what its output should look like in ConTeXt.[0] If you have the time ...)
I think scraping the MediaWiki-generated HTML would be simpler. Regards, Khaled
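A minimal Python sketch of the scraping idea, using only the standard library. It assumes that the lead paragraph is the first non-empty `<p>` element in the page HTML, which may not hold for every article; the names `FirstParagraph` and `first_paragraph` are made up for this example, and downloading the page is left out.

```python
from html.parser import HTMLParser

# Elements whose text is dropped inside the paragraph
# (e.g. footnote markers such as "[1]" rendered in <sup>).
SKIPPED = {"sup", "style", "script"}


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_p = False    # True while inside a <p>
        self.skip = 0        # nesting depth of skipped elements
        self.chunks = []     # text pieces of the current paragraph
        self.result = None   # set once a non-empty paragraph is closed

    def handle_starttag(self, tag, attrs):
        if self.result is not None:
            return
        if tag == "p":
            self.in_p = True
            self.chunks = []
        elif self.in_p and tag in SKIPPED:
            self.skip += 1

    def handle_endtag(self, tag):
        if self.result is not None:
            return
        if tag in SKIPPED and self.skip:
            self.skip -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            text = " ".join("".join(self.chunks).split())
            if text:
                self.result = text

    def handle_data(self, data):
        if self.in_p and not self.skip and self.result is None:
            self.chunks.append(data)


def first_paragraph(html: str) -> str:
    """Return the plain text of the first non-empty paragraph, or ''."""
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return parser.result or ""
```

Links and bold markup inside the paragraph are flattened to their text, so the result is plain text that still needs escaping before it is handed to ConTeXt.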
On 12-11-2011 17:40, Khaled Hosny wrote:
I think scraping the MediaWiki-generated HTML would be simpler.
Doesn't it also depend on the first line being recognizable as such?

Hans
On Sat, 12 Nov 2011, Paul Menzel wrote:
just now I thought of the following and I am wondering if there exists already a solution.
Not exactly for Wikipedia, but I have an experimental module that pulls information from the web. I use it to get images from sites like yuml.me and websequencediagrams.com.

https://github.com/adityam/context-webfilter

See the test/ directory for examples.
Writing a text which includes people I want to add information about these people as footnotes. The first sentence in a Wikipedia article is most of the time good enough for that.
A macro `\infofromwikipedia{Donald Knuth}` would be nice which gets the first sentence of the article and puts an item into the bibliography.
This actually requires a more detailed spec. What happens if there is more than one person with the same name? See, e.g., http://en.wikipedia.org/wiki/Wolfgang_Schuster
There is even an API to access articles [2]. Besides coding that up I see the following problems.
1. The output [3] needs to be converted to ConTeXt.
I don't see anything in the API specs that returns the contents of the page. My guess is that simply downloading the HTML page and scraping the main paragraph might be easier. Once the data is retrieved, using ConTeXt to typeset HTML is fairly easy.

Another option is to just use one of the existing scripts to scrape the first paragraph/first line from Wikipedia, e.g.,

http://stackoverflow.com/questions/1565347/get-first-lines-of-wikipedia-arti...
http://query7.com/scrape-the-first-paragraph-image-from-a-wikipedia-entry

and use the filter module to call them.

Aditya
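If an external script does the scraping, its output still has to be made safe for ConTeXt before it is typeset. The following Python sketch shows one way such a script could escape plain text and wrap it in a `\footnote`. The helper names `escape_context` and `context_footnote` are hypothetical, and the escape table is an assumption about which characters need protection in a typical ConTeXt setup, not a complete or authoritative list.

```python
# Assumed mapping of characters that are special in TeX/ConTeXt to
# escaped forms; adjust to the catcode regime actually in use.
TEX_ESCAPES = {
    "\\": r"\letterbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "%": r"\%",
    "#": r"\#",
    "&": r"\&",
    "$": r"\$",
    "_": r"\_",
    "^": r"\letterhat{}",
    "~": r"\lettertilde{}",
    "|": r"\letterbar{}",
}


def escape_context(text: str) -> str:
    """Escape special characters one at a time (avoids double escaping)."""
    return "".join(TEX_ESCAPES.get(ch, ch) for ch in text)


def context_footnote(sentence: str) -> str:
    """Wrap an already extracted plain-text sentence in a ConTeXt footnote."""
    return "\\footnote{" + escape_context(sentence) + "}"


if __name__ == "__main__":
    # The script prints ConTeXt source that a calling document could input.
    print(context_footnote("Donald Knuth is an American computer scientist."))
```

Escaping character by character keeps the braces of the replacement text itself from being escaped a second time.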
participants (5)
- Aditya Mahajan
- Hans Hagen
- Khaled Hosny
- Paul Menzel
- Philipp Gesang