HTML to ConTeXt

Aditya Mahajan

25 Oct 2007

4:50 p.m.

This is interesting: a website that converts HTML to ConTeXt (it actually uses markdown behind the scenes).

http://johnmacfarlane.net/pandoc/html2x.html

This is what the ConTeXt wiki looks like:

http://johnmacfarlane.net/cgi-bin/html2x.pl?url=http%3A%2F%2Fwiki.contextgarden.net%2FMain_Page&format=context

The program is written in Haskell and is also available for download. You can use it to convert markdown to ConTeXt. I had been looking for something like this for a while, for cases where multiple formats are needed: write in markdown and generate HTML or ConTeXt. I do not completely like the ConTeXt output it generates (for example, http://johnmacfarlane.net/pandoc/README gets converted to http://johnmacfarlane.net/pandoc/example11.tex.html).

Aditya

Idris Samawi Hamid

25 Oct

10:17 p.m.

Hi Aditya,

This looks very promising. Perhaps some of us can help the developers to improve the ConTeXt support.

Thank you very much for sharing this!

Best wishes
Idris

--
Professor Idris Samawi Hamid, Editor-in-Chief
International Journal of Shi`i Studies
Department of Philosophy
Colorado State University
Fort Collins, CO 80523

Aditya Mahajan

26 Oct

6:22 a.m.

On Thu, 25 Oct 2007, Idris Samawi Hamid wrote:

> This looks very promising. Perhaps some of us can help the developers to improve the ConTeXt support.

I will explore pandoc in more detail in the future. I am more interested in it from the point of view of understanding Haskell parsers, but improving the ConTeXt output will definitely not hurt.

Aditya

Idris Samawi Hamid

1:37 p.m.

New subject: Doc to ConTeXt [was Re: HTML to ConTeXt]

On Thu, 25 Oct 2007 22:22:46 -0600, Aditya Mahajan wrote:

> I will explore pandoc in more detail in the future. I am more interested in it from the point of view of understanding Haskell parsers, but improving the context output will definitely not hurt.

Ah, you're missing a big point in your discovery ;-) As I told Andrea: for relatively simple documents (like the kind we use in academic journals) it seems we can now

1) convert doc to odt using OOo,
2) convert odt to markdown using http://wiki.services.openoffice.org/wiki/Odt2txt.py,
3) use the pandoc utility to convert markdown to ConTeXt.

As for the pandoc list, we may be able to influence the final ConTeXt output by making suggestions, reporting bugs, etc.

If we can convert Odt2txt.py to Lua, maybe this workflow can be partly integrated into ConTeXt itself someday.*****

The pandoc developer seems interested in improving ConTeXt support (see my forwarded mail), so this is a good opportunity for all those who need a decent doc=>context workflow.

Best wishes
Idris

*****Or maybe we can just port Odt2txt.py to give direct ConTeXt output and forget the markdown layer entirely. Any ideas on how hard that would be?
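Step 3 of the workflow above could be scripted roughly as follows. This is only a minimal sketch, assuming pandoc is installed and on the PATH; the method names and file names are made up for illustration, while --from, --to, --standalone and --output are pandoc's ordinary command-line options.

```ruby
# Sketch of step 3 (markdown -> ConTeXt) as a small wrapper around pandoc.
# The method names and file names are illustrative, not part of any tool.

# Build the argument list separately so it can be inspected without running pandoc.
def pandoc_context_command(markdown_file, tex_file)
  ["pandoc", "--from=markdown", "--to=context", "--standalone",
   "--output=#{tex_file}", markdown_file]
end

# Run pandoc. Kernel#system returns true on success, false on a non-zero
# exit status, and nil if the pandoc executable cannot be found.
def markdown_to_context(markdown_file, tex_file)
  system(*pandoc_context_command(markdown_file, tex_file))
end

# Usage (hypothetical file names):
#   markdown_to_context("article.md", "article.tex") or warn "pandoc failed"
```

Keeping the command as an array avoids shell quoting problems with file names that contain spaces.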

Andrea Valle

10 Nov

2:30 a.m.

Hi to all (Idris in particular, as we are always dealing with the same problems...),

I just want to share some thoughts about the ol' damn' problem of converting to ConTeXt from Word et al.

> As I told Andrea: For relatively simple documents (like the kind we use in academic journals) it seems we can now
>
> 1) convert doc to odt using OOo 2) convert odt to markdown using

As suggested by Idris, I subscribed to the pandoc list, but I have to say that the activity there is not exactly like that on the ConTeXt list... So the actual support for ConTeXt conversion is not convincing. Moreover, it's always better to get your hands on your own machine...

My problem is to convert a series of academic journals to ConTeXt. They come from the Humanities, so there is little structure (basically, body text and footnotes). Far be it from me to do everything automatically; I'd just like to be faster and more accurate in the conversion. (I have no particular interest in figures, as they are few, and not much in references: they tend to be typographically inconsistent if done in a WYSIWYG environment, and so are difficult to parse.) Moreover, as the journal has already been published, we need to work from the final PDFs.

After wasting my time with an awful PDF-to-HTML converter by Acrobat, I discovered this, which you may all know:

http://pdftohtml.sourceforge.net/

The HTML conversion is very, very good, both in the resulting rendering and in the sources, but after some tweaking I got interested in the XML conversion it allows. The XML format essentially encodes page-related information; typically each line is an element. In addition, bold and italics are marked simply as <b> and <i>.

I'm still struggling to understand how to do anything practical with XML processing in ConTeXt, so I switched back to Python. I used an incremental SAX parser with some replacements.

This is today's draft.

Original:
http://www.semiotiche.it/andrea/membrana/02%20imp.pdf

Recomposed (no setup at all, only \enableregime[utf]):
http://www.semiotiche.it/andrea/membrana/02imp.pdf

pdf --> pdftoxml --> xml --> python script --> tex --> pdf

I recovered par, bold, em and footnotes, stripping dashes and reassembling the text with footnote references. Not bad as a first step.

I guess that you XML gurus could probably do this much more easily and cleanly.

So, I mean, just for my very specific needs, I can probably take Word sources, convert them to PDF and then finally reach ConTeXt as discussed.

Just some ideas to share with the list.

Best

-a-

--------------------------------------------------
Andrea Valle
--------------------------------------------------
CIRMA - DAMS
Università degli Studi di Torino
--> http://www.cirma.unito.it/andrea/
--> andrea.valle@unito.it
--------------------------------------------------

I did this interview where I just mentioned that I read Foucault. Who doesn't in university, right? I was in this strip club giving this guy a lap dance and all he wanted to do was to discuss Foucault with me. Well, I can stand naked and do my little dance, or I can discuss Foucault, but not at the same time; too much information.
(Annabel Chong)
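The reassembly step described in this message (one XML element per printed line, inline <b> and <i>, hyphens stripped at line ends) can be sketched as below. This is not Andrea's Python script, which was not posted; it is a rough Ruby illustration that assumes each line arrives as a <text ...>...</text> element, uses regular expressions instead of a SAX parser, and treats a trailing "-" followed by a lowercase letter as a hyphenation break. All method names and the sample data are invented.

```ruby
# Sketch: rebuild running text from line-per-element XML.
# Method names and sample data are illustrative only.

# Pull the content of every <text ...>...</text> element, in document order.
def text_lines(xml)
  xml.scan(%r{<text\b[^>]*>(.*?)</text>}m).map { |match| match.first.strip }
end

# Join lines into one paragraph, undoing end-of-line hyphenation.
def join_lines(lines)
  lines.inject("") do |para, line|
    if para.end_with?("-") && line =~ /\A[[:lower:]]/
      para.chomp("-") + line      # "conver-" + "sion" becomes "conversion"
    elsif para.empty?
      line
    else
      para + " " + line
    end
  end
end

# Map the inline markup to ConTeXt font switches.
def inline_to_context(text)
  text.gsub(%r{<b>(.*?)</b>}m) { "{\\bf #{$1}}" }
      .gsub(%r{<i>(.*?)</i>}m) { "{\\em #{$1}}" }
end

sample = <<~XML
  <text top="10" left="5" width="90" height="12" font="0">A first step in the conver-</text>
  <text top="24" left="5" width="90" height="12" font="0">sion of <b>old</b> issues</text>
  <text top="38" left="5" width="90" height="12" font="0">to <i>ConTeXt</i>.</text>
XML

puts inline_to_context(join_lines(text_lines(sample)))
```

A real converter would also have to handle genuine hyphens at line ends, markup that is split across two lines, and the footnote references mentioned above; none of that is attempted here.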

Idris Samawi Hamid

4:14 a.m.

Hi Andrea,

On Fri, 09 Nov 2007 18:30:36 -0700, Andrea Valle wrote:

>> As I told Andrea: For relatively simple documents (like the kind we use in academic journals) it seems we can now
>>
>> 1) convert doc to odt using OOo 2) convert odt to markdown using

http://wiki.services.openoffice.org/wiki/Odt2txt.py 3) use the pandoc utility to convert markdown to ConTeXt. [you left this out]

> As suggested by Idris, I subscribed to the pandoc list, but I have to say that the activity is not exactly like the one on the ConTeXt list... So the actual support for ConTeXt conversion is not convincing.

Did you try the markdown-to-ConTeXt conversion? The doc-odt-markdown-context workflow seems pretty useful as is. See also

http://code.google.com/p/pandoc/wiki/ConTeXtImprovements

I'm working on something else related to this issue that I hope to say more about in the coming weeks ;-)

Best wishes
Idris

Andrea Valle

12:25 p.m.

Hi Idris,

On 10 Nov 2007, at 04:14, Idris Samawi Hamid wrote:

> [you left this out]

Sorry, it was just to refer to the discussion.

> The doc-odt-markdown-context workflow seems pretty useful as is. See also

I will try it in more depth. My main problem for now is working from PDFs, because they're past issues. Once I have new contributions, I will be there for sure :). I'm also curious to see whether mine could be a more general approach to (word-->)pdf-->context conversion. I have just started on it.

> http://code.google.com/p/pandoc/wiki/ConTeXtImprovements

Oh, yes, quite useful. But has anyone replied to this on the pandoc list? I thought no one had.

> I'm working on something else related to this issue that I hope to say more about in the coming weeks ;-)

Looking forward to seeing the news :)

Best

-a-

Andrea Valle

1:09 p.m.

> In any case, I upload the wrong reconstructed pdf (too late at night...)

(Sorry, I'm always making too many typos: "I uploaded", indeed)

-a-

Idris Samawi Hamid

4:33 a.m.

On Fri, 09 Nov 2007 18:30:36 -0700, Andrea Valle wrote:

> After wasting my time with an awful pdf to html converter by Acrobat, I discovered this, you may all know: http://pdftohtml.sourceforge.net/

Looks impressive...

> The html conversion is very very good in resulting rendering and also in sources, but after some tweakings I got interested in the xml conversion it allows. The xml format substantially encodes the infos related to page, typically each line is an element. Plus, there are bold and italics marked easily as <b> and <i>. I'm still struggling to understand something really operative of XML processing in ConTeXt, so I switched back to Python. I used an incremental sax parser with some replacement. This is today's draft. Original: http://www.semiotiche.it/andrea/membrana/02%20imp.pdf
>
> Recomposed (no setup at all, only \enableregime[utf]): http://www.semiotiche.it/andrea/membrana/02imp.pdf

Looks VERY impressive... Tell me, how did you set up the cropmarks etc.?

> pdf --> pdftoxml --> xml --> python script --> tex --> pdf
>
> I recovered par, bold, em, footnotes, stripping dashes and reassembling the text with footnote references. Not bad as a first step.

Did you also try pdftohtml --> html --> context?

> I guess that you xml gurus could probably do much easier and cleaner. So, I mean, just for my very specific needs, I can probably take word sources, convert to pdf and then finally reach ConTeXt as discussed.

Again, very nice stuff!

Best wishes
Idris

Andrea Valle

12:59 p.m.

>> http://pdftohtml.sourceforge.net/
>
> Looks impressive...

I think so.

> Looks VERY impressive... Tell me, how did you set up the cropmarks etc.?

Mmh, maybe you are referring to the original source? The output one is bare-bones (which is what I need). In any case, I upload the wrong reconstructed pdf (too late at night...). The reconstructed PDF (pdf --> xml --> context --> pdf) is this one, where footnotes are handled correctly (an important point for me):

http://www.semiotiche.it/andrea/membrana/text.pdf

I rendered it with XeConTeXt. As noted by Mojca, there are some problems with double quotation marks. In relation to footnote 1, this is what is coded in the source: "Sign and Reality"

> Did you also try pdftohtml --> html --> context?

No. Are you suggesting going via pandoc? Good point. The exported HTML is very clean. In general, it seems that the idea is not to generate information about document structure (as this should be inferred from the PDF), favoring a description of appearance instead.

Best

-a-

Idris Samawi Hamid

3:07 p.m.

On Sat, 10 Nov 2007 04:59:18 -0700, Andrea Valle wrote:

>> Looks VERY impressive... Tell me, how did you set up the cropmarks etc.?
>
> Mmh, maybe you're are referring to the original source? The output one is bare bone (what I need) In any case, I upload the wrong reconstructed pdf (too late at night...)

Ah! I had a feeling that was too good to be true ;-)

Best wishes
Idris

Andrea Valle

3:11 p.m.

Yes,

> Tell me, how did you set up the cropmarks etc.?

but cropmarks are the easy part, using layers... :)

Best

-a-

> ___________________________________________________________________________________
> If your question is of interest to others as well, please add an entry to the Wiki!
>
> maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
> webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
> archive  : https://foundry.supelec.fr/projects/contextrev/
> wiki     : http://contextgarden.net
> ___________________________________________________________________________________

Hans Hagen

8:08 p.m.

Andrea Valle wrote:

> Yes,
>
>> Tell me, how did you set up the cropmarks etc.?
>
> but cropmarks are the easy parts using layers... :)

Saji Njarackalazhikam Hameed

6:44 a.m.

Hi Andrea,

I face a similar issue while organizing large-scale documents prepared by members of my group (many folks here are not conversant with TeX and write their documents with Word). My solution was to take their input through a wiki and convert the HTML to ConTeXt markup using filters written in Ruby (see also http://wiki.contextgarden.net/HTML_and_ConTeXt).

Converting HTML syntax to ConTeXt syntax is very doable. In case it is of any use, I attach the Ruby filters I use for my purpose. BTW, I use a Ruby library called "hpricot" to ease some of these conversions.

saji

...

    require 'hpricot'

    def scrape_the_page(pagePath,oFile,hFile)
      items_to_remove = [
        "#menus",          #menus notice
        "div.markedup",
        "div.navigation",
        "head",            #table of contents
        "hr"
      ]

      doc=Hpricot(open(pagePath))

      # this may not be applicable to your case
      # this removes some unnecessary markup from the Wiki pages
      @article = (doc/"#container").each do |content|
        #remove unnecessary content and edit links
        items_to_remove.each { |x| (content/x).remove }
      end

      # Write HTML content to file
      hFile.write @article.inner_html

      # How to replace various syntactic elements using Hpricot
      # replace b elements nested one level deeper inside p with \bf
      (@article/"p/*/b").each do |pb|
        pb.swap("{\\bf #{pb.inner_html}}")
      end

      # replace p/b element with \bf
      (@article/"p/b").each do |pb|
        pb.swap("{\\bf #{pb.inner_html}}")
      end

      # replace strong element with \bf
      (@article/"strong").each do |ps|
        ps.swap("{\\bf #{ps.inner_html}}")
      end

      # replace h1 element with section
      (@article/"h1").each do |h1|
        h1.swap("\\section{#{h1.inner_html}}")
      end

      # replace h2 element with subsection
      (@article/"h2").each do |h2|
        h2.swap("\\subsection{#{h2.inner_html}}")
      end

      # replace h3 element with subsubsection
      (@article/"h3").each do |h3|
        h3.swap("\\subsubsection{#{h3.inner_html}}")
      end

      # replace h4 element with subsubsubsection
      (@article/"h4").each do |h4|
        h4.swap("\\subsubsubsection{#{h4.inner_html}}")
      end

      # replace h5 element with subsubsubsubsection
      (@article/"h5").each do |h5|
        h5.swap("\\subsubsubsubsection{#{h5.inner_html}}")
      end

      # replace <pre><code> by equivalent command in context
      (@article/"pre").each do |pre|
        pre.swap("\\startcode \n #{pre.at("code").inner_html} \n \\stopcode")
      end

      # when we encounter a reference to a figure inside the html
      # we replace it with a ConTeXt reference
      (@article/"a").each do |a|
        a.swap("\\in[#{a.inner_html}]")
      end

      # read the 'alt' attribute inside the <img> element
      # replace <p><img> by equivalent command in context
      (@article/"p/img").each do |img|
        img_attrs=img.attributes['alt'].split(",")

        # separate the file name from the extension
        # have to take care of file names that have a "." embedded in them
        img_src=img.attributes['src'].reverse.sub(/\w+\./,"").reverse
        # puts img_src

        # see if position of figure is indicated
        img_pos="force"
        img_attrs.each do |arr|
          img_pos=arr.gsub("position=","") if arr.match("position=")
        end
        img_attrs.delete("position=#{img_pos}") unless img_pos=="force"

        # see if the array img_attrs contains a referral key word
        if img_attrs.first.match(/\w+[=]\w+/)
          img_id=" "
        else
          img_id=img_attrs.first
          img_attrs.delete_at(0)
        end

        if img_pos=="force"
          if img.attributes['title']
            img.swap(" \\placefigure\n [#{img_pos}][#{img_id}] \n {#{img.attributes['title']}} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
          else
            img.swap(" \\placefigure\n [#{img_pos}] \n {none} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} ")
          end
        else
          if img.attributes['title']
            img.swap(" \\placefigure\n [#{img_pos}][#{img_id}] \n {#{img.attributes['title']}} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
          else
            img.swap(" \\placefigure\n [#{img_pos}] \n {none} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
          end
        end
      end # end of converting inside (@article/"p/img")

      # why not search for table and if we find caption, keep it ; if not add an empty
      # Styling options: Here I catch the div element called Col2 and
      # format the tex document in 2 columns

      # Tables : placing them
      # replace <table> by equivalent command in context
      (@article/"table").each do |tab|
        if tab.at("caption")
          tab.swap(" \\placetable[split]{#{tab.at("caption").inner_html}}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} ")
        else
          tab.swap(" \\placetable[split]{}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} \n ")
        end
      end

      # Tables: remove the caption
      (@article/"caption").each do |cap|
        cap.swap("\n")
      end

      # Now we transfer the syntactically altered html to a string Object
      # and manipulate that object further
      newdoc=@article.inner_html

      # remove empty space in the beginning
      newdoc.gsub!(/^\s+/,"")

      # remove all elements we don't need.
      newdoc.gsub!(/^/,"\n")
      newdoc.gsub!(/<u>/,"")
      newdoc.gsub!(/<\/u>/,"")
      newdoc.gsub!(/<ul>/,"\\startitemize[1]")
      newdoc.gsub!(/<\/ul>/,"\\stopitemize")
      newdoc.gsub!(/<ol>/,"\\startitemize[n]")
      newdoc.gsub!(/<\/ol>/,"\\stopitemize")
      newdoc.gsub!(/<li>/,"\\item ")
      newdoc.gsub!(/<\/li>/,"\n")
      newdoc.gsub!("_","\\_")
      newdoc.gsub!(/<table>/,"\\bTABLE \n")
      newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
      newdoc.gsub!(/<tr>/,"\\bTR ")
      newdoc.gsub!(/<\/tr>/,"\\eTR ")
      newdoc.gsub!(/<td>/,"\\bTD ")
      newdoc.gsub!(/<\/td>/,"\\eTD ")
      newdoc.gsub!(/<th>/,"\\bTH ")
      newdoc.gsub!(/<\/th>/,"\\eTH ")
      newdoc.gsub!(/<center>/,"")
      newdoc.gsub!(/<\/center>/,"")
      newdoc.gsub!(/<em>/,"{\\em ")
      newdoc.gsub!(/<\/em>/,"}")
      newdoc.gsub!("^","")
      newdoc.gsub!("\%","\\%")
      # decode the HTML entity first, then escape the ampersand for TeX
      newdoc.gsub!("&amp;","&")
      newdoc.gsub!("&",'\\\&')
      newdoc.gsub!("$",'\\$')
      newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
      newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")

      # Context does not mind "_" in figures and does not recognize \_,
      # so i have to catch these and replace \_ with _

      # First catch
      filter=/\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
      if newdoc[filter]
        newdoc.gsub!(filter) { |fString| fString.gsub("\\_","_") }
      end

      # Second catch
      filter2=/\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
      if newdoc[filter2]
        newdoc.gsub!(filter2) { |fString| fString.gsub("\\_","_") }
      end

      # Third catch; remove \_ inside []
      filter3=/\[\w+\\_\w+\]/
      if newdoc[filter3]
        newdoc.gsub!(filter3) { |fString|
          puts fString
          fString.gsub("\\_","_")
        }
      end

      # remove the comment tag, which we used to embed context commands
      # (the search string did not survive in the archived copy of this mail)
      newdoc.gsub!("","")

      # add full path to the images
      newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")

      newdoc.gsub!(/<\w+\s*\/>/,"")

      #puts newdoc
      # open file for output
      #outfil="#{oFile}.tex"
      #`rm #{outfil}`
      #fil=File.new(outfil,"a")
      #puts "Writing #{oFile}"
      oFile.write newdoc
    end

    # imgProps={}
    # img_attrs.each do |arr|
    #   imgProps['width']=arr.gsub("width=","") if arr.match("width=")
    #   imgProps['position']=arr.gsub("position=","") if arr.match("position=")
    # end

--
Saji N. Hameed
APEC Climate Center, +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705, KOREA
saji@apcc21.net
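The filter in this message escapes TeX special characters with a chain of gsub! calls, so the result depends on their order (the entity &amp; has to be decoded before "&" is escaped, for instance). A possible alternative, shown here only as a sketch, is to decode entities in one pass and escape in a second pass, using the block form of gsub so that backslashes in the replacement are never reinterpreted. The constant and method names are invented, and both tables are deliberately incomplete.

```ruby
# Sketch: decode a few HTML entities, then escape TeX specials, one pass each.
# Names and tables are illustrative; extend them for real documents.
HTML_ENTITIES = { "&amp;" => "&", "&lt;" => "<", "&gt;" => ">", "&quot;" => '"' }
TEX_SPECIALS  = { "&" => "\\&", "%" => "\\%", "$" => "\\$", "#" => "\\#", "_" => "\\_" }

# Regexp.union quotes each key, so "$" and "#" are matched literally.
TEX_SPECIAL_PATTERN = Regexp.union(TEX_SPECIALS.keys)

def html_text_to_context(text)
  # Each entity is decoded exactly once, so "&amp;lt;" becomes "&lt;", not "<".
  decoded = text.gsub(/&(?:amp|lt|gt|quot);/) { |entity| HTML_ENTITIES[entity] }
  # With a block, the return value is inserted verbatim (no \1 or \& handling).
  decoded.gsub(TEX_SPECIAL_PATTERN) { |char| TEX_SPECIALS[char] }
end

puts html_text_to_context("R&amp;D costs 5% of $100_000")
```

This works on plain text only. File names and reference keys, which the original filter un-escapes again in its three "catch" steps, would be better kept out of the escaping step altogether.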

Andrea Valle

2:10 p.m.

Hi Saji,

Thanks, I've already looked at it. I will surely take your idea into account, as I'd like to convert my wiki pages (made with Wikka Wiki) to ConTeXt. In the end, the problem is HTML to ConTeXt. A powerful library indeed, as far as I can understand Ruby.

Best

-a-
