Re: [NTG-context] Doc to ConTeXt [was Re: HTML to ConTeXt]

10 Nov 2007

Hi Andrea,

I face a similar issue when organizing large documents prepared
by members of my group (many people here are not conversant
with TeX and write their documents in Word). My solution was to take
their input through a wiki and convert the resulting HTML to ConTeXt
markup using filters written in Ruby (see also
http://wiki.contextgarden.net/HTML_and_ConTeXt). Converting
HTML syntax to ConTeXt syntax is very doable.

In case it is of any use, I attach the Ruby filters I use for
this purpose. BTW, I use a Ruby library called "hpricot" to ease
some of these conversions.

saji
...

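require 'rubygems'
require 'hpricot'
require 'open-uri'  # assumption: pagePath may be a URL; not needed for local files

# pagePath: the wiki page to convert; oFile, hFile: open files for the
# ConTeXt output and for the cleaned HTML
def scrape_the_page(pagePath,oFile,hFile)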
items_to_remove = [
  "#menus",        #menus notice
  "div.markedup",
  "div.navigation",
  "head",          #table of contents 
  "hr"
  ]

doc=Hpricot(open(pagePath))
# this may not be applicable to your case
# this removes some unnecessary markup from the Wiki pages

@article = (doc/"#container").each do |content|
  #remove unnecessary content and edit links
  items_to_remove.each { |x| (content/x).remove }
end 

# Write HTML content to file
hFile.write @article.inner_html

# How to replace various syntactic elements using Hpricot
# replace b elements nested one level deeper inside p (p/*/b) with \bf
(@article/"p/*/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace p/b element with \bf
(@article/"p/b").each do |pb|
  pb.swap("{\\bf #{pb.inner_html}}")
end

# replace strong element with \bf
(@article/"strong").each do |ps|
  ps.swap("{\\bf #{ps.inner_html}}")
end

# replace h1 element with section
(@article/"h1").each do |h1|
  h1.swap("\\section{#{h1.inner_html}}")
end

# replace h2 element with subsection
(@article/"h2").each do |h2|
  h2.swap("\\subsection{#{h2.inner_html}}")
end

# replace h3 element with subsubsection
(@article/"h3").each do |h3|
  h3.swap("\\subsubsection{#{h3.inner_html}}")
end

# replace h4 element with subsubsubsection
(@article/"h4").each do |h4|
  h4.swap("\\subsubsubsection{#{h4.inner_html}}")
end

# replace h5 element with subsubsubsubsection
(@article/"h5").each do |h5|
  h5.swap("\\subsubsubsubsection{#{h5.inner_html}}")
end

# replace <pre><code> by the equivalent command in ConTeXt
(@article/"pre").each do |pre|
  pre.swap("\\startcode \n #{pre.at("code").inner_html} \n \\stopcode")
end

# when we encounter a reference to a figure inside the html
# we replace it with a ConTeXt reference

(@article/"a").each do |a|
  a.swap("\\in[#{a.inner_html}]")
end

# parse the comma-separated options stored in the 'alt' attribute of <img>
# and replace <p><img> by the equivalent command in ConTeXt
(@article/"p/img").each do |img|

  img_attrs=img.attributes['alt'].split(",")

  # separate the file name from the extension
  # have to take care of file names that have a "." embedded in them
  img_src=img.attributes['src'].reverse.sub(/\w+\./,"").reverse
  # puts img_src
  # see if position of figure is indicated
  img_pos="force"
  img_attrs.each do |arr| 
    img_pos=arr.gsub("position=","") if arr.match("position=")
  end
  img_attrs.delete("position=#{img_pos}") unless img_pos=="force" 

  # see if the array img_attrs contains a reference keyword
  if img_attrs.first.match(/\w+[=]\w+/)
    img_id=" "
  else
    img_id=img_attrs.first
    img_attrs.delete_at(0)
  end

  if img_pos=="force"
    if img.attributes['title']
      img.swap("
      \\placefigure\n 
      [#{img_pos}][#{img_id}] \n 
      {#{img.attributes['title']}} \n 
      {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}  \n
              ")
    else
      img.swap("
      \\placefigure\n 
      [#{img_pos}] \n
      {none} \n
      {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} 
              ")
    end
  else
    if img.attributes['title']
      img.swap("
      \\placefigure\n 
      [#{img_pos}][#{img_id}] \n 
      {#{img.attributes['title']}} \n 
      {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}  \n
              ")
    else
      img.swap("
      \\placefigure\n 
      [#{img_pos}] \n
      {none} \n
      {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]}
       \n 
              ")
    end
  end

end # end of converting inside (@article/"p/img")

# search for tables: if we find a caption, keep it; if not, add an empty one

# Styling options: Here I catch the div element called Col2 and
# format the tex document in 2 columns

# Tables : placing them
# replace <table> by the equivalent command in ConTeXt
(@article/"table").each do |tab|
  if tab.at("caption")
  tab.swap("
  \\placetable[split]{#{tab.at("caption").inner_html}}\n
  {\\bTABLE \n
  #{tab.inner_html}
  \\eTABLE} 
             ")
  else
  tab.swap("
   \\placetable[split]{}\n
   {\\bTABLE \n
  #{tab.inner_html}
  \\eTABLE} \n 
            ")
  end
end

# Tables: remove the caption
(@article/"caption").each do |cap|
  cap.swap("\n")
end

# Now we transfer the syntactically altered html to a string Object
# and manipulate that object further

newdoc=@article.inner_html

# remove empty space in the beginning
newdoc.gsub!(/^\s+/,"")

# remove or convert the remaining HTML elements and escape TeX special characters
newdoc.gsub!(/^/,"\n")
newdoc.gsub!(/<u>/,"")
newdoc.gsub!(/<\/u>/,"")
newdoc.gsub!(/<ul>/,"\\startitemize[1]")
newdoc.gsub!(/<\/ul>/,"\\stopitemize")
newdoc.gsub!(/<ol>/,"\\startitemize[n]")
newdoc.gsub!(/<\/ol>/,"\\stopitemize")
newdoc.gsub!(/<li>/,"\\item ")
newdoc.gsub!(/<\/li>/,"\n")
newdoc.gsub!("_","\\_")
newdoc.gsub!(/<table>/,"\\bTABLE \n")
newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
newdoc.gsub!(/<tr>/,"\\bTR ")
newdoc.gsub!(/<\/tr>/,"\\eTR ")
newdoc.gsub!(/<td>/,"\\bTD ")
newdoc.gsub!(/<\/td>/,"\\eTD ")
newdoc.gsub!(/<th>/,"\\bTH ")
newdoc.gsub!(/<\/th>/,"\\eTH ")
newdoc.gsub!(/<center>/,"")
newdoc.gsub!(/<\/center>/,"")
newdoc.gsub!(/<em>/,"{\\em ")
newdoc.gsub!(/<\/em>/,"}")
newdoc.gsub!("^","")
newdoc.gsub!("\%","\\%")
newdoc.gsub!("&amp;","&")   # decode the HTML entity first ...
newdoc.gsub!("&",'\\\&')    # ... then escape every ampersand for TeX
newdoc.gsub!("$",'\\$')
newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")

# ConTeXt does not mind "_" in figure file names and does not recognize \_ there,
# so I have to catch these and replace \_ with _

# First catch
filter=/\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/

if newdoc[filter]
newdoc.gsub!(filter) { |fString| 
fString.gsub("\\_","_") 
}
end

# Second catch
filter2=/\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/

if newdoc[filter2]
newdoc.gsub!(filter2) { |fString| 
fString.gsub("\\_","_") }
end

# Third catch; remove \_ inside []
filter3=/\[\w+\\_\w+\]/

if newdoc[filter3]
newdoc.gsub!(filter3) { |fString| 
puts fString
fString.gsub("\\_","_") }
end

# remove the comment tag, which we used to embed context commands
newdoc.gsub!("<!--","")
newdoc.gsub!("-->","")

# add full path to the images
newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")

newdoc.gsub!(/<\w+\s*\/>/,"")

#puts newdoc
# open file for output
#outfil="#{oFile}.tex"
#`rm #{outfil}`

#fil=File.new(outfil,"a")
#puts "Writing #{oFile}"
oFile.write newdoc

end
# imgProps={}
  #       img_attrs.each do |arr| 
  #       imgProps['width']=arr.gsub("width=","") if arr.match("width=")
  #       imgProps['position']=arr.gsub("position=","") if arr.match("position=")
  #       end
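
A note on the escaping steps in the script above: the separate gsub! calls for "_", "%", "&" and "$" each rescan the whole string, so their order matters, and figure paths have to be un-escaped again afterwards. As a minimal sketch, assuming the helper is only ever given plain text (not markup or file names), the same four escapes can be done in a single pass. The names TEX_ESCAPES and escape_tex are made up for this illustration and are not part of the attached filters.

```ruby
# Hypothetical helper, not part of the attached filters.
# Maps each special character escaped by the script to its TeX form.
TEX_ESCAPES = {
  "_" => "\\_",
  "%" => "\\%",
  "&" => "\\&",
  "$" => "\\$"
}.freeze

# Escape all four characters in one pass. The block form of gsub
# inserts its return value literally, so the backslashes in the
# replacement need no extra escaping.
def escape_tex(text)
  text.gsub(/[_%&$]/) { |char| TEX_ESCAPES[char] }
end

puts escape_tex("50% of R&D costs $3 (see fig_1)")
```

Because every character is visited once, an already inserted backslash is never escaped a second time. Applying such a helper only to text nodes, and never to image paths or reference keys, would be one way to avoid the three "catch" filters, though that is a design suggestion rather than what the script does.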

* Andrea Valle  [2007-11-10 02:30:36 +0100]:
...
Hi to all (Idris, in particular, as we are always dealing with the same 
problems... ),
I just want to share some thoughts about the ol' damn' problem of 
converting to ConTeXt from Word et al.
...
As I told Andrea: For relatively simple documents (like the kind we use in
academic journals) it seems we can now
1) convert doc to odt using OOo
2) convert odt to markdown using
As suggested by Idris, I subscribed to the pandoc list, but I have to say
that the activity there is not exactly like that on the ConTeXt list...
So the actual support for ConTeXt conversion is not convincing. Besides, it's
always better to get your hands dirty on your own machine...
My problem is to convert a series of academic journals to ConTeXt. They
come from the Humanities, so there is little structure (basically body text
and footnotes).
I have no intention of automating everything; I'd just like to be
faster and more accurate in the conversion.
(No particular interest in figures, as they are few, and not much in
references: they tend to be typographically inconsistent if done
in a WYSIWYG environment, and so are difficult to parse.)
Moreover, as the journal has already been published, we need to work with
the final PDFs.
After wasting my time with an awful PDF-to-HTML converter by Acrobat, I
discovered this, which you may all know:
http://pdftohtml.sourceforge.net/
The HTML conversion is very good, both in the rendered result and in the
sources, but after some tweaking I got interested in the XML conversion it
also allows.
The XML format essentially encodes the information related to the page;
typically each line is an element. In addition, bold and italics are marked
simply as <b> and <i>.
I'm still struggling to get anything really practical out of XML
processing in ConTeXt, so I switched back to Python.
I used an incremental SAX parser with some replacements.
This is today's draft.
Original:
http://www.semiotiche.it/andrea/membrana/02%20imp.pdf
Recomposed (no setup at all, only \enableregime[utf]):
http://www.semiotiche.it/andrea/membrana/02imp.pdf
pdf --> pdftoxml --> xml --> python script --> tex --> pdf
I recovered paragraphs, bold, emphasis and footnotes, stripping end-of-line
dashes and reassembling the text with footnote references. Not bad as a
first step.
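
The Python script itself is not shown in this message. As a rough illustration of the approach described (a stream parser that turns <b> and <i> into font switches and glues lines broken at a hyphen), here is a sketch in Ruby, chosen to match the filters earlier in the thread, using the REXML stream parser that ships with Ruby. The sample XML, the class name LineCollector and the helper join_lines are assumptions made for this example; real pdftohtml output carries more attributes, and footnote handling is left out entirely.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects one string per <text> element and
# marks bold and italic runs with ConTeXt font switches.
class LineCollector
  include REXML::StreamListener

  attr_reader :lines

  def initialize
    @lines = []
    @current = nil   # non-nil only while inside a <text> element
  end

  def tag_start(name, _attrs)
    case name
    when "text" then @current = String.new
    when "b"    then @current << "{\\bf " if @current
    when "i"    then @current << "{\\em " if @current
    end
  end

  def text(content)
    @current << content if @current
  end

  def tag_end(name)
    case name
    when "b", "i"
      @current << "}" if @current
    when "text"
      @lines << @current.strip
      @current = nil
    end
  end
end

# Join lines into running text. A line ending in "-" is glued to the
# next one without the dash; this is naive and would also merge a
# genuine compound word that happens to break at its hyphen.
def join_lines(lines)
  lines.reduce("") do |out, line|
    if out.end_with?("-")
      out.chop + line
    elsif out.empty?
      line
    else
      out + " " + line
    end
  end
end

# Made-up sample in the spirit of the XML described above.
sample = <<XML
<pdf2xml>
  <page number="1">
    <text top="10" left="20">A <b>bold</b> state-</text>
    <text top="30" left="20">ment with <i>emphasis</i>.</text>
  </page>
</pdf2xml>
XML

collector = LineCollector.new
REXML::Document.parse_stream(sample, collector)
puts join_lines(collector.lines)
```

The listener never builds a document tree, so memory use stays flat even for a whole journal issue, which is the main attraction of the incremental approach.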
I guess that you XML gurus could probably do this much more easily and cleanly.
So, just for my very specific needs, I can probably take the Word
sources, convert them to PDF and then finally reach ConTeXt as discussed.
Just some ideas to share with the list.
Best
-a-
--------------------------------------------------
Andrea Valle
--------------------------------------------------
CIRMA - DAMS
Università degli Studi di Torino
--> http://www.cirma.unito.it/andrea/
--> andrea.valle@unito.it
--------------------------------------------------

...
-- 
Saji N. Hameed

APEC Climate Center          				+82 51 668 7470
National Pension Corporation Busan Building 12F         
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705			saji@apcc21.net
KOREA

Saji Njarackalazhikam Hameed