Re: [NTG-context] DOC/RTF to ConTeXt via XML
No need for rtf. That would lose lots of information anyway, wouldn't it?
RTF can capture everything that .doc can (MS update it every time they rev the .doc format), and it has the advantage that it is defined in a spec with a grammar, which means that importing routines (like the one in OO.o) tend to be better than for the binary .doc format. So I would usually use .rtf as the Save As... from Word, rather than relying on OO.o's reverse engineering of the .doc format. Others' experiences may vary, of course, and perhaps I do an injustice to OO.o's Word imports, which have certainly improved. But RTF is a fairly safe bet, and additionally it is 'human readable' so that helps debugging.
\startHans converting open office xml is not always easy; stay away from tabs and use high-level constructs as much as possible \stopHans
I would add to this - make sure you use either OO.o 1.1.5 or a 2.0 Beta, since earlier versions used a file format which was a lot trickier to post-process (problems with conflating styles into paragraph formats).
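For anyone who wants to look at the XML in question: an OO.o Writer document (.sxw, or .odt in the 2.0 format) is a zip archive, and the document body lives in a member called content.xml. A minimal Python sketch for pulling it out follows. The function name read_content_xml and the tiny in-memory archive are made up for illustration; a real file will contain much more markup.

```python
import io
import zipfile


def read_content_xml(document):
    """Return content.xml from an OO.o document (path or file-like object)."""
    with zipfile.ZipFile(document) as archive:
        return archive.read("content.xml").decode("utf-8")


# Stand-in for a real document: a zip archive with a content.xml member.
# (Hypothetical content, only here so the sketch runs without a real file.)
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("content.xml", "<office:document-content/>")
demo.seek(0)

print(read_content_xml(demo))
```

With a real file you would pass its path instead, e.g. read_content_xml("report.odt") (file name assumed).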
Once I get a sane xml file (this seems to be the biggest problem) what is the best tool to convert this to ConTeXt?
Well you might not need to - remember that ConTeXt can process XML natively now, which is why I suggested you look at the DocBook-in-ConTeXt project, which uses this feature. You wouldn't necessarily have to use the DocBook standard, but you could use the principles of that project to define a nice output from your own (simple) brand of XML. Duncan
Duncan Hothersall wrote:
RTF can capture everything that .doc can (MS update it every time they rev the .doc format), and it has the advantage that it is defined in a spec with a grammar, which means that importing routines (like the one
Oh, yes, the RTF spec. It really makes you wonder what Microsoft employees understand by the word “spec.” Word breaks almost every single rule in that spec and has done so for ages: “The LetterSequence is made up of lowercase alphabetic characters (a-z). RTF is case sensitive. The following Word 97-2000 keywords do not currently follow the requirement that keywords may not contain any uppercase alphabetic characters. ...” But I should be happy that these violations are actually documented.
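To see which keywords in a given file break the lowercase rule quoted above, something like the following Python sketch would do. It is a rough scanner, not a full RTF parser: it assumes a control word is a backslash followed by letters and an optional numeric parameter, and treats any other backslash sequence as a control symbol. The function name and the sample string are invented for illustration.

```python
import re

# Backslash + letters (+ optional numeric parameter) is a control word;
# backslash + any other single character is a control symbol such as \\ or \{.
_CONTROL = re.compile(r"\\(?:([A-Za-z]+)(-?\d+)?|.)", re.DOTALL)


def mixed_case_keywords(rtf_text):
    """Return the sorted control words that contain uppercase letters."""
    found = set()
    for match in _CONTROL.finditer(rtf_text):
        word = match.group(1)
        if word is not None and word != word.lower():
            found.add(word)
    return sorted(found)


# Invented sample; the escaped backslash before "Literal" must not count.
sample = r"{\rtf1\ansi{\fonttbl{\f0 Times;}}\pard\clFitText Hello \\Literal\par}"
print(mixed_case_keywords(sample))
```

Matching control symbols in the same pattern is what keeps an escaped backslash followed by ordinary text from being reported as a keyword.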
in OO.o) tend to be better than for the binary .doc format. So I would
Okay; I did not know that whatever Microsoft currently calls RTF is actually able to save all Word files losslessly. (I am in the lucky position not to have any Word files to convert.) Makes me wonder if there really is any need for an XML step in between. Can OOo convert RTF to XML without user intervention, such as clicking somewhere with a mouse? Maybe rtf2fo.com, http://www.infinity-loop.de/products/upcast/, or http://sourceforge.net/projects/majix/ are good alternatives for this step? (I have never used any of them.)
which have certainly improved. But RTF is a fairly safe bet, and additionally it is 'human readable' so that helps debugging.
Asking a human to read RTF is certainly inhuman. :-) But there is another advantage of using RTF: Authors can use almost any word processor they want. :-)
Well you might not need to - remember that ConTeXt can process XML natively now, which is why I suggested you look at the
But unless I'm mistaken, this is based on a streaming model, which has its advantages, but also disadvantages. So, the question is whether the xml format is close enough to the order in which ConTeXt would like to get the bits and pieces. Since the format has not been defined yet, this question should be kept in mind. Christopher
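The ordering concern can be illustrated outside ConTeXt with a small Python sketch. It says nothing about how ConTeXt's XML support is implemented; it only contrasts a streaming pass, which sees elements in document order, with a tree pass, which can reorder them. The element names article, para and title are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input in which the title arrives after the body text.
DOC = "<article><para>Body text.</para><title>Late title</title></article>"


def streamed_order(xml_text):
    """Collect elements in the order the parser finishes them."""
    out = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("title", "para"):
            out.append((elem.tag, elem.text))
    return out


def tree_order(xml_text):
    """Load the whole tree first, then emit titles before paragraphs."""
    root = ET.fromstring(xml_text)
    out = [("title", t.text) for t in root.iter("title")]
    out += [("para", p.text) for p in root.iter("para")]
    return out


print(streamed_order(DOC))  # paragraph first, as in the input
print(tree_order(DOC))      # title first, because the tree allows reordering
```

If the XML format is designed so that elements already arrive in the order the typesetter needs them, the streaming pass is enough and no reordering step is required.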
participants (2): Christopher Creutzig, Duncan Hothersall