Non-printable Unicode control characters
Unicode has many "control characters" that only control text behaviour and shouldn't be rendered visually in the text, such as Bidi_Control and Join_Control chars (see http://www.unicode.org/Public/5.1.0/ucd/PropList.txt and http://unicode.org/Public/UNIDATA/UCD.html).

Currently, ConTeXt handles ZWJ and ZWNJ, but other characters get rendered if the font has glyphs for them, or have no effect at all if the font has no glyphs for them. I think the optimal behaviour is for those characters to affect text formatting without being rendered visually, whether or not the font has glyphs for them. It might also be useful to be able to enable rendering of those characters manually, for drafts and such.

Regards,
Khaled

--
Khaled Hosny
Arabic localizer and member of Arabeyes.org team
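For readers unfamiliar with the two properties mentioned above, here is a minimal Python sketch (not ConTeXt or LuaTeX code) that flags Bidi_Control and Join_Control characters in a string. The code point ranges are the ones listed under those properties in PropList.txt for Unicode 5.1.0; later Unicode versions add further Bidi_Control characters. The function names `control_class` and `list_controls` are made up for this illustration.

```python
# Ranges taken from the Bidi_Control and Join_Control entries of
# PropList.txt, Unicode 5.1.0 (newer versions list additional characters).
BIDI_CONTROL = set(range(0x200E, 0x2010)) | set(range(0x202A, 0x202F))
JOIN_CONTROL = {0x200C, 0x200D}  # ZWNJ, ZWJ


def control_class(ch):
    """Return the property name for a control character, or None."""
    cp = ord(ch)
    if cp in BIDI_CONTROL:
        return "Bidi_Control"
    if cp in JOIN_CONTROL:
        return "Join_Control"
    return None


def list_controls(text):
    """Return (index, code point label, property) for each control char."""
    found = []
    for i, ch in enumerate(text):
        prop = control_class(ch)
        if prop is not None:
            found.append((i, "U+%04X" % ord(ch), prop))
    return found
```

Such a check only identifies the characters; whether they are acted upon, shown or hidden is a separate decision.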
Khaled Hosny wrote:
Unicode has many "control characters" that only control text behaviour and shouldn't be rendered visually in the text, such as Bidi_Control and Join_Control chars (see http://www.unicode.org/Public/5.1.0/ucd/PropList.txt and http://unicode.org/Public/UNIDATA/UCD.html)
Currently, ConTeXt handles ZWJ and ZWNJ, but other characters get rendered if the font has glyphs for them, or have no effect at all if the font has no glyphs for them. I think the optimal behaviour is for those characters to affect text formatting without being rendered visually, whether or not the font has glyphs for them. It might also be useful to be able to enable rendering of those characters manually, for drafts and such.
actually we need:

- ignore them (like in verbatim)
- act upon them
and
- show them (might somehow interfere with other things)
- hide them

if i'm right, when bidi is turned on, those chars get processed and then discarded from the node list, so some more than zwj and zwnj is handled, and of course others need to be handled as well

Hans

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
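The modes listed above can be modelled roughly as follows. This is a Python sketch under stated assumptions, not how ConTeXt implements anything: "ignore" is read here as dropping the character without any effect, "act upon" is modelled as merely recording the character and its position, and "show" emits a bracketed placeholder instead of a font glyph. The names `Mode`, `CONTROLS` and `process` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    IGNORE = "ignore"              # drop the character, no effect (assumed reading)
    ACT_AND_SHOW = "act and show"  # record the effect, emit a visible marker
    ACT_AND_HIDE = "act and hide"  # record the effect, emit nothing


# A few Join_Control and Bidi_Control characters with short labels.
CONTROLS = {
    "\u200c": "ZWNJ", "\u200d": "ZWJ",
    "\u200e": "LRM", "\u200f": "RLM",
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
}


def process(text, mode):
    """Return (visible text, effects) for the given mode.

    effects is a list of (n, label) pairs, where n is the number of
    ordinary characters seen before the control character.
    """
    visible = []
    effects = []
    plain_count = 0
    for ch in text:
        label = CONTROLS.get(ch)
        if label is None:
            visible.append(ch)
            plain_count += 1
            continue
        if mode is Mode.IGNORE:
            continue
        effects.append((plain_count, label))
        if mode is Mode.ACT_AND_SHOW:
            visible.append("[" + label + "]")
    return "".join(visible), effects
```

For example, `process("a\u200db", Mode.ACT_AND_HIDE)` keeps the record of the ZWJ while the visible text is just `ab`.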
On Sat, 16 Aug 2008 01:53:06 -0600, Hans Hagen wrote:
actually we need:
- ignore them (like in verbatim)
Eventually we want to be able to show them in verbatim also (provided the font has them). Indeed, I suggest that -- given an appropriate teletype font -- the default for _verbatim text_ should be to _show_ the control chars.
- act upon them
and
- show them (might somehow interfere with other things)
Showing the control chars in typeset text -- non-verbatim -- should be rare; it is more appropriate for verbatim.
- hide them
I suggest that the default for _typeset text_ should definitely be to _hide_ the control chars.
if i'm right, when bidi is turned on, those chars get processed and then discarded from the node list, so some more than zwj and zwnj is handled,
It appears to me that zwj and zwnj etc. should be invisible in typeset-text output -- as explained above, but should still be encoded in the output pdf. Think pdf-text extraction, converting between Arabic and Farsi typesetting conventions, etc.
and of course others need to be handled as well
Even lsep's and psep's should be present in the output pdf (eg \par => psep). Will make text extraction much more useful, etc.

Best wishes
Idris

--
Professor Idris Samawi Hamid, Editor-in-Chief
International Journal of Shi`i Studies
Department of Philosophy
Colorado State University
Fort Collins, CO 80523
Idris Samawi Hamid ادريس سماوي حامد wrote:
It appears to me that zwj and zwnj etc. should be invisible in typeset-text output -- as explained above, but should still be encoded in the output pdf. Think pdf-text extraction, converting between Arabic and Farsi typesetting conventions, etc.
we can do that later (we can use an attribute to keep track of preceding/following special thingies and inject them in the output later on)
and of course others need to be handled as well
Even lsep's and psep's should be present in the output pdf (eg \par => psep). Will make text extraction much more useful, etc.
rather useless in pdf; at some point i might add proper structure to the pdf output but it has a rather low priority (never needed it)

Hans
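The attribute idea mentioned above (keep track of the special characters next to their neighbours and inject them again later) can be illustrated with a small Python model. It is only a sketch of the bookkeeping, assuming a side table keyed by position plays the role of the attribute; it says nothing about how ConTeXt or the PDF backend would actually store or emit the characters. `hide_controls` and `reinject` are hypothetical names.

```python
CONTROLS = frozenset(
    "\u200c\u200d\u200e\u200f\u202a\u202b\u202c\u202d\u202e"
)


def hide_controls(text):
    """Split text into visible characters and a table of hidden controls.

    hidden maps an index into the visible string to the control
    characters that stood directly before that visible character;
    trailing controls are stored under the index len(visible).
    """
    visible = []
    hidden = {}
    for ch in text:
        if ch in CONTROLS:
            key = len(visible)
            hidden[key] = hidden.get(key, "") + ch
        else:
            visible.append(ch)
    return "".join(visible), hidden


def reinject(visible, hidden):
    """Rebuild the original text, e.g. for text extraction."""
    out = []
    for i, ch in enumerate(visible):
        out.append(hidden.get(i, ""))
        out.append(ch)
    out.append(hidden.get(len(visible), ""))
    return "".join(out)
```

With this bookkeeping the typeset (visible) run carries no control characters, while `reinject` recovers the original string, ZWNJ and all, which is the property asked for with text extraction in mind.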
participants (3)
- Hans Hagen
- Idris Samawi Hamid ادريس سماوي حامد
- Khaled Hosny