modifying URL wrapping rules
Hello, I would like to modify the rules for wrapping URLs (aka "hyphenating" URLs, but generally without inserting hyphens) to conform more closely to the Chicago Manual of Style, 15th ed. I've read on this list that the code for doing this is defined in the \hyphenatedurl macro. There are several files with the same base name in contextminimal\texmf-context\tex\context\base\: lang-url.mkii, lang-url.mkiv, lang-url.lua, lang-url.tex
From what I can tell, the .tex file loads one of the other three: \loadmarkfile{lang-url}
I'm not sure which one it loads -- the lua, mkii, or mkiv. All three seem to have the rules for url hyphenation encoded in them. Is it a matter of which engine I'm using, e.g. luatex?

Our project has a requirement of using Xetex, so I have to stick with that. Does that mean lang-url doesn't work at all?

We also have users using ConTeXt Minimal as well as ConTeXt from the TeXLive 2008 distribution. I want to do things in a way that will work in both.

I would be happy to put in some extra effort to make the result generally available to others who want to follow the Chicago style for wrapping long URLs.

Thanks for any hints...
Lars
From what I can tell, the .tex file loads one of the other three: \loadmarkfile{lang-url}
\loadmarkfile loads either lang-url.mkii or lang-url.mkiv, depending on the ConTeXt version you're running (MkII / MkIV). In Mark IV, the Lua code is then put in lang-url.lua, which is input by lang-url.mkiv (you can see "\registerctxluafile{lang-url}{1.001}" near the beginning of the latter). This architecture enables you to reuse the Lua code in completely different environments (for example, in a pure Lua script).
Our project has a requirement of using Xetex, so I have to stick with that. Does that mean lang-url doesn't work at all?
ConTeXt on XeTeX is considered Mark II as far as the mark business goes (it doesn't know about Lua), so you have access to the exact same code as with pdfTeX; in this case, lang-url.mkii will be loaded. But if you know that all your users will be using XeTeX, you don't really need to worry about the \loadmarkfile mechanism; it is there to accommodate different engines.
We also have users using ConTeXt Minimal as well as ConTeXt from the TeXLive 2008 distribution.
The particular distribution one uses shouldn't be a problem at all for implementing hyphenation rules.
I want to do things in a way that will work in both. I would be happy to put in some extra effort to make the result generally available to others who want to follow the Chicago style for wrapping long URLs.
This will certainly be most appreciated. Arthur
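The engine-dependent loading described above can be pictured roughly as follows. This is only a sketch: \myloadmarkfile is a hypothetical name, and testing for \directlua is an assumed way of detecting LuaTeX; it is not a copy of ConTeXt's actual \loadmarkfile.

```tex
% Hypothetical sketch, not ConTeXt's real \loadmarkfile.
% Assumption: \directlua is defined only when running under LuaTeX (MkIV).
\def\myloadmarkfile#1%
  {\ifx\directlua\undefined
     \def\next{\input #1.mkii }% pdfTeX or XeTeX: read the MkII code
   \else
     \def\next{\input #1.mkiv }% LuaTeX: read the MkIV code
   \fi
   \next}

% \myloadmarkfile{lang-url} % under XeTeX this would read lang-url.mkii
```

Under this picture a XeTeX-only project has just one file to care about, lang-url.mkii.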
On 11/18/2008 3:44 PM, Arthur Reutenauer wrote:
From what I can tell, the .tex file loads one of the other three: \loadmarkfile{lang-url}
\loadmarkfile loads either lang-url.mkii or lang-url.mkiv, depending on the ConTeXt version you're running (MkII / MkIV). In Mark IV, the Lua code is then put in lang-url.lua, which is input by lang-url.mkiv (you can see "\registerctxluafile{lang-url}{1.001}" near the beginning of the latter). This architecture enables you to reuse the Lua code in completely different environments (for example, in a pure Lua script).
Our project has a requirement of using Xetex, so I have to stick with that. Does that mean lang-url doesn't work at all?
ConTeXt on XeTeX is considered Mark II as far as the mark business goes (it doesn't know about Lua), so you have access to the exact same code as with pdfTeX; in this case, lang-url.mkii will be loaded.
Thanks for the explanation... this is helpful. So it sounds like I should definitely modify the lang-url.mkii file.
But if you know that all your users will be using XeTeX, you don't really need to worry about the \loadmarkfile mechanism; it is there to accommodate different engines.
OK... but I'm not sure what I would do differently if I'm not worrying about the \loadmarkfile mechanism... Still modify the lang-url.mkii file? Given that I'm willing to put in a little extra effort to make the result available to a wider set of users, should I still modify lang-url.mkii?
We also have users using ConTeXt Minimal as well as ConTeXt from the TeXLive 2008 distribution.
The particular distribution one uses shouldn't be a problem at all for implementing hyphenation rules.
Good to know, thanks.
I want to do things in a way that will work in both. I would be happy to put in some extra effort to make the result generally available to others who want to follow the Chicago style for wrapping long URLs.
This will certainly be most appreciated.
I will probably need more help in order to know how to do this. Once I've finished doing it for us, in .mkii, I'll ask again on this list. Thanks again, Lars
OK... but I'm not sure what I would do differently if I'm not worrying about the \loadmarkfile mechanism... Still modify the lang-url.mkii file?
Sure. I simply meant that you only needed to produce a single file, your modified lang-url.mkii, whereas if you wanted to develop for both MkII and MkIV, you would have needed a more involved structure. Anyway ... Arthur
OK, I've taken a stab at it. Here is the main code now in the modified lang-url.mkii. For brevity in this email I've just omitted the lines that I actually commented out in the file, namely characters that Chicago style does not say you can line-break URLs on.

\def\sethyphenatedurlnormal#1{\expandafter\chardef\csname url @ #1\endcsname\zerocount}
\def\sethyphenatedurlbefore#1{\expandafter\chardef\csname url @ #1\endcsname\plusone }
\def\sethyphenatedurlafter #1{\expandafter\chardef\csname url @ #1\endcsname\plustwo }

% Chicago manual of style rules:
% Break URLs after: / or // (I don't know how to implement // so will be content with / for now.
% To do: prevent breaking in middle of double slash //.)
% Break URLs before: ~ . , - _ ? # %
% Break URLs before or after: = & (I don't know how to implement 'before or after' so will
% be content with breaking 'before' these characters for now).

\sethyphenatedurlbefore \letterhash
\sethyphenatedurlbefore \letterpercent
\sethyphenatedurlbefore \letterampersand
\sethyphenatedurlbefore ,
\sethyphenatedurlbefore -
\sethyphenatedurlbefore .
\sethyphenatedurlbefore =
\sethyphenatedurlbefore ?
\sethyphenatedurlbefore _
\sethyphenatedurlbefore \lettertilde
\sethyphenatedurlafter / % was \sethyphenatedurlbefore /

However, I have a few unsolved problems here.

1) I don't see a way, with the '\sethyphenatedurlbefore' or 'after' mechanism, to tell it not to break a URL between two slashes, as in "http://". At first I thought that since our text only had a few URLs, we'd likely never care. But ... you guessed it. One URL got broken between the slashes: "http:/ /www.sil.org/..."

So I tried using the base tex hyphenation mechanism to inhibit breaking there: I changed the document from

\hyphenatedurl{http://www.sil.org/...}

to

\hyphenatedurl{\hyphenation{http://}www.sil.org/...}

but that gave a stack overflow. Then I tried

\hyphenation{http://}\hyphenatedurl{www.sil.org/...}

but got this error:

! Not a letter.
<inserted text> http: //
\hyphenation ...malhyphenation {\the \scratchtoks }\endgroup
<argument> ... Linguistics. \hyphenation {http://} \hyphenatedurl {www.sil.or...
\BE #1->\startmainexdent {#1 }\stopmainexdent
l.317 ...l.org/silesr/abstract.asp?ref=2007-015}.}

I'm kind of shooting in the dark there, so maybe somebody who knows TeX can help me out.

2) Even though I have "\sethyphenatedurlafter /" instead of "\sethyphenatedurlbefore /", there are four cases where a URL is broken before a slash, e.g.: http://www.sil.org/.../009 /YAMBASSA.html. and no cases where a URL is broken after a slash (except when it's also before a slash -- see 1).

I wonder if my modifications are actually taking effect? Do I need to compile the changes to the .mkii file or something? I tried texexec.bat --make --all, but that didn't seem to change the outcome.

3) Conversely, even though I have "\sethyphenatedurlbefore -" and not "\sethyphenatedurlafter -", there is a case where a URL is broken after a hyphen (a hyphen that was already present in the URL): http://www..../Niger- Congo/... and no case where a URL is broken before a hyphen. Note that the "\sethyphenatedurlbefore -" setting is unchanged from the original lang-url.mkii, so this is not an issue of needing to recompile.

Maybe the general tex hyphenation mechanism is operating here, in spite of the URL breaking settings. How do I override that (only for the URL)?

4) In one case, a URL is broken over the end of a column. That's ok, but it would be nice to be able to strongly discourage that from happening at the end of a page. I'm told that's a difficult problem to solve. It's not mandatory for us at this point but if anyone has a solution I'd like to hear about it.

Thanks,
Lars
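On the "Not a letter" error: \hyphenation only accepts words made of letters (characters with a nonzero \lccode) and hyphens, so it cannot hold a string like "http://". A minimal plain TeX sketch of the difference, independent of ConTeXt, with the failing line left commented out:

```tex
% Plain TeX sketch, independent of ConTeXt.
\hyphenation{data-base}   % accepted: letters, with "-" marking the allowed break
% \hyphenation{http://}   % rejected with "! Not a letter." in plain TeX
\message{lccode of slash: \the\lccode`\/}% report the \lccode of "/" on the terminal
\bye
```

This suggests that \hyphenation is the wrong tool for protecting "http://", whatever the URL settings are.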
On 11/19/2008 2:35 PM, Lars Huttar wrote:
However, I have a few unsolved problems here.
1) I don't see a way, with the '\sethyphenatedurlbefore' or 'after' mechanism, to tell it not to break a URL between two slashes, as in "http://". At first I thought that since our text only had a few URLs, we'd likely never care. But ... you guessed it. One URL got broken between the slashes: "http:/ /www.sil.org/..."
I found a way to deal with this... Based on a tip from http://xpt.sourceforge.net/techdocs/language/latex/latex03-LaTexUsage/ar01s0..., I used {\lefthyphenmin=64 http://}\hyphenatedurl{www.sil.org/...} It seems to work in practice -- hyphenation and breaking are disabled for the "http://" chunk. And hyphenation seems to successfully resume afterwards. This also fixes the problem of a URL breaking before the "//". The other problems are still outstanding though (wanting to break a URL after a slash, not before; and before a hyphen, not after). Thanks for any ideas... Lars
cleaner than the lefthyphenmin hackery ...

{\hbox{http://}\hyphenatedurl{www.sil.org/...

in context mkiv i can provide a hyphenator based on the url syntax (after all, mkiv already has an analyser for urls)

-----------------------------------------------------------------
Hans Hagen | PRAGMA ADE
Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com | www.pragma-pod.nl
-----------------------------------------------------------------
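A small sketch of the \hbox approach wrapped in a reusable macro. The name \httpurl and the example address are made up for illustration; \dontleavehmode is added on the assumption that the URL may also start a paragraph, where a bare \hbox would not begin one.

```tex
% Sketch for ConTeXt MkII; \httpurl is a hypothetical helper, not part of ConTeXt.
% The \hbox keeps "http://" in one piece; only the remainder is handed
% to \hyphenatedurl and may be broken across lines.
\def\httpurl#1{\dontleavehmode\hbox{http://}\hyphenatedurl{#1}}

\starttext
See \httpurl{www.example.org/some/fairly/long/path/to/a-document.html} for details.
\stoptext
```

The scheme is typed once in the macro, so the document source only carries the part of the address that is allowed to break.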
On 11/20/2008 2:29 AM, Hans Hagen wrote:
cleaner than the lefthyphenmin hackery .,..
{\hbox{http://}\hyphenatedurl{www.sil.org/...
Thank you!
in context mkiv i can provide a hyphenater based on the url syntax (after all, mkiv already has an analyser for urls)
That would be great, but as I understand it, using mkiv would require us to move to a beta version of ConTeXt... and we're right at the end (we hope!) of a production cycle, where moving to any new version (whether beta or not) could cost us a lot of time if any behavior changes. So while I would be glad to see a hyphenator based on URL syntax, I don't think a mkiv version will help us this time. Lars
On 11/19/2008 2:35 PM, Lars Huttar wrote:
2) Even though I have "\sethyphenatedurlafter /" instead of "\sethyphenatedurlbefore /", there are four cases where a URL is broken before a slash, e.g.: http://www.sil.org/.../009 /YAMBASSA.html. and no cases where a URL is broken after a slash (except when it's also before a slash -- see 1).
I wonder if my modifications are actually taking effect? Do I need to compile the changes to the .mkii file or something? I tried texexec.bat --make --all, but that didn't seem to change the outcome.
Can someone tell me if there's a compile command necessary for mkii?
3) Conversely, even though I have "\sethyphenatedurlbefore -" and not "\sethyphenatedurlafter -", there is a case where a URL is broken after a hyphen (a hyphen that was already present in the URL): http://www..../Niger- Congo/... and no case where a URL is broken before a hyphen. Note that the "\sethyphenatedurlbefore -" setting is unchanged from the original lang-url.mkii, so this is not an issue of needing to recompile.
Maybe the general tex hyphenation mechanism is operating here, in spite of the URL breaking settings. How do I override that (only for the URL)?
Maybe it would help if someone could explain to me what 'normal' means in lang-url.mkii:

\def\dohyphenatedurlnormal#1{\char#1\relax}%
\def\dohyphenatedurlafter #1{\char#1\discretionary{}{\hyphenatedurlseparator}{}}%
\def\dohyphenatedurlbefore#1{\discretionary{\hyphenatedurlseparator}{}{}\char#1\relax}%

% 0=normal 1=before 2=after

\def\sethyphenatedurlnormal#1{\expandafter\chardef\csname url @ #1\endcsname\zerocount}
\def\sethyphenatedurlbefore#1{\expandafter\chardef\csname url @ #1\endcsname\plusone }
\def\sethyphenatedurlafter #1{\expandafter\chardef\csname url @ #1\endcsname\plustwo }

It looks like 'normal' means don't put a discretionary hyphenatedurlseparator before/after the character. Which would mean either

(a) the url cannot be separated there (unless an adjacent character has hyphenatedurlbefore/after specified on it); or
(b) the url will follow the same hyphenation rules as normal text (no special url-related rules).

Can anyone tell me which it is?

The definition of hyphenatedurl is:

\unexpanded
\def\hyphenatedurl#1%
  {\dontleavehmode
   \begingroup
   \the\everyhyphenatedurl
   \edef\ascii{#1}%
   \expanded{\handletokens{\detokenize\expandafter{\ascii}}}\with\dohyphenatedurl
   \endgroup}

and the definition of \dontleavehmode is in syst-ext.tex with some comments:

%D \macros
%D   {dontleavehmode}
%D
%D Sometimes when we enter a paragraph with some command, the
%D first token gets the whole first line. We can prevent this
%D by saying:
%D
%D \starttyping
%D \dontleavehmode
%D \stoptyping
...
\unexpanded \def\dontleavehmode
  {\ifhmode\else \ifmmode\else
     \setbox\@@dlhbox\hbox{\mathsurround\zeropoint\everymath\emptytoks$ $}\unhbox\@@dlhbox
   \fi \fi}
...
%D But, if you run a recent version of \TEX, we can use the new
%D primitive:

\ifx\normalquitvmode\undefined \else
  \let\dontleavehmode\normalquitvmode
\fi

I am running Xetex, FWIW.
"This is XeTeX, Version 3.1415926-2.2-0.999.6 (Web2C 7.5.7)"

The above makes me think that "dontleavehmode" should prevent any 'hyphenation' except for the types explicitly allowed in lang-url.mkii via \sethyphenatedurlafter/before/normal. Yet that isn't happening... it's breaking before slash instead of after, and after hyphen instead of before.

I wondered briefly whether I had misinterpreted (swapped) the semantics of \sethyphenatedurlafter and \sethyphenatedurlbefore. But no, "\sethyphenatedurlbefore ." is working as expected: URLs break before a period.

So I'm just puzzled. Thanks for any help...
Lars
Lars Huttar wrote:
On 11/19/2008 2:35 PM, Lars Huttar wrote:
2) Even though I have "\sethyphenatedurlafter /" instead of "\sethyphenatedurlbefore /", there are four cases where a URL is broken before a slash, e.g.: http://www.sil.org/.../009 /YAMBASSA.html. and no cases where a URL is broken after a slash (except when it's also before a slash -- see 1).
I wonder if my modifications are actually taking effect? Do I need to compile the changes to the .mkii file or something? I tried texexec.bat --make --all, but that didn't seem to change the outcome.
Can someone tell me if there's a compile command necessary for mkii?
texexec --make

however, i strongly advise you to put such patches or tuning in your document style because otherwise you lose them when you update
\ifx\normalquitvmode\undefined \else \let\dontleavehmode\normalquitvmode \fi
I am running Xetex, FWIW. "This is XeTeX, Version 3.1415926-2.2-0.999.6 (Web2C 7.5.7)"
The above makes me think that "dontleavehmode" should prevent any 'hyphenation' except for the types explicitly allowed in lang-url.mkii via \sethyphenatedurlafter/before/normal.
just leave dontleavehmode untouched; its definition adapts itself to the engine

i leave it to others to react on the rest of your mail (some users have been tuning the mechanism too)

Hans
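On the open question about 'normal', 'before' and 'after': the definitions quoted earlier in the thread rely on TeX's \discretionary{pre-break}{post-break}{no-break} primitive. The plain TeX sketch below uses two hypothetical helpers, \breakafter and \breakbefore, with all three parts left empty, purely to show the two placements. It does not explain why the MkII code behaves as reported.

```tex
% Plain TeX sketch; \breakafter and \breakbefore are hypothetical names.
% \discretionary{pre-break}{post-break}{no-break}
\def\breakafter #1{#1\discretionary{}{}{}}% the line may end right after #1
\def\breakbefore#1{\discretionary{}{}{}#1}% the next line may start with #1

\hsize=4cm \parindent=0pt \tolerance=10000
www\breakbefore.example\breakbefore.org\breakafter/some\breakafter/long\breakafter/path
\bye
```

Characters typed without either helper get no extra break point from these macros. Separately, TeX itself appends an empty discretionary after a character equal to the font's \hyphenchar (normally "-"); that may be related to the break after "Niger-" reported above, though nothing in this thread confirms it.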
On 11/21/2008 2:22 AM, Hans Hagen wrote:
Lars Huttar wrote:
On 11/19/2008 2:35 PM, Lars Huttar wrote:
2) Even though I have "\sethyphenatedurlafter /" instead of "\sethyphenatedurlbefore /", there are four cases where a URL is broken before a slash, e.g.: http://www.sil.org/.../009 /YAMBASSA.html. and no cases where a URL is broken after a slash (except when it's also before a slash -- see 1).
I wonder if my modifications are actually taking effect? Do I need to compile the changes to the .mkii file or something? I tried texexec.bat --make --all, but that didn't seem to change the outcome. Can someone tell me if there's a compile command necessary for mkii?
texexec --make
Thanks for your reply... OK, I did that. Behavior with respect to the two outstanding problems (breaking after hyphen and before slash) has not changed. But it's good to know that it's not due to some dependencies not being updated.
however, i strongly advise you to put such patches or tuning in your document style because otherwise you lose them when you update
Understood. A colleague tells me that if I put the \sethyphenatedurlbefore/after settings in the .tex document they will override the settings in lang-url.mkii, which is very good news. So if lang-url.mkii says \sethyphenatedurlbefore \letterbar I can comment that line out in lang-url.mkii; but if I don't want to modify lang-url.mkii, can I accomplish the same thing by putting \sethyphenatedurlnormal \letterbar in my .tex file?
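A sketch of the document-level approach, assuming (as the \def lines quoted in this thread suggest) that each \sethyphenatedurl... call simply redefines the per-character setting, so that a later call in the document wins over the one made in lang-url.mkii. The URL is a made-up example.

```tex
% Sketch for ConTeXt MkII; goes in the document or its style file,
% leaving lang-url.mkii untouched.
\sethyphenatedurlafter  /          % Chicago: break after a slash
\sethyphenatedurlbefore =          % break before "="
\sethyphenatedurlnormal \letterbar % undo a "before" setting made in lang-url.mkii

\starttext
\hyphenatedurl{www.example.org/some/fairly/long/path?key=value}
\stoptext
```

Whether the 'after' setting then produces the expected breaks is exactly what remains unresolved in this thread; the sketch only shows where such settings could live so that they survive an update.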
\ifx\normalquitvmode\undefined \else \let\dontleavehmode\normalquitvmode \fi
I am running Xetex, FWIW. "This is XeTeX, Version 3.1415926-2.2-0.999.6 (Web2C 7.5.7)"
The above makes me think that "dontleavehmode" should prevent any 'hyphenation' except for the types explicitly allowed in lang-url.mkii via \sethyphenatedurlafter/before/normal.
just leave dontleavehmode untouched; its definition adapts itself to the engine
i leave it to others to react on the rest of your mail (some users have been tuning the mechanism too)
I would be very glad to hear from said users who have had any success. I'm emailing Steffen and Aditya now.

Actually, just now looking at lang-url.tex I see the comments

%D For those who want to put full \URL's in a text, we offer
%D
%D \startbuffer
%D \hyphenatedurl{http://optimist.optimist/optimist/optimist.optimist#optimist}
%D \stopbuffer
%D
%D \typebuffer

which makes me wonder if I need to put the \startbuffer, \stopbuffer, \typebuffer commands in my tex code. But I think maybe it's markup for generating documentation. If so, I wonder why I can't find such generated documentation on \hyphenatedurl.

Lars
participants (3)
- Arthur Reutenauer
- Hans Hagen
- Lars Huttar