encoding code issue

Yue Wang

19 Jun 2009

3:38 p.m.

hi Taco:

- when I change something in fontforge/Unicode/* and run build.sh --make, it recompiles a lot of luatex stuff. That is not necessary. please fix that if you can.

- cjk.c can be removed completely. Chinese TTF/OTF fonts are arranged in unicode order (maybe with other encoding charmaps provided), so there is no need to do font re-encoding. (I think Japanese and Korean fonts are too.)

- source/texk/web2c/luatexdir/luafontloader/fontforge/Unicode/backtrns.c is not needed, please remove that.

- maybe it is not necessary to extend ctype to utype: in ConTeXt we have char-def.lua, which gives very detailed information. After browsing the source code I think there is almost no dependency on the unicode range for isxxxx and tolower/toupper. Most uses are for conversion of file names on the local filesystem and so on. after removing utype.[ch], only two APIs, ishexdigit and iscombinedchar, are missing, but these two are very easy to implement, e.g. hexdigit = 0-9, a-f.

- unialt.c is only needed for autohint.c. since hinting has nothing to do with typesetting, perhaps these two files can go too...

After removing these files I ended up building a 3.9M luatex on Mac OS X. Maybe the Linux binary can be even smaller.

of course, the above thoughts have not been thoroughly tested.

Yue Wang
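The remark that ishexdigit is easy to replace can be sketched in a few lines of C. This is only an illustration under the assumption that ASCII input is all that matters; the name `ascii_ishexdigit` is hypothetical and is not taken from LuaTeX or FontForge, and accepting upper-case A-F goes beyond the "0-9, a-f" given in the message.

```c
#include <stdbool.h>

/* Hypothetical ASCII-only stand-in for the table-driven ishexdigit().
 * Accepting 'A'..'F' is an assumption; the message only lists 0-9, a-f. */
static bool ascii_ishexdigit(int ch)
{
    return (ch >= '0' && ch <= '9') ||
           (ch >= 'a' && ch <= 'f') ||
           (ch >= 'A' && ch <= 'F');
}
```

A check like this needs no lookup table, which is where the size saving would come from. iscombinedchar is a different matter, since combining characters lie outside ASCII, so it is not sketched here.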


Yue Wang

19 Jun

4:09 p.m.

After copying the stripped-down luatex binary into the context tree and testing a few documents, I am quite happy with my change.

btw, taco, some text blocks are placed in the wrong places in the pdf using the unchanged version. known issue?

On Fri, Jun 19, 2009 at 9:38 PM, Yue Wang wrote:

...

hi Taco:

- when I change something in fontforge/Unicode/* and run build.sh --make, it recompiles a lot of luatex stuff. That is not necessary. please fix that if you can.

- cjk.c can be removed completely. Chinese TTF/OTF fonts are arranged in unicode order (maybe with other encoding charmaps provided), so there is no need to do font re-encoding. (I think Japanese and Korean fonts are too.)

- source/texk/web2c/luatexdir/luafontloader/fontforge/Unicode/backtrns.c is not needed, please remove that.

- maybe it is not necessary to extend ctype to utype: in ConTeXt we have char-def.lua, which gives very detailed information. After browsing the source code I think there is almost no dependency on the unicode range for isxxxx and tolower/toupper. Most uses are for conversion of file names on the local filesystem and so on. after removing utype.[ch], only two APIs, ishexdigit and iscombinedchar, are missing, but these two are very easy to implement, e.g. hexdigit = 0-9, a-f.

- unialt.c is only needed for autohint.c. since hinting has nothing to do with typesetting, perhaps these two files can go too...

After removing these files I ended up building a 3.9M luatex on Mac OS X. Maybe the Linux binary can be even smaller.

of course, the above thoughts have not been thoroughly tested.

Yue Wang

Taco Hoekwater

4:25 p.m.

Yue Wang wrote:

...

After copying the stripped-down luatex binary into the context tree and testing a few documents, I am quite happy with my change.

Taco Hoekwater

5:16 p.m.

Hi Yue Wang,

Yue Wang wrote:

...

hi Taco:

- when I change something in fontforge/Unicode/* and run build.sh --make, it recompiles a lot of luatex stuff. That is not necessary. please fix that if you can.

Sorry, I can't fix that (at least not right now). The dependencies are auto-generated and luatex's C library as a whole depends on libff.a.

...

- cjk.c can be removed completely. Chinese TTF/OTF Fonts are arranged in unicode order (maybe with other encoding charmap provided). So no need to do font re-encoding. (I think Japanese fonts and Korean fonts do too)

I have not actually removed the source code (just in case there is a problem discovered later) but I have completely hidden it from the compiler so that it is no longer compiled in the binary. So, Yanrui Li (and maybe for you as well, just to verify I did not mess up anything): if you want to run your tests, you only have to grab the current trunk and recompile. Probably the most important thing to test is whether searching in Acroread still works as it should.

...

- source/texk/web2c/luatexdir/luafontloader/fontforge/Unicode/backtrns.c is not needed, please remove that.

Done. I also removed the dump.c file that is used to generate some of these support data files.

...

- maybe it is not necessary to extend ctype to utype: in ConTeXt we have char-def.lua which gives very detailed information.

I have decided to keep that code: at some time in the future I want to expose the fontforge Unicode library to the lua scripting language. The current unicode library (slunicode) is minimalistic, already outdated, and hard to keep up-to-date, so it makes sense to switch to the much cleaner version from Fontforge at some point (not too soon though, it has a rather low priority).

...

- unialt.c is only needed for autohint.c. since hinting has nothing to do with typesetting, perhaps these two files can go too...

Autohint.c (and tocff.c) is really needed: for some odd legacy fonts, I generate a CFF font on the fly. But unialt.c was used only for the FindBlues() function, and for that, the test for unicode alternates was definitely overkill, so unialt.c is gone now.

...

After removing these files I ended up building a 3.9M luatex on Mac OS X. Maybe Linux binary can be even smaller.

The size of my cross-compiled windows binary dropped by some 750K thanks to all this. I can't easily check linux binary sizes because I always compile with debugging symbols on and optimization off (except for releases).

...

of course, the above thoughts have not been thoroughly tested.

Best wishes, and thanks for digging up the information,
Taco

Taco Hoekwater

5:19 p.m.

Taco Hoekwater wrote:

...

if you want to run your tests, you only have to grab the current trunk and recompile. Probably the most important thing to test is whether searching in Acroread still works as it should.

Just in case, for those of you who use context: don't forget to empty the font cache first. All this stuff only applies to the initial font loading stage.

Best wishes,
Taco

Yue Wang

20 Jun

5:21 a.m.

Hi, Taco:

With more extensive tests on Chinese and Korean fonts (40 or so fonts) by Li Yanrui, Wang Longming, and me, we encountered no font loading/embedding problems. Text can still be copied and pasted from the PDFs correctly. So this is quite a good change.

No idea about Japanese fonts (I don't speak Japanese), but KozMinPr6N-Regular.otf (the only Japanese font Li Yanrui has) works. I think Japanese users on the dev-luatex list can tell more about this change.

Yue Wang

On Fri, Jun 19, 2009 at 11:19 PM, Taco Hoekwater wrote:

...

Taco Hoekwater wrote:

...
if you want to run your tests, you only have to grab the current trunk and recompile. Probably the most important thing to test is whether searching in Acroread still works as it should.

Just in case, for those of you who use context: don't forget to empty the font cache first. All this stuff only applies to the initial font loading stage.

Best wishes, Taco

Yue Wang

19 Jun

6:20 p.m.

...

I have decided to keep that code: at some time in the future I want to expose the fontforge Unicode library to the lua scripting language. The current unicode library (slunicode) is minimalistic, already outdated, and hard to keep up-to-date, so it makes sense to switch to the much cleaner version from Fontforge at some point (not too soon though, it has a rather low priority).

OK. I understand. but can you put tolower into #ifdef too? tolower is only needed in macbinary.c, for a filename-related call. It does not need to cover the full unicode range.

Yue Wang

20 Jun

2:32 a.m.

Hi, Taco:

On Sat, Jun 20, 2009 at 12:20 AM, Yue Wang wrote:

...

...
I have decided to keep that code: at some time in the future I want to expose the fontforge Unicode library to the lua scripting language. The current unicode library (slunicode) is minimalistic, already outdated, and hard to keep up-to-date, so it makes sense to switch to the much cleaner version from Fontforge at some point (not too soon though, it has a rather low priority).

OK. I understand. but can you put tolower into #ifdef too? tolower is only needed in macbinary.c, for a filename-related call. It does not need to cover the full unicode range.

here is the patch.

Index: source/texk/web2c/luatexdir/luafontloader/fontforge/Unicode/utype.c
===================================================================
--- source/texk/web2c/luatexdir/luafontloader/fontforge/Unicode/utype.c (revision 2540)
+++ source/texk/web2c/luatexdir/luafontloader/fontforge/Unicode/utype.c (working copy)
@@ -1,5 +1,6 @@
 #include "utype.h"
 
+#if 0
 const unsigned short ____tolower[]= { 0,
   0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
   0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f,
@@ -8195,7 +8196,6 @@
   0x0000, 0xfff9, 0xfffa, 0xfffb, 0xfffc, 0xfffd, 0x0000, 0x0000
 };
 
-#if 0
 const unsigned short ____toupper[] = { 0,
   0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
   0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f,
Index: source/texk/web2c/luatexdir/luafontloader/fontforge/fontforge/macbinary.c
===================================================================
--- source/texk/web2c/luatexdir/luafontloader/fontforge/fontforge/macbinary.c (revision 2540)
+++ source/texk/web2c/luatexdir/luafontloader/fontforge/fontforge/macbinary.c (working copy)
@@ -1155,7 +1155,7 @@
     spt = strrchr(buffer,'/')+1;
     for ( pt=spt; *pt; ++pt )
         if ( isupper( *pt ))
-            *pt = tolower( *pt );
+            *pt = *pt - 'A' + 'a';
     dpt = strchr(spt,'.');
     if ( dpt==NULL ) dpt = spt+strlen(spt);
     if ( dpt-spt>8 || strlen(dpt)>4 ) {
Index: source/texk/web2c/luatexdir/luafontloader/fontforge/inc/utype.h
===================================================================
--- source/texk/web2c/luatexdir/luafontloader/fontforge/inc/utype.h (revision 2540)
+++ source/texk/web2c/luatexdir/luafontloader/fontforge/inc/utype.h (working copy)
@@ -47,14 +47,14 @@
 #define ____TOUCHING 0x100000
 #define ____COMBININGPOSMASK 0x1fff00
 
+#if 0
 extern const unsigned short ____tolower[];
-#if 0
 extern const unsigned short ____toupper[];
 #endif
 extern const unsigned int ____utype[];
 
+#if 0
 #define tolower(ch) (____tolower[(ch)+1])
-#if 0
 #define toupper(ch) (____toupper[(ch)+1])
 #endif
 #define islower(ch) (____utype[(ch)+1]&____L)

(and personally I think ____utype can go too... such a unicode library would be very easy to implement in pure Lua.)

Yue Wang

Taco Hoekwater

9:15 a.m.

Yue Wang wrote:

...

...
...
OK. I understand. but can you put tolower into #ifdef too? tolower is only needed for macbinary.c for a filename related call.

It is also used by the strmatch() function collection in Unicode/char.c, which are themselves used in various places all over the source.

I have applied the patch for now (after checking all actual usages of those functions to make sure they do not need unicode) but I hope you see why this gets problematic? I will have to revert it at the first instance of actual unicode strings that need to be compared.

Best wishes,
Taco
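To make the concern concrete, here is a rough C sketch of a case-insensitive comparison built on an ASCII-only lower-casing step. Both `ascii_tolower` and `ascii_strmatch` are hypothetical names for illustration; this is not the actual code from Unicode/char.c.

```c
/* ASCII-only lower-casing: anything outside 'A'..'Z' is returned unchanged. */
static int ascii_tolower(int ch)
{
    return (ch >= 'A' && ch <= 'Z') ? ch + ('a' - 'A') : ch;
}

/* Hypothetical case-insensitive compare in the spirit of strmatch().
 * Returns 0 when the strings are equal ignoring ASCII case. */
static int ascii_strmatch(const char *a, const char *b)
{
    for (;;) {
        int ca = ascii_tolower((unsigned char)*a++);
        int cb = ascii_tolower((unsigned char)*b++);
        if (ca != cb || ca == '\0')
            return ca - cb;
    }
}
```

With this sketch, "UTF-8" and "utf-8" compare equal, but the UTF-8 byte sequences for "É" and "é" do not, which is the kind of case where an ASCII-only tolower stops being enough.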

Yue Wang

9:49 a.m.

On Sat, Jun 20, 2009 at 3:15 PM, Taco Hoekwater wrote:

...

Yue Wang wrote:

...
...
...
OK. I understand. but can you put tolower into #ifdef too? tolower is only needed for macbinary.c for a filename related call.

It is also used by the strmatch() function collection in Unicode/char.c, which themselves are used in various places all over the source.

I have applied the patch for now (after checking all actual usages of those functions to make sure they do not need unicode) but I hope you see why this gets problematic? I will have to revert it at the first instance of actual unicode strings that need to be compared.

well, that's ok. 200K of size can be ignored, since a 500G hard drive is pretty cheap now...

...

Best wishes, Taco

Khaled Hosny

10:38 a.m.

On Sat, Jun 20, 2009 at 09:15:10AM +0200, Taco Hoekwater wrote:

...

Yue Wang wrote:

...
...
...
OK. I understand. but can you put tolower into #ifdef too? tolower is only needed for macbinary.c for a filename related call.

It is also used by the strmatch() function collection in Unicode/char.c, which themselves are used in various places all over the source.

I have applied the patch for now (after checking all actual usages of those functions to make sure they do not need unicode) but I hope you see why this gets problematic? I will have to revert it at the first instance of actual unicode strings that need to be compared.

SVN revision 2541 doesn't even work for me, I get a FontForge error whenever I run luatex, even with no files at all:

FontForge does not support your encoding (UTF-8), it will pretend the local encoding is latin1
Internal Error: I can't figure out your version of iconv(). I need a name for the UCS-4 encoding and I can't find one. Reconfigure --without-iconv. Bye.

Regards,
Khaled

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer

Yue Wang

10:56 a.m.

that's because Taco's version of string match is buggy.

On Sat, Jun 20, 2009 at 4:38 PM, Khaled Hosny wrote:

...

On Sat, Jun 20, 2009 at 09:15:10AM +0200, Taco Hoekwater wrote:

...
Yue Wang wrote:

...
...
...
OK. I understand. but can you put tolower into #ifdef too? tolower is only needed for macbinary.c for a filename related call.

It is also used by the strmatch() function collection in Unicode/char.c, which themselves are used in various places all over the source.

I have applied the patch for now (after checking all actual usages of those functions to make sure they do not need unicode) but I hope you see why this gets problematic? I will have to revert it at the first instance of actual unicode strings that need to be compared.

SVN revision 2541 doesn't even work for me, I get a FontForge error whenever I run luatex, even with no files at all:

FontForge does not support your encoding (UTF-8), it will pretend the local encoding is latin1
Internal Error: I can't figure out your version of iconv(). I need a name for the UCS-4 encoding and I can't find one. Reconfigure --without-iconv. Bye.

Regards, Khaled

-- Khaled Hosny Arabic localiser and member of Arabeyes.org team Free font developer


Yue Wang

11:04 a.m.

no, I found out that my version of tolower is buggy... a character that is already lower case cannot be lower-cased again. for macbinary.c it's ok, but not for strmatch.

On Sat, Jun 20, 2009 at 4:56 PM, Yue Wang wrote:

...

that's because Taco's version of string match is buggy.

On Sat, Jun 20, 2009 at 4:38 PM, Khaled Hosny wrote:

...
On Sat, Jun 20, 2009 at 09:15:10AM +0200, Taco Hoekwater wrote:

...
Yue Wang wrote:

...
...
...
OK. I understand. but can you put tolower into #ifdef too? tolower is only needed for macbinary.c for a filename related call.

It is also used by the strmatch() function collection in Unicode/char.c, which themselves are used in various places all over the source.

I have applied the patch for now (after checking all actual usages of those functions to make sure they do not need unicode) but I hope you see why this gets problematic? I will have to revert it at the first instance of actual unicode strings that need to be compared.

SVN revision 2541 doesn't even work for me, I get a FontForge error whenever I run luatex, even with no files at all:

FontForge does not support your encoding (UTF-8), it will pretend the local encoding is latin1
Internal Error: I can't figure out your version of iconv(). I need a name for the UCS-4 encoding and I can't find one. Reconfigure --without-iconv. Bye.

Regards, Khaled

-- Khaled Hosny Arabic localiser and member of Arabeyes.org team Free font developer

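Yue Wang's point that the arithmetic conversion only works on characters that are actually upper case can be illustrated with a short C sketch. The function names are hypothetical; the expression inside `unguarded_lower` is the one from the macbinary.c hunk of the patch.

```c
/* The bare expression from the macbinary.c hunk.
 * It is only correct when ch is in 'A'..'Z'. */
static int unguarded_lower(int ch)
{
    return ch - 'A' + 'a';
}

/* The same conversion behind a range test, as the isupper() check in
 * macbinary.c provides at the call site. */
static int guarded_lower(int ch)
{
    return (ch >= 'A' && ch <= 'Z') ? unguarded_lower(ch) : ch;
}
```

Applied to 'A' both functions give 'a'. Applied to 'a', the unguarded form gives 129, which is no longer an ASCII letter, while the guarded form leaves 'a' alone. That difference is harmless in macbinary.c, where the caller tests first, but not in a general-purpose tolower used by string matching.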

Taco Hoekwater

11:14 a.m.

Yue Wang wrote:

...

no, I found out that my version of tolower is buggy... a character that is already lower case cannot be lower-cased again. for macbinary.c it's ok, but not for strmatch.

On Sat, Jun 20, 2009 at 4:56 PM, Yue Wang wrote:

...
that's because Taco's version of string match is buggy.

On Sat, Jun 20, 2009 at 4:38 PM, Khaled Hosny wrote:

...
SVN revision 2541 doesn't even work for me, I get a FontForge error whenever I run luatex, even with no files at all:

#2542 simply reverts #2541.

Best wishes,
Taco

Yue Wang

11:47 a.m.

Hi, Taco and Khaled

...

#2542 simply reverts #2541.

first of all, you left two typos there, so it will simply not work:

what I said is
    *pt = *pt - 'A' + 'a'; // to lower
however, you wrote it like this:
    #define tolower(ch) (ch+'A'-'a')
which is a "toupper" statement.

moreover, the statement I left in macbinary.c is safe since there is an "isupper" to do the test. however, it's totally wrong for you to put the same stuff into utype.h. utype.h should check whether the character is in the [A-Z] range or not.

So this is not because my patch sucks, but because you wrote the wrong statement...

Here is the patch for 2541:

Index: source/texk/web2c/luatexdir/luafontloader/fontforge/inc/utype.h
===================================================================
--- source/texk/web2c/luatexdir/luafontloader/fontforge/inc/utype.h (revision 2541)
+++ source/texk/web2c/luatexdir/luafontloader/fontforge/inc/utype.h (working copy)
@@ -58,7 +58,7 @@
 #define toupper(ch) (____toupper[(ch)+1])
 #else
 /* ASCII style */
-#define tolower(ch) (ch+'A'-'a')
+#define tolower(ch) ((ch >= 'A' && ch <= 'Z') ? ch + 32: ch)
 #endif
 #define islower(ch) (____utype[(ch)+1]&____L)
 #define isupper(ch) (____utype[(ch)+1]&____U)

That does the trick.

...

Best wishes, Taco

Yue Wang
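One practical caveat about the range-checked macro in the patch above: a function-like macro of that shape evaluates its argument more than once. The following C sketch contrasts it with a function form. All names here (`TOLOWER_MACRO`, `tolower_once`, `fetch_upper_g`) are hypothetical and chosen for illustration only; none of them appear in utype.h.

```c
/* The macro shape proposed in the patch, with each use of the argument
 * parenthesized. It still evaluates its argument up to three times. */
#define TOLOWER_MACRO(ch) (((ch) >= 'A' && (ch) <= 'Z') ? (ch) + 32 : (ch))

/* Hypothetical function form: the argument is evaluated exactly once. */
static inline int tolower_once(int ch)
{
    return (ch >= 'A' && ch <= 'Z') ? ch + ('a' - 'A') : ch;
}

/* Demonstration helper only: counts how often it is called. */
static int fetch_count = 0;

static int fetch_upper_g(void)
{
    ++fetch_count;
    return 'G';
}
```

`TOLOWER_MACRO(fetch_upper_g())` calls the helper three times, whereas `tolower_once(fetch_upper_g())` calls it once; both yield 'g'. For plain variables, as in the existing callers discussed in this thread, the macro form behaves the same as the function.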

Taco Hoekwater

noon

Yue Wang wrote:

...

Hi, Taco and Khaled

...
#2542 simply reverts #2541.

first of all, you left two typos there, so it will simply not work:

Yes, I know I messed up your patch and that it was not your fault, but nevertheless I changed my mind and now prefer to keep using the unicode version of tolower().

The reason is this: any imports I make from newer versions of fontforge (and that is likely) will assume the unicode version of tolower to be present. Using a different definition could result in bugs that are very hard to find.

Best wishes,
Taco
