Arabic-utf-8 (plus a sample)

Idris Samawi Hamid

5 Jun 2004

9:32 p.m.

Hi gang,

For Arabic we use a Latin transcription in Aleph/(e-)Omega (or even ArabTeX) unless one of the encoding filters like utf-8 is used. Even for utf-8 files, however, it would be very useful to be able to convert a utf-8 file to Latin transcription for further processing by Aleph/(e-)Omega. For example, adding diacritics is much easier to do in Latin than in an Arabic script editor because Latin transcription is one-dimensional and adding diacritics to Arabic is a two-dimensional affair.

The best thing would be a perl script, but I don't know perl at all (except to run some pre-created scripts). If someone out of the kindness of their heart could write a short and simple script for just seven characters, I could do the rest myself and present it back here.

Now all of the Arabic characters in utf-8 can be represented by extended ascii. I need something like this, that converts every extended ascii representation of Arabic utf-8 into a Latin transcription:

Ø§ => A
Ø¨ => b
Ø¬ => j
Ø¯ => d
Ù‡ => h
Ùˆ => w
Ø² => z

If someone could write a perl script that can accomplish the above conversion, I can manually fill in the rest of the script. Basically I use a modified version of the ArabTeX transcription.

Here is a "gift" in return: a sample utf-8 Arabic file that can be processed by Aleph/(e-)Omega in ConTeXt (you will probably need to dvips this, though some dvi-viewers can do the postscript/16-bit thing):

==============================================
\hoffset=0pt % for Omega bug: has this been fixed?

\def\ArabicUTF{\ocp\UTFArUni=inutf8 %% in88596
%\ocp\UTFArUni=in88596
\ocp\UniCUni=uni2cuni
\ocp\CUniArab=cuni2oar
\ocplist\UTFArOCP=
\addbeforeocplist 1 \UTFArUni
\addbeforeocplist 1 \UniCUni
\addbeforeocplist 1 \CUniArab
\nullocplist
\pushocplist\UTFArOCP}

\input m-gamma.tex
\input type-omg.tex
\switchtobodyfont[omarb,12pt]
%

\textdir TRT%
\pardir TRT%
\ArabicUTF

\starttext

ØŒ Ø› ØŸ Ø¡ Ø¢ Ø£ Ø¤ Ø¥ Ø¦ Ø§ Ø¨ Ø© Øª Ø« Ø¬ Ø Ø® Ø¯ Ø° Ø± Ø² Ø³ Ø´ Øµ Ø¶ Ø· Ø¸ Ø¹ Øº Ù€ Ù Ù‚ Ùƒ Ù„ Ù… Ù† Ù‡ Ùˆ Ù‰ ÙŠ

\blank[big]

%Ù‹ ÙŒ Ù ÙŽ Ù Ù Ù‘

Ù’ Ù Ù¡ Ù¢ Ù£ Ù¤ Ù¥ Ù¦ Ù§ Ù¨ Ù© Ùª Ù« Ù¬ Ù° Ù± Ù² Ù³ Ù´ Ùµ Ù¶ Ù· Ù¸ Ù¹ Ùº Ù» Ù¼ Ù½ Ù¾ Ù¿ Ú€ Ú Ú‚ Úƒ Ú„ Ú… Ú† Ú‡ Úˆ Ú‰ ÚŠ Ú‹ ÚŒ Ú ÚŽ Ú Ú Ú‘ Ú’ Ú“ Ú” Ú• Ú– Ú— Ú˜ Ú™ Úš Ú› Úœ Ú Úž ÚŸ Ú¢ Ú¡ Ú¢ Ú£ Ú¤ Ú¥ Ú¦ Ú§ Ú¨ Ú© Úª Ú« Ú¬ Ú Ú® Ú¯ Ú° Ú± Ú² Ú³ Ú´ Úµ Ú¶ Ú· Úº Ú» Ú¼ Ú¾ Û€ Û Ûƒ Û„ Û… Û† Û‡ Ûˆ Û‰ ÛŠ Û‹ ÛŒ Û Û‘ Û’ Û“ Û” Û• Û° Û± Û² Û³ Û´ Ûµ Û¶ Û· Û¸ Û¹

\blank[big]

Ù€Ù‹ Ù€ÙŒ Ù€Ù Ù€ÙŽ Ù€Ù Ù€Ù Ù€Ù‘ Ù€Ù’ Ù€Ù°

Ø§ Ø¨ Ø¬ Ø¯ Ù‡ Ùˆ Ø²

\stoptext
==============================================

Best
Idris
-- 
Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523


Thomas A. Schmitz

5 Jun

10:41 p.m.

Idris,

I know a bit of perl and would love to help. However, I fear that sending us your stuff via mail will be a bit difficult because the utf-8 characters get transformed into gibberish. Could you send the hexadecimal code of the characters you want to convert? Or I could simply give you the syntax, and you'll know what to do. So here comes a perl script that works for my Greek stuff; I don't see why it shouldn't work with Arabic:

==================================cut here
#!/usr/bin/perl -w

use strict;
use open ':utf8';

open(NEW,">new.tex"); #opens file to print out the result

while (<>); { #this opens the file for reading

$_ =~ s/\x{HEXADECIMAL_VALUE_OF_CHARACTER}/\x{HEXADECIMAL_VALUE_OF_NEW_CHARACTER}/esg; #this is the actual conversion

print NEW "$_"; #and this writes the result into file "new.tex"
}

close(NEW);
==================================and here

Make the script executable and call it with the name of a file as an argument.

HTH

Thomas

On Sat, 2004-06-05 at 21:32, Idris Samawi Hamid wrote:

...

Idris Samawi Hamid

11:33 p.m.

On Sat, 05 Jun 2004 22:41:39 +0200, Thomas A. Schmitz wrote:

...

Idris,

I know a bit of perl and would love to help. However, I fear that sending us your stuff via mail will be a bit difficult because the utf-8 characters get transformed into gibberish.

Thnx 4 such a speedy reply! I don't think you are getting gibberish though; you should be getting the extended ascii representation. So the letter alif (hex 0627) should look like this:

Ø§

Do you get a forward-slashed circle and a section symbol? If so, that's the ascii representation I'm trying to convert to the letter `A'.

Here are the codes you want:

Ø§ [0627] => A
Ø¨ [0628] => b
Ø¬ [062C] => j
Ø¯ [062F] => d
Ù‡ [0647] => h
Ùˆ [0648] => w
Ø² [0632] => z

Let me explain my situation more clearly:-)

I have a unicode editor, Unitype Global Writer. I save a unicode document as a utf *.txt file. When I open that saved file in my TeX editor (WinEdt), it comes out as extended ascii (that's the "gibberish"). So what I wanted to do was convert the ascii "gibberish" to my Latin transcription. It seems that what you are suggesting is to use the hex representation, convert the unicode txt file into a Latin transcription file directly, and bypass the gibberish.

On your perl file, can you give me an example of how to use it? I tried (in windows, with name utf2tex.pl and unicode text in unicode-utf.txt) and get

=========================
perl utf2tex.pl unicode-utf.txt
Unknown discipline class ':utf8' at C:/Perl/lib/open.pm line 18.
BEGIN failed--compilation aborted at utf2tex.pl line 4.
=========================

From your script I tried, e.g.

============================
$_ =~ s/\x{0627}/\x{0041}/esg; # from alif to `A'
============================

Your guidance will be greatly appreciated!

Thnx a million!
Idris
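The relationship between the code point and the two "extended ascii" characters can be checked with a few lines of Perl. This is only a sketch, assuming Perl 5.8 or later (for the core `Encode` module) and assuming the editor displays bytes as Windows code page 1252; the helper names `utf8_hex` and `mojibake_for` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);   # core module since Perl 5.8

# Hex dump of the UTF-8 bytes that encode one Unicode code point.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# What an 8-bit editor would display for that code point: encode it as
# UTF-8, then misread each byte as one cp1252 character (an assumption
# about the editor's code page).
sub mojibake_for {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return decode('cp1252', $bytes);
}

# Print code point, UTF-8 bytes, and the misread form for three letters.
binmode STDOUT, ':encoding(UTF-8)';
printf "U+%04X -> bytes %s -> shown as %s\n", $_, utf8_hex($_), mojibake_for($_)
    for 0x0627, 0x0628, 0x062C;
```

For alif (0627) the bytes come out as D8 A7, and those two bytes read as cp1252 are the slashed O and the section sign described above.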

Thomas A. Schmitz

11:48 p.m.

Just a quick reply (it's bedtime over here): there may be 2 problems. The first is that the mail program put in an unwanted linebreak after the =~ part; just remove it, it should all be one line. And then: you'll need a fairly recent version of perl for it to work. What do you get when you do

perl --version

I guess for utf to work, it should be at least 5.8.0. Your basic idea of the usage is right (I'm not a windows person, but I assume it should be the same): save the script as utf2tex.pl, make it executable and call it as utf2tex.pl FILENAME.txt.

I guess it would be easiest to convert the utf to ascii directly - that would mean you could later convert it back. I have a set of scripts that do just that -- convert babel Greek into utf-8 and back. If you need more help, I'll look into it tomorrow!

Best

Thomas

On Sat, 2004-06-05 at 23:33, Idris Samawi Hamid wrote:

...

Idris Samawi Hamid

6 Jun

12:51 a.m.

On Sat, 05 Jun 2004 23:48:18 +0200, Thomas A. Schmitz wrote:

...

Just a quick reply (it's bedtime over here): there may be 2 problems.

Ok, get some sleep;-)

Anyhow, I fixed the line break (is the space between tilde and `s' correct?)

==============================
$_ =~ s/\x{0627}/\x{0041}/esg; #this is the actual conversion
==============================

It did not work though:-( My perl version is v5.6.1; I went to the ActivePerl website and the only version they had is v5.6.1.638, so from perl.org I found Indigoperl and switched;-) This solves part of the problem:-) Now I get

perl utf2tex.pl unicode-utf.txt
syntax error at utf2tex.pl line 8, near ");"
Execution of utf2tex.pl aborted due to compilation errors.

Line 8 is

while (<>); { #this opens the file for reading

Here is the whole file once again:

==================================
#!/usr/bin/perl -w

use strict;
use open ':utf8';

open(NEW,">new.tex"); #opens file to print out the result

while (<>); { #this opens the file for reading

$_ =~ s/\x{0627}/\x{0041}/esg; #this is the actual conversion

print NEW "$_"; #and this writes the result into file "new.tex"
}

close(NEW);
==================================

...

the usage is right (I'm not a windows person,

If WinEdt (and some other things) worked under WINE, I would not be a windows person either:-( And I will attempt yet another switch (I've lost count) to Linux-KDE sometime this summer...

Thnx a million
Idris

Giuseppe Bilotta

1:15 a.m.

New subject: Re[2]: Arabic-utf-8 (plus a sample)

Sunday, June 6, 2004 Idris Samawi Hamid wrote:

...

Here is the whole file once again:

My take: try the following (should work even with ActiveState 5.6)

===
#!/usr/bin/perl

use strict;
#D comment the following, I think we can do without
# use open ':utf8';

open(NEW,">new.tex"); #opens file to print out the result

while (<>); { #this opens the file for reading

$_ =~ s/\x06\x27/A/esg; #this is the actual conversion

print NEW "$_"; #and this writes the result into file "new.tex"
}

close(NEW);
===

Save as e.g. idris_conv.pl and issue as

perl idris_conv.pl < filename.txt

where filename.txt is the filename to convert.

-- 
Giuseppe "Oblomov" Bilotta

Idris Samawi Hamid

1:31 a.m.

New subject: Re[2]: Arabic-utf-8 (plus a sample)

On Sun, 6 Jun 2004 01:15:56 +0200, Giuseppe Bilotta wrote:

...

Hi Giuseppe (Is it not way past your bedtime;->),

Here's my result:

perl utf2tex2.pl < unicode-utf.txt
syntax error at utf2tex2.pl line 9, near ");"
Bareword "A" not allowed while "strict subs" in use at utf2tex2.pl line 11.
Execution of utf2tex2.pl aborted due to compilation errors.

Please advise;->

Best
Idris

Giuseppe Bilotta

1:58 a.m.

New subject: Re[4]: Arabic-utf-8 (plus a sample)

Sunday, June 6, 2004 Idris Samawi Hamid wrote:

...

Hi Giuseppe (Is it not way past your bedtime;->),

Yes it is, and it shows. But since I'm up and not having any particular urge to go to bed at this very moment, here's a tested alternative that works here:

===
#!/usr/bin/perl

use strict;
use warnings;

open(NEW,">new.tex"); #opens file to print out the result

while (<>) { #this opens the file for reading

$_ =~ s/\xD8\xA7/A/g; #this is the actual conversion
$_ =~ s/\xD8\xA8/b/g; #this is the actual conversion
$_ =~ s/\xD8\xAC/j/g; #this is the actual conversion
$_ =~ s/\xD8\xAF/d/g; #this is the actual conversion
$_ =~ s/\xD9\x87/h/g; #this is the actual conversion
$_ =~ s/\xD9\x88/w/g; #this is the actual conversion
$_ =~ s/\xD8\xB2/z/g; #this is the actual conversion

print NEW "$_"; #and this writes the result into file "new.tex"
}

close(NEW);
===

to be used as

utf2tex filename

If you want to add more conversions, open your unicode file in a hex editor and check the actual byte-per-byte hex value of the utf text for the other characters you want to add. This should be enough for your needs.

-- 
Giuseppe "Oblomov" Bilotta

Idris Samawi Hamid

2:19 a.m.

New subject: Re[4]: Arabic-utf-8 (plus a sample)

On Sun, 6 Jun 2004 01:58:44 +0200, Giuseppe Bilotta wrote:

...

It works! I'll try to finish a basic script that works for Lagally's ArabTeX transcription (that I use) and post it here and on the aleph list.

One question: The hex for e.g. alif is 0627; how did you get D8A7 from that for purposes of the script (so I can follow along for the rest)?

Best
Idris
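For reference, D8A7 is simply the UTF-8 encoding of code point 0627: code points from 0080 to 07FF are stored as two bytes, the first carrying the top five bits and the second the low six bits. A minimal sketch of that arithmetic follows; the helper names `two_byte_utf8` and `substitution_line` are hypothetical, and the output format just imitates the substitution lines in Giuseppe's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# UTF-8 bytes for a code point in the two-byte range U+0080..U+07FF,
# which includes the Arabic block (U+0600..U+06FF).
sub two_byte_utf8 {
    my ($cp) = @_;
    die sprintf("U+%04X is outside the two-byte range\n", $cp)
        if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # 110xxxxx: top five bits
    my $trail = 0x80 | ($cp & 0x3F);    # 10xxxxxx: low six bits
    return ($lead, $trail);
}

# Build one substitution line in the style of the working script.
sub substitution_line {
    my ($cp, $latin) = @_;
    return sprintf 's/\x%02X\x%02X/%s/g;', two_byte_utf8($cp), $latin;
}

print substitution_line(0x0627, 'A'), "\n";   # alif
print substitution_line(0x0647, 'h'), "\n";   # ha
```

Worked by hand for alif: 0x0627 shifted right by six bits is 0x18, and 0xC0 | 0x18 = 0xD8; the low six bits are 0x27, and 0x80 | 0x27 = 0xA7.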

Idris Samawi Hamid

2:26 a.m.

New subject: Re[4]: Arabic-utf-8 (plus a sample)

On Sat, 05 Jun 2004 18:19:22 -0600, Idris Samawi Hamid wrote:

...

One question: The hex for e.g. alif is 0627; how did you get D8A7 from that for purposes of the script (so I can follow along for the rest)?

Ok, I found it:

...

...
If you want to add more conversions, open your unicode file in a hex editor and check the actual byte-per-byte hex value of the utf text for the other characters you want to add. This should be enough for your needs.

I just downloaded XVI32. Hmm... never heard of or needed a hex editor before now...

Best
Idris

Henning Hraban Ramm

11:09 a.m.

New subject: Perl scripting (was: Arabic-utf-8)

On Sunday, 06.06.04, at 02:19 (Europe/Zurich), Idris Samawi Hamid wrote:

...

...
open(NEW,">new.tex"); #opens file to print out the result

better: open NEW, ">", "new.tex" or die $!;

...

...
$_ =~ s/\xD8\xA7/A/g; #this is the actual conversion

if you work with $_ you can leave it out, simply:

s/\xD8\xA7/A/g;

But for a series of conversions I'd suggest a hash for better overview. Whole script like this:

-----
#!/usr/bin/perl -w
use strict;
use warnings;

my ($Source, $Target) = (shift, shift); # gets 2 file names from command line

my %conv = ( # enhance as needed
    "\xD8\xA7" => "A",
    "\xD8\xA8" => "b",
    "\xD8\xAC" => "j",
    "\xD8\xAF" => "d"
);

open SOURCE, "<", $Source or die $!;
open TARGET, ">", $Target or die $!;

# there are ways to read a whole file in one scalar,
# e.g. with File::Slurp, but I don't know them by heart...
while (my $line = <SOURCE>) {
    foreach my $key (keys %conv) {
        $line =~ s/$key/$conv{$key}/g;
    } # foreach
    print TARGET $line;
} # while

close SOURCE;
close TARGET;
-----

BTW: ActiveState has Perl 5.8.4, at least for Windows (I use it at work).

Greetings from Hraban!
-- 
http://www.fiee.net/texnique/

Idris Samawi Hamid

11:03 p.m.

New subject: Perl scripting (was: Arabic-utf-8)

On Sun, 6 Jun 2004 11:09:32 +0200, Henning Hraban Ramm wrote:

...

Thnx; I'll play around with this as well. BTW: is there any way to do this without the hex editor and just enter the full 4-digit character (a la Thomas's original suggestion) e.g.,

"\x0627" => "A"

While the hex editor certainly works, it is really slow and tedious work...
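One possible way to key the table on code points rather than raw bytes is to decode the input to characters first. The following is a sketch only, assuming Perl 5.8 or later with the core `Encode` module; the hash name `%latin_for` and the helper `transliterate` are hypothetical. Note that Perl needs braces for four-digit escapes: "\x0627" would be read as "\x06" followed by the two characters "27", so the keys are written "\x{0627}".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);   # core module since Perl 5.8

# Code point => Latin transcription (the seven letters from this thread;
# extend as needed).
my %latin_for = (
    "\x{0627}" => 'A',
    "\x{0628}" => 'b',
    "\x{062C}" => 'j',
    "\x{062F}" => 'd',
    "\x{0647}" => 'h',
    "\x{0648}" => 'w',
    "\x{0632}" => 'z',
);

# Takes raw UTF-8 bytes and returns raw UTF-8 bytes with every mapped
# Arabic letter replaced; unmapped characters pass through unchanged.
sub transliterate {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);          # bytes -> characters
    $text =~ s/([\x{0600}-\x{06FF}])/exists $latin_for{$1} ? $latin_for{$1} : $1/ge;
    return encode('UTF-8', $text);               # characters -> bytes
}

# The UTF-8 bytes for alif, ba, jim, dal.
print transliterate("\xD8\xA7\xD8\xA8\xD8\xAC\xD8\xAF"), "\n";
```

Because the lookup happens in a single pass over the decoded text, no hex editor is needed: the four-digit values from the Unicode charts go straight into the table.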

...

BTW: ActiveState has Perl 5.8.4, at least for Windows (I use it at work).

Ok, I found it:

http://downloads.activestate.com/ActivePerl/Windows/5.8/ActivePerl-5.8.3.809...

But the web site (at first glance) sure gives one the impression that their latest release is 5.6.1.638

http://www.activestate.com/

http://www.activestate.com/Products/ActivePerl/

Best
Idris

Thomas A. Schmitz

11:28 p.m.

New subject: Perl scripting (was: Arabic-utf-8)

Well, if you put the

use open ':utf8';

in the header of your perl script, it should work without the hex editor (btw: I would recommend using emacs in hex mode (M-x hexl-find-file)). And just for the record: to put the entire file in one array, use this:

my @lines = <>;
my $text = join "", @lines;
$text =~ s/PUT_YOUR/SUBSTITUIONS_HERE/esg;

But it looks like you got a working solution now, so have fun playing around with it. And boy does it make one feel good when you realize that you windoze people are still working with perl 5.6 -- that's the stone age, man ;-)

Best

Thomas

On Sun, 2004-06-06 at 23:03, Idris Samawi Hamid wrote:

...

Henning Hraban Ramm

7 Jun

9:45 p.m.

New subject: Perl scripting (was: Arabic-utf-8)

On Sunday, 06.06.04, at 23:28 (Europe/Zurich), Thomas A. Schmitz wrote:

...

Well, if you put the use open ':utf8'; in the header of your perl script, it should work without the hex editor

Not needed with Perl 5.8.x and a proper UTF8 file.

...

And just for the record: to put the entire file in one array, use this: my @lines = <>; my $text = join "", @lines;

$text =~ s/PUT_YOUR/SUBSTITUIONS_HERE/esg;

Thank you, I always forget the really simple solutions. ;-) And with File::Slurp you get it directly into a scalar.
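For completeness, a whole file can also be read into one scalar with core Perl alone by undefining the input record separator. This is a minimal sketch, not the File::Slurp interface; the helper name `slurp` is hypothetical, and `File::Temp` is used only to give the demo something to read.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);   # core module, used only for the demo below

# Return the entire contents of a file as one scalar.
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    binmode $in;                # raw bytes, so \xD8\xA7-style patterns match
    local $/;                   # no record separator: one read returns everything
    my $text = <$in>;
    close $in;
    return $text;
}

# Demo: write two lines of UTF-8 bytes to a scratch file, slurp it back,
# and run a single substitution over the whole text.
{
    my ($out, $name) = tempfile(UNLINK => 1);
    binmode $out;
    print {$out} "\xD8\xA7\xD8\xA8\n\xD8\xA7\n";
    close $out;

    my $text = slurp($name);
    $text =~ s/\xD8\xA7/A/g;    # alif => A, across all lines at once
    print $text;
}
```

The `local $/` is scoped to the helper, so line-by-line reads elsewhere in a script keep working as before.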

...

But it looks like you got a working solution now, so have fun playing around with it. And boy does it make one feel good when you realize that you windoze people are still working with perl 5.6 -- that's the stone age, man ;-)

MacOS X also has only 5.6 if you don't install a newer one yourself, and with anything newer than 5.8.0 you get endless trouble...

Greetings from Hraban!
-- 
http://www.fiee.net/texnique/

Thomas A. Schmitz

10:53 p.m.

New subject: Perl scripting (was: Arabic-utf-8)

Hey Hraban,

not to be a PITA, but with OS X 10.3, we moved up to perl 5.8.1. And in my gentoo installation, I'm now at perl 5.8.4, yessirre! ;-)

Best

Thomas

On Jun 7, 2004, at 9:45 PM, Henning Hraban Ramm wrote:

...

MacOS X has also only 5.6 if you don't install a newer one yourself, and with newer than 5.8.0 you get endless trouble...

Richard MAHONEY

6 Jun

1:08 a.m.

New subject: [SPAM: 3.411] Arabic-utf-8 (plus a sample)

On Sat, Jun 05, 2004 at 01:32:35PM -0600, Idris Samawi Hamid wrote:

...

You might like to look at some of the encoding conversion scripts at:

http://homepages.comnet.co.nz/~r-mahoney/scripts/scripts.html

N.B. For sorting utf-8 Arabic you might find the perl module `Sort::ArbBiLex' useful.

Best regards,

Richard Mahoney

-- 
Richard MAHONEY | internet: homepages.comnet.net.nz/~r-mahoney
Littledene | telephone / telefax (man.): ++64 3 312 1699
Bay Road | cellular: ++64 25 829 986
OXFORD, NZ | e-mail: r.mahoney[use"@"]comnet.net.nz

Idris Samawi Hamid

2:19 a.m.

On Sun, 6 Jun 2004 11:08:46 +1200, Richard MAHONEY wrote:

...

You might like to look at some of the encoding conversion scripts at:

http://homepages.comnet.co.nz/~r-mahoney/scripts/scripts.html

N.B. For sorting utf-8 Arabic you might find the perl module `Sort::ArbBiLex' useful

Thnx Richard; most of the scripts are beyond my capability;->, but the UTF8 to TeX / LaTeX scripts seem very useful for a possible transliteration module, which is something else I need to do for my journal, where people send in different transliteration conventions that I have to convert to the journal's convention. I'm ashamed to say it; I've been doing this kind of thing manually up to now;

\startsuperhero
must... learn... scripting... language...
\stopsuperhero

Thnx
Idris

George N. White III

3:22 p.m.

On Sat, 5 Jun 2004, Idris Samawi Hamid wrote:

...

Can you use (or extend) GNU recode? It does include support for utf-8 and several TeX encodings.
