[NTG-context] two buglets

Philipp Gesang pgesang at ix.urz.uni-heidelberg.de
Tue Oct 5 14:15:43 CEST 2010


On 2010-10-03 <17:43:21>, Thomas A. Schmitz wrote:
> OK, I'll write something for German and English, but the thing
> is that we need more input what users expect. For mixtures with
> foreign languages, there might not be generally accepted rules at
> all, so people will define something on an ad-hoc basis.

Hi Thomas and others,

technically speaking the problem is solved by ISO 14651.[1]

In practice, multilingual sorting depends on local rules, of
which “one index per script or language” seems to be the most
common.
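
A minimal sketch of the "one index per language" convention, for
illustration only. The entry list, the `lang` tags and the function
name split_by_language are all made up here; none of this is ConTeXt
code, and the plain table.sort call merely stands in for a proper
per-language comparator.

```lua
-- Hypothetical index entries; the Greek words are transliterated
-- placeholders so that plain byte-wise sorting is good enough here.
local entries = {
    { text = "Zeus",   lang = "en" },
    { text = "logos",  lang = "gr" },
    { text = "Apollo", lang = "en" },
    { text = "arete",  lang = "gr" },
}

-- Split the entries into one list per language tag.
local function split_by_language(list)
    local indexes = {}
    for _, entry in ipairs(list) do
        local bucket = indexes[entry.lang]
        if not bucket then
            bucket = {}
            indexes[entry.lang] = bucket
        end
        bucket[#bucket + 1] = entry.text
    end
    return indexes
end

local indexes = split_by_language(entries)

-- Sort every index on its own; a real setup would pass a
-- language-specific comparator as second argument to table.sort.
for _, bucket in pairs(indexes) do
    table.sort(bucket)
end

for lang, bucket in pairs(indexes) do
    print(lang, table.concat(bucket, ", "))
end
```

Each language then gets its own rules, and the question of how to
interleave scripts in a single list does not arise.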

Some time ago I made an lpeg grammar from the BNF in [1]. It matches
the collation rules from [2], but as I couldn’t figure out how to map
them onto context’s sorting mechanism I never got around to
actually capturing the information. As I won’t have the time
to try it with the new structure of sort-lan I guess I’ll just
attach the peg grammar for anyone to use as a starting point.
Unicode collation would be great to have in context.
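
For anyone picking this up, here is a rough sketch of what capturing
could look like for a single weight line. It is a simplified stand-in,
not the attached grammar: ranges and quoted symbol sequences are left
out, the name parse_weight_line is made up, and the sample line is
only written in the style of the table in [2].

```lua
local lpeg = require "lpeg"
local C, Cg, Ct, P, R, S = lpeg.C, lpeg.Cg, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S

local space   = S" \t"
local idchar  = R("az", "AZ", "09") + S"-_"
-- <NAME> captures NAME without the angle brackets.
local symbol  = P"<" * C(idchar^1) * P">"
-- One level is either a symbol or the keyword IGNORE.
local level   = symbol + C(P"IGNORE")
-- Levels are separated by semicolons and collected into a list.
local weights = Ct(level * (P";" * level)^0)
local comment = P"%" * P(1)^0

-- symbol, whitespace, weight list, optional comment, end of input
local line = Ct(Cg(symbol, "symbol") * space^1 * Cg(weights, "weights")
                * space^0 * comment^-1 * -1)

-- Hypothetical helper: returns { symbol = ..., weights = { ... } } or nil.
local function parse_weight_line(text)
    return line:match(text)
end

local entry = parse_weight_line(
    "<U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A")

print(entry.symbol, table.concat(entry.weights, " "))
```

The resulting table holds the symbol and its weights per level, which
is roughly the information one would have to feed into the sorter.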

> transliteration. The problem with polytonic Greek is that so many
> different unicode characters need to have the same sort entry. If

Isn’t that just what the Greek rules in sort-lan.lua do? If not
then it would be a bug.

····startsnippet·················································

definitions["gr"] = {
    entries = {
        ["α"] = "α", ["ά"] = "α", ["ὰ"] = "α", ["ᾶ"] = "α", ["ᾳ"] = "α",
        ["ἀ"] = "α", ["ἁ"] = "α", ["ἄ"] = "α", ["ἂ"] = "α", ["ἆ"] = "α",
        ["ἁ"] = "α", ["ἅ"] = "α", ["ἃ"] = "α", ["ἇ"] = "α", ["ᾁ"] = "α",
        ["ᾴ"] = "α", ["ᾲ"] = "α", ["ᾷ"] = "α", ["ᾄ"] = "α", ["ᾂ"] = "α",
        ["ᾅ"] = "α", ["ᾃ"] = "α", ["ᾆ"] = "α", ["ᾇ"] = "α", ["β"] = "β",

····stopsnippet··················································
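
To illustrate what such an entries table buys you, here is a small
sketch that folds every character of a word onto its sort entry.
The table is a trimmed, hand-made excerpt and sort_key is a made-up
name; this is not how sort-lan.lua itself processes the table. The
source has to be saved as UTF-8.

```lua
-- Trimmed excerpt in the style of the entries table above.
local entries = {
    ["α"] = "α", ["ἀ"] = "α", ["ᾶ"] = "α", ["ᾳ"] = "α",
    ["β"] = "β",
}

-- Walk the word one UTF-8 character at a time (a lead byte followed
-- by its continuation bytes) and replace each character by its sort
-- entry; characters without an entry pass through unchanged.
local function sort_key(word)
    return (word:gsub("[^\128-\191][\128-\191]*", function(char)
        return entries[char] or char
    end))
end

-- Differently accented spellings end up with the same key.
print(sort_key("ἀβᾶ") == sort_key("αβα"))
```

Comparing the keys instead of the raw strings makes all the accented
variants sort as the plain letter.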

Always nice to have a decent discussion on sorting ;)

Philipp


[1] http://standards.iso.org/ittf/PubliclyAvailableStandards/c044872_ISO_IEC_14651_2007(E).zip
[2] http://www.iso.org/ittf/ISO14651_2006_TABLE1_En.txt

-- 
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
-------------- next part --------------
-- Bind the module to a local; plain require "lpeg" only creates a
-- global in Lua 5.1.
local lpeg = require "lpeg"

-- Only the constructors that the grammar below actually uses.
local P, R, S, V = lpeg.P, lpeg.R, lpeg.S, lpeg.V

local rules = P{
    [1] = "weight_table",

    -- Define collation tables as sequences of lines

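    -- PEG choice is ordered and common_template_table never fails
    -- (it may match zero lines), so tailored_table has to come first
    -- to be reachable. It also accepts every simple_line.
    weight_table = V"tailored_table" + V"common_template_table",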
    common_template_table = V"simple_line"^0,
    tailored_table = V"table_line"^0,

    -- Define the line types

    simple_line = (V"symbol_definition" + V"collating_element" +
                   V"weight_assignment" + V"order_end")^-1 * V"line_completion" --/ function (first) io.write("simple: "..first) end
                   ,
    --table_line = V"simple_line" + V"tailoring_line",
    table_line = V"tailoring_line" + V"simple_line",
    tailoring_line = (V"reorder_after" + V"order_start" + V"reorder_end" +
                      V"section_definition" + V"reorder_section_after") *
                      V"line_completion" --/ function (first) io.write("tailoring: "..first) end
                      ,

    -- Define the basic syntax for collation weighting

    symbol_definition = P"collating-symbol" * V"space"^1 * V"symbol_element",
    symbol_element = V"symbol_range" + V"symbol",
    symbol_range = V"symbol" * P".." * V"symbol",
    symbol = V"simple_symbol" + V"ucs_symbol",
    ucs_symbol = (P"<U"  * V"one_to_eight_digit_hex_string" * P">") +
                 (P"<U-" * V"one_to_eight_digit_hex_string" * P">"),
    simple_symbol = P"<" * V"identifier" * P">",
    collating_element = P"collating-element" * V"space"^1 * V"symbol" * V"space"^1 *
                        P"from" * V"space"^1 * V"quoted_symbol_sequence",
    quoted_symbol_sequence = P'"' * V"simple_weight"^1 * P'"',
    --weight_assignment = V"simple_weight" + V"symbol_weight",
    weight_assignment = V"symbol_weight" + V"simple_weight",
    simple_weight = V"symbol_element" + P"UNDEFINED",
    symbol_weight = V"symbol_element" * V"space"^1 * V"weight_list",
    weight_list = V"level_token" * (V"semicolon" * V"level_token")^0,
    level_token = V"symbol_group" + P"IGNORE",
    symbol_group = V"symbol_element" + V"quoted_symbol_sequence",
    order_end = P"order_end",

    -- Define the tailoring syntax

    reorder_after = P"reorder-after" * V"space"^1 * V"target_symbol",
    target_symbol = V"symbol",
    order_start = P"order_start" * V"space"^1 * V"multiple_level_direction",
    multiple_level_direction = V"direction" * (V"semicolon" * V"direction")^0 * P",position"^-1,
    direction = P"forward" + P"backward",
    reorder_end = P"reorder-end",
    section_definition = V"section_definition_simple" + V"section_definition_list",
    section_definition_simple = P"section" * V"space"^1 * V"section_identifier",
    section_identifier = V"identifier",
    section_definition_list = P"section" * V"space"^1 * V"section_identifier" * V"space"^1 * V"symbol_list",
    symbol_list = V"symbol_element" * (V"semicolon" * V"symbol_element")^0,
    reorder_section_after = P"reorder-section-after" * V"space"^1 * V"section_identifier" * V"space"^1 * V"target_symbol",

    -- Define low-level tokens used by the rest of the syntax

    identifier = (V"letter" + V"digit") * V"id_part"^0,
    id_part = V"letter" + V"digit" + S"-_",
    line_completion = V"space"^0 * V"comment"^-1 * V"EOL",
    comment = V"comment_char" * V"character"^0,
    -- ^-8 alone would also accept zero digits; require at least one.
    one_to_eight_digit_hex_string = V"hex_upper" * V"hex_upper"^-7,
    hex_numeric_string = V"hex_upper"^1,
    space = S" \t",
    semicolon = P";",
    comment_char = P"%",
    digit = R"09",
    hex_upper = V"digit" + S"ABCDEF",
    letter = R"az" + R"AZ",
    EOL = P"\n",
    character = 1-V"EOL",
}

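-- Fail with a readable message if the table file is missing.
local f = assert(io.open("iso14651.txt", "r"))
local tab = f:read("*a")
f:close()

--rules:print()
-- The grammar has no captures, so match returns the position just
-- after the last character it could match.
print(rules:match(tab))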