[Fedora-i18n-bugs] [Bug 514110] [ta_IN] Tamil collation rules are not working in other locales

Thu Feb 4 18:07:40 UTC 2010

https://bugzilla.redhat.com/show_bug.cgi?id=514110

--- Comment #9 from K. Sethu <skhome at gmail.com> 2010-02-04 13:07:34 EST ---
Pravin

Excuse me for participating rather late. My apologies.

In the attachment to your Comment #5 above, I see the following:

[..]
ஃ
க்
க
கடந்த
கா
காலையில்
கி
கீ
கு
கூ
கெ
கே
கேஎஸ்
கேஎஸ்ரவிக்குமார்
கை
கொ
கோ
கௌ
க்க
க்கௗ
க்ங
ங்
ங
[..]

To be correct, க்க, க்கௗ and க்ங in the above should be in between க் and க, and
not between கௌ and ங் as shown above.

For example, if we sort { அ, அக், அகம், அக்கம், அக்கு, அகால } in ascending
order, then the correct result should be { அ, அக், அக்கம், அக்கு, அகம், அகால }

(அக் in the above is of course a meaningless word - but that has no bearing
on the sorting order.)

The sorting order in your attachment suggests these words would instead sort as
{ அ, அக், அகம், அகால, அக்கம், அக்கு }, which is wrong.

Compare with the English alphabet in ascending order:

a,b,c,d,e,f,.....x,y,z

Now ask: where should aa, ab, ac, ad, ...., az be? Obviously they are between
a and b - like a,aa,ab,ac,ad,...,ax,ay,az,b,c,d,e... Right?

Well, similarly, க் followed by any letter should be somewhere in between க் and
க. That is, if க் is followed by a character, the sort order is as follows:

க் < க்அ < க்ஆ < க்இ < க்ஈ < ..........க்ஒ < க்ஓ < க்ஔ < க்ஃ < க்க் < க்க <
க்கா < க்கி ............< க்கொ < க்கோ < க்கௌ < க்ங் < க்ங < க்ஙா ........

(Wherever I write ......., it means "and so on".)
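
The rule above can be sketched as a sort key. This is only an illustration under
my own assumptions, not how any existing locale file is written: the name
tamil_key is hypothetical, the primary order is taken to be vowels, then ஃ, then
the 18 Tamil consonants, and Grantha letters are left out. Each consonant is
read together with a following Pulli or vowel modifier as one unit, and within a
consonant the Pulli form gets the lowest weight.

```python
import unicodedata

def _ranks(chars, start=0):
    """Map each character to its position in the given ordered string."""
    return {ch: i for i, ch in enumerate(chars, start)}

# Assumed primary order: vowels, then aytham, then the 18 Tamil consonants.
VOWEL_RANK = _ranks("\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a"
                    "\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94")
CONSONANT_RANK = _ranks("\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3"
                        "\u0ba4\u0ba8\u0baa\u0bae\u0baf\u0bb0"
                        "\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9")
# Forms of one consonant: 0 = with Pulli, 1 = bare, 2.. = with a vowel modifier.
SIGN_RANK = _ranks("\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6"
                   "\u0bc7\u0bc8\u0bca\u0bcb\u0bcc", start=2)
AYTHAM = "\u0b83"
PULLI = "\u0bcd"

def tamil_key(word):
    """Hypothetical sort key: one tuple per letter unit, compared left to right."""
    text = unicodedata.normalize("NFC", word)  # composed and decomposed forms agree
    key, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        width = 1
        if ch in CONSONANT_RANK:
            if nxt == PULLI:
                form, width = 0, 2
            elif nxt in SIGN_RANK:
                form, width = SIGN_RANK[nxt], 2
            else:
                form = 1
            key.append((2, CONSONANT_RANK[ch], form))
        elif ch in VOWEL_RANK:
            key.append((0, VOWEL_RANK[ch], 0))
        elif ch == AYTHAM:
            key.append((1, 0, 0))
        else:
            # Anything else (Grantha letters, digits, ...) goes last by code point.
            key.append((3, ord(ch), 0))
        i += width
    return tuple(key)

words = ["அ", "அக்", "அகம்", "அக்கம்", "அக்கு", "அகால"]
print(sorted(words, key=tamil_key))
```

With these assumptions the six example words come out as அ, அக், அக்கம், அக்கு,
அகம், அகால, and க் followed by any letter falls between க் and க.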

In English sorting of words, if two words have the same first letter then they
are next to each other, and their relative placement is determined by the second
letter in each - if those are also the same, then by the 3rd letter, and so on
ad nauseam.

Essentially, we first sort on the first letter, then wherever necessary on the
second letter, and so on, and the same letter order is used for each of those
sorts. So the increasing order (or weights) of the characters forms a single
one-dimensional sequence, which is applied repeatedly, wherever needed, when
collating words.
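
As a small sketch of that idea (plain lowercase a-z only; the names WEIGHT and
compare are just for illustration), one weight table is consulted at every
position, and a word that is a prefix of another sorts first:

```python
import string
from functools import cmp_to_key

# One weight per letter, reused at every position of the word.
WEIGHT = {ch: i for i, ch in enumerate(string.ascii_lowercase)}

def compare(a, b):
    """Return -1, 0 or 1 by comparing a and b position by position."""
    for x, y in zip(a, b):
        if WEIGHT[x] != WEIGHT[y]:
            return -1 if WEIGHT[x] < WEIGHT[y] else 1
    # All shared positions are equal: the shorter word sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))

print(sorted(["b", "az", "a", "ab", "aa"], key=cmp_to_key(compare)))
# ['a', 'aa', 'ab', 'az', 'b']
```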

The principle is the same for Tamil too, but there are some differences. In the
case of English, each character is represented in Unicode by a single code point.

But in the case of Tamil, I presume, the following differences could make it
somewhat more complex to conceptualise the collation algorithm:

i) Canonical decompositions - for each of ஔ (0B94), ொ (0BCA), ோ (0BCB) and ௌ
(0BCC), the NFC and the NFD forms should collate as equal.

ii) Grantha ligatures SHRII and KSHA, as well as KSHA+Virama - each of these is
of course represented not by a single code point but by a sequence of 3 or 4
code points.

iii) The 11 vowel modifiers from ா (0BBE) to ௌ (0BCC) can be placed as
singletons in ascending order (with the canonical decompositions of the last 3 -
0BCA, 0BCB and 0BCC - made equal to their composed forms). The Pulli (i.e.,
Virama, 0BCD), however, has to be placed together with each base consonant,
immediately preceding the corresponding base consonant - i.e., க் (0B95 0BCD)
precedes க (0B95), and similarly for each base consonant.
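
The code point facts in (i) and (ii) can be checked with Python's standard
unicodedata module. This is only a checking sketch: canonical_equal is a
hypothetical helper name, and the SHRII spelling used here (with 0BB8) is one of
the spellings in use, chosen by me for illustration.

```python
import unicodedata

# Precomposed code point -> its canonical decomposition.
DECOMPOSITIONS = {
    "\u0b94": "\u0b92\u0bd7",  # letter AU
    "\u0bca": "\u0bc6\u0bbe",  # vowel sign O
    "\u0bcb": "\u0bc7\u0bbe",  # vowel sign OO
    "\u0bcc": "\u0bc6\u0bd7",  # vowel sign AU
}

def canonical_equal(a, b):
    """True when a and b are the same text after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

for composed, decomposed in DECOMPOSITIONS.items():
    # NFD splits the composed form; NFC joins the parts back together.
    assert unicodedata.normalize("NFD", composed) == decomposed
    assert canonical_equal(composed, decomposed)

# Ligatures are code point sequences, not single code points.
LIGATURES = {
    "KSHA": "\u0b95\u0bcd\u0bb7",
    "KSHA + Virama": "\u0b95\u0bcd\u0bb7\u0bcd",
    "SHRII": "\u0bb8\u0bcd\u0bb0\u0bc0",  # assumed spelling, see note above
}
for name, seq in LIGATURES.items():
    print(name, len(seq), " ".join(f"U+{ord(c):04X}" for c in seq))
```

Normalising both sides before comparing is one simple way to make the composed
and decomposed forms equal; a collation table could equally list both forms
with the same weight.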

I hope my observations so far are clear and correct; if not, I would like to
know where I am wrong. I will write more after seeing responses.

K. Sethu
(11:30 PM , Colombo-Sri Lanka)
