https://bugzilla.redhat.com/show_bug.cgi?id=857967
--- Comment #1 from Mike FABIAN <mfabian(a)redhat.com> ---
Discussion with the upstream author of ibus-table about how to fix the problem:
1:00 PM me: Hi Yuwei!
Yuwei YU: hi
hi
1:01 PM me:
http://code.google.com/p/ibus/issues/detail?id=1492 <- have you
seen this?
晞 cannot be typed with wubi.
I think the reason is that it is misdetected as traditional Chinese only.
1:02 PM Because that character is not in gb2312 but in big5hkscs.
So the test conversion to gb2312 fails but the test conversion to big5hkscs
succeeds in tabsqlitedb.py
1:03 PM # first whether in gb2312
try:
    tmp_phrase.encode('gb2312')
    category |= 1
except:
    if '〇'.decode('utf8') in tmp_phrase:
        # we add '〇' into SC as well
        category |= 1
# second check big5-hkscs
try:
    tmp_phrase.encode('big5hkscs')
    category |= 1 << 1
...
1:04 PM But 晞 is actually the same in simplified and traditional Chinese.
So detecting it as traditional Chinese only seems wrong.
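The code quoted from tabsqlitedb.py is Python 2. To reproduce the misdetection, a minimal Python 3 sketch of the same test-encoding approach could look like the following. The function name `legacy_category` is hypothetical, and the handling of a failed big5hkscs conversion is an assumption, because the quoted code is cut off at that point.

```python
def legacy_category(phrase):
    """Classify a phrase by test-encoding it to legacy encodings.

    Bit 1 (value 1) means simplified Chinese, bit 2 (value 2) means
    traditional Chinese, following the bitmask in the quoted code.
    """
    category = 0
    # First check: is the phrase representable in gb2312?
    try:
        phrase.encode('gb2312')
        category |= 1
    except UnicodeEncodeError:
        if '〇' in phrase:
            # '〇' is counted as simplified Chinese as well
            category |= 1
    # Second check: is the phrase representable in big5hkscs?
    try:
        phrase.encode('big5hkscs')
        category |= 1 << 1
    except UnicodeEncodeError:
        # Assumption: a failed conversion simply leaves the bit unset.
        pass
    return category


if __name__ == '__main__':
    # 晞 is not in gb2312, so only the traditional bit ends up set.
    print(legacy_category('晞'))
```

With this approach 晞 gets only the traditional bit, which matches the behaviour reported in the issue.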
1:05 PM I think to fix this, the detection of simplified and traditional
Chinese needs to be improved.
I thought about using the Unihan database for this.
1:06 PM For example:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%9D%B1
東 contains:
kSimplifiedVariant U+4E1C 东
and the entry for 东 contains:
kTraditionalVariant U+6771 東
1:07 PM The entry for 晞,
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%99%9E contains
neither kSimplifiedVariant nor kTraditionalVariant.
1:08 PM My idea is that this can be used for better detection of simplified
and traditional characters.
1:09 PM I.e. parse this data out of the plain text version of the Unihan
database at build time of ibus-table and generate a hash table in Python
which records whether a Chinese character is simplified, traditional or both.
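A rough sketch of that build-time step, assuming the plain text data comes as tab-separated lines of the form `U+4E1C<TAB>kTraditionalVariant<TAB>U+6771` (the layout used by Unihan_Variants.txt). The names `parse_variants` and `char_category` are hypothetical. Treating a character that has a kTraditionalVariant entry as simplified, one that has a kSimplifiedVariant entry as traditional, and one with neither as both is an assumption about how the data would be interpreted, not something settled in this discussion.

```python
SIMPLIFIED = 1        # bit 1 of the category bitmask
TRADITIONAL = 1 << 1  # bit 2 of the category bitmask


def parse_variants(lines):
    """Build a dict mapping characters to a category bitmask."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        fields = line.split('\t')
        if len(fields) < 3:
            continue
        codepoint, field = fields[0], fields[1]
        if field == 'kTraditionalVariant':
            bit = SIMPLIFIED    # it has a traditional form, so it is simplified
        elif field == 'kSimplifiedVariant':
            bit = TRADITIONAL   # it has a simplified form, so it is traditional
        else:
            continue
        char = chr(int(codepoint[2:], 16))  # 'U+4E1C' -> '东'
        table[char] = table.get(char, 0) | bit
    return table


def char_category(table, char):
    """Characters without any variant entry count as both."""
    return table.get(char, SIMPLIFIED | TRADITIONAL)


# Hand-written sample lines in the assumed file layout.
sample = [
    '# sample data',
    'U+4E1C\tkTraditionalVariant\tU+6771',
    'U+6771\tkSimplifiedVariant\tU+4E1C',
]
table = parse_variants(sample)
```

In this sketch 晞 has no entry in the table and therefore falls back to "both", which is the desired result for that character.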
Yuwei YU: ok, a new method of encoding detection is needed here
1:10 PM me: What do you think of the idea of using the data available in the
Unihan database for this?
The Unihan database seems to contain the necessary information.
Yuwei YU: yes, we can try it
1:11 PM me: I can try to implement this.
Do you think generating a hashtable from the Unihan database at build time of
ibus-table is a good idea? I think I could probably do that.
1:12 PM Keys of the hash table would be all possible Chinese characters, values
would contain whether it is simplified, traditional or both.
The current detection code could then be replaced by iterating over the
phrase and looking up all Chinese characters in the phrase in that hash table.
1:14 PM If the phrase contains only characters which are both simplified and
traditional, then the phrase gets both bit 1 and bit 2 set.
If a character is there which is only simplified, then the phrase gets only
bit 1 set.
If a character is there which is only traditional, the phrase gets bit 2 set.
1:15 PM If both a character which is exclusively simplified and a character
which is exclusively traditional are in the same phrase, that is kind of weird,
but the phrase would then probably have to be classified as both as well, i.e.
set bit 1 and bit 2.
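The classification rules described above can be sketched as a bitwise AND over the per-character categories. The function name `phrase_category` and the tiny hand-written table are hypothetical. Characters missing from the table are assumed to be valid in both scripts.

```python
BOTH = 3  # bit 1 (simplified) | bit 2 (traditional)

# Hypothetical hand-written table: 1 = simplified only, 2 = traditional only.
table = {'东': 1, '東': 2}


def phrase_category(phrase, table):
    """Combine the categories of all characters in a phrase."""
    category = BOTH
    for char in phrase:
        # Characters not in the table keep both bits set.
        category &= table.get(char, BOTH)
    # An empty intersection means the phrase mixes exclusively simplified
    # and exclusively traditional characters: classify it as both.
    return category or BOTH
```

Using AND means a single simplified-only or traditional-only character narrows the whole phrase, while the mixed case falls back to both bits.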
6 minutes
1:21 PM Yuwei YU: that's a good idea, but the phrases are not only single
characters. At this point, you need to use a loop to go through each character,
and this would be much slower
1:22 PM maybe we can try to use gbk instead of gb2312
1:23 PM me: That won't really help because with gbk we would get the opposite
problem.
1:24 PM For example, the character 晞 is in gbk and just replacing gb2312 with
gbk in the current detection code would classify that character as simplified
Chinese and one would not be able to type it anymore in traditional Chinese
mode.
1:26 PM Yes, I thought about looping over all characters in the phrase.
1:27 PM That should be fast enough because phrases are not very long, lookup in
such a simple hash table is fast and this detection is only done when creating
the .db file, i.e. at build time and not while the user is typing. So this does
not need to be extremely fast.
1:28 PM Yuwei YU: detection would be used in user's phrases as well
it needs speed
1:29 PM me: user’s phrase is what is in ~/.ibus/tables/ ?
Yuwei YU: If I remember correctly
1:30 PM me: 东 for example is convertible to both gbk and big5hkscs:
$ echo -n 东 | iconv -f utf-8 -t big5hkscs
mfabian@ari:~
$ echo -n 东 | iconv -f utf-8 -t gbk
¶«mfabian@ari:~
$
So that doesn't really tell you that the character is simplified Chinese.
1:31 PM I feel that trying to detect whether characters are simplified or
traditional Chinese by test converting to legacy encodings cannot really work
well.
1:33 PM Yuwei YU: yes, I see. so the method needs improvement.
me: My guess at the moment is that iterating over the characters in a phrase
and checking a hash table should be plenty fast.
1:34 PM But I can be sure only when I really do it and benchmark it ...
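One way to measure this would be a small `timeit` comparison like the sketch below. The function names, the hand-written table and the sample phrase are all hypothetical, and the encoding variant is a simplified Python 3 rendering of the existing check. The numbers depend on the machine and the Python version, so no result is claimed here.

```python
import timeit

BOTH = 3
TABLE = {'东': 1, '東': 2}  # hypothetical hand-written table


def by_lookup(phrase):
    """Classify by looking up every character in a dict."""
    category = BOTH
    for char in phrase:
        category &= TABLE.get(char, BOTH)
    return category or BOTH


def by_encoding(phrase):
    """Classify by test-encoding to gb2312 and big5hkscs."""
    category = 0
    for bit, encoding in ((1, 'gb2312'), (2, 'big5hkscs')):
        try:
            phrase.encode(encoding)
            category |= bit
        except UnicodeEncodeError:
            pass
    return category


def benchmark(phrase, number=100000):
    """Return the total time in seconds for each approach."""
    return {
        'lookup': timeit.timeit(lambda: by_lookup(phrase), number=number),
        'encoding': timeit.timeit(lambda: by_encoding(phrase), number=number),
    }


if __name__ == '__main__':
    print(benchmark('晞东東'))
```

Running it with phrases of different lengths would show how the per-character loop scales compared with the two codec calls.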
1:35 PM Yuwei YU: it can't be faster than encode, unless you write it in C
1:39 PM me: About the user phrase thing, when is this used?
1:40 PM For example, if I delete all .db tables in ~/.ibus/tables, then restart
ibus, then type something with wubi, then restart ibus again, then
~/.ibus/tables/wubi-jidian86-user.db contains the phrases I just typed.
And when I type the same phrases again, they are preferred.
1:41 PM Are the user phrases you mentioned above (“detection would be used in
user's phrases as well”) the phrases which get inserted into
~/.ibus/tables/wubi-jidian86-user.db ?
1:42 PM Yuwei YU: yes, and user's new phrases are stored as well
me: Then I wonder why I do not see debug messages from this
simplified/traditional detection when phrases get inserted into the user db.
1:43 PM I added 2 print statements:
mfabian@ari:/usr/share/ibus-table/engine
$ diff -u tabsqlitedb.py.~1~ tabsqlitedb.py
--- tabsqlitedb.py.~1~ 2012-09-13 15:51:30.000000000 +0200
+++ tabsqlitedb.py 2012-09-17 13:39:05.504724080 +0200
@@ -475,6 +475,7 @@
user_freq = 0
# now we will set the category bits if this is chinese
if self._is_chinese:
+ print "mike simplified traditional detection"
# this is the bitmask we will use,
# from low to high, 1st bit is simplify Chinese,
# 2nd bit is traditional Chinese,
@@ -676,6 +677,7 @@
This method is called in table.py by passing UserInput held data
Return result[:]
'''
+ print "mike in select_words"
# firstly, we make sure the len we used is equal or less than the max key length
_len = min( len(tabkeys),self._mlen )
_condition = ''
mfabian@ari:/usr/share/ibus-table/engine
$
Yuwei YU: new phrase, not the already known
1:44 PM me: Then deleted all user .db files, restarted ibus, typed something,
restarted ibus.
1:45 PM Yuwei YU: you type something, but use left shift to form a new phrase
me: The phrases I typed are in the user .db after doing that.
Ah, and I saw the debug messages now.
So you are right, this detection code is run when inserting phrases into the
user .db.
1:49 PM I probably missed the debug messages because I retyped phrases which
already were in the user .db.
1:50 PM Yuwei YU: yes
5 minutes
1:56 PM me:
http://libunihan.sourceforge.net/ is interesting.
1:58 PM libunihan was apparently created to solve
https://bugzilla.redhat.com/show_bug.cgi?id=227792 which looks similar to our
problem.
2:03 PM Yuwei YU: yes
me: Thinking about using libunihan ...
2:04 PM Yuwei YU: a good idea