[Fwd: Re: [lingu-dev] Spell check dictionary update]

Nicolas Mailhot nicolas.mailhot at laposte.net
Mon Oct 1 08:00:25 UTC 2007


This may be something that could be integrated into
translate.fedoraproject.org.

-------------------------- Original Message --------------------------
Subject:  Re: [lingu-dev] Spell check dictionary update
From:     Harri Pitkänen <hatapitk at cc.jyu.fi>
Date:     Sun, 30 September 2007 20:34
To:       dev at lingucomponent.openoffice.org
----------------------------------------------------------------------

Hi!

On Sunday 30 September 2007, Robert Ludvik wrote:
> ...
> In just a few words: people can send words that are not yet in the
> spell check dictionary through a web form or with the help of a macro,
> which is for now only available for OOo but could be ported to MSO,
> KOffice(?). Relevant people (linguists) would then review the sent
> words and accept them for inclusion in dictionaries or reject them.
> The dictionaries are in a form that can be used for Mozilla and
> KOffice products as well.
> I'd like to open a discussion about this. If you are interested, you
> can read some more at http://r.aufbix.org/spell/, especially a *draft*
> proposal of how this could be done
> (http://r.aufbix.org/spell/spell-workflow.pdf or
> http://r.aufbix.org/spell/spell-workflow.odg, if you prefer)

I can offer some comments, because our development workflow for the
Finnish spell checker shares some features with your draft and has been
in use for about a year now.

- We do not have an OOo macro for sending suggestions, but I think it is
a great idea. We do have a web form [1] though. The form consists of a
field to enter the word, a drop-down box for selecting the type of the
word ("general vocabulary", "computing vocabulary", "medical
vocabulary", ... , "foreign words", "dialects", "words that should be
removed from current vocabulary") and a free-form text box for
explaining the word if it needs an explanation. The form has not been
very popular; on average we get about one word per day through it. It
could be that we should have advertised it more.

Previously we had a form that only contained a field to enter the word
and a drop-down box for word class. That one was initially perhaps too
popular; it was occasionally misused by spamming it with useless
strings. We have never collected any personal information through these
forms. We only track the user's IP address to limit incoming suggestions
to 20 words/IP/day to prevent misuse. But some smart person worked
around that limitation by using Tor to access the form... So I recommend
building the system so that the database can be easily cleaned up if
something like this happens.
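
The per-IP daily limit and the "easy cleanup" advice can be sketched
roughly as follows. This is only an illustration, not the code of our
web application: the class SuggestionStore, its methods and the
in-memory list are hypothetical, and a real system would keep the rows
in a database. Only the limit of 20 words per IP per day comes from the
description above.

```python
from datetime import date

MAX_WORDS_PER_IP_PER_DAY = 20  # the limit quoted in the message


class SuggestionStore:
    """Hypothetical in-memory stand-in for the suggestion table."""

    def __init__(self):
        self.rows = []  # each row is a (word, ip, day) tuple

    def submit(self, word, ip, day):
        """Store the word unless this IP has reached the limit for that day."""
        used = sum(1 for _, row_ip, row_day in self.rows
                   if row_ip == ip and row_day == day)
        if used >= MAX_WORDS_PER_IP_PER_DAY:
            return False
        self.rows.append((word, ip, day))
        return True

    def purge(self, predicate):
        """Delete every row for which predicate(row) is true; return the count."""
        before = len(self.rows)
        self.rows = [row for row in self.rows if not predicate(row)]
        return before - len(self.rows)
```

Keeping the source address and the date with each suggestion is what
makes cleanup cheap: store.purge(lambda row: row[2] == some_day) would
drop everything received on one bad day. As the Tor incident shows, the
limit itself can be circumvented, so the purge step matters more than
the counter.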

It should be noted that Finland has a population of only 5 million
people. And the majority of Finnish OOo users (especially on Windows)
are still using a non-free spell checker (released around 2002) for
which our word suggestion form is useless. Therefore most language teams
could probably expect this type of form to be more popular than what we
have experienced.

- The review system we use is a lot simpler than the one in your draft.
We only have one compulsory review step for the suggested words, where a
registered user of the system either rejects the suggestion or moves it
to the master database and populates the new record with the necessary
meta information (inflection class etc.). However, the system maintains
a change log [2] of all changes made to the master database. Our project
has three active contributors, and we more or less regularly check each
other's changes from the log. So in practice there is an extra round of
reviews, although it is not enforced by the software.
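
As a rough sketch of this single review step plus change log (not our
actual code; MasterDatabase, review and changes_by_others are names
made up for this example, and the metadata keys are placeholders):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MasterDatabase:
    """Hypothetical master word list with an append-only change log."""
    words: dict = field(default_factory=dict)       # word -> metadata
    change_log: list = field(default_factory=list)  # one entry per change

    def review(self, word, reviewer, accept, metadata=None):
        """The one compulsory step: reject, or move the word to the master list."""
        if not accept:
            return False  # rejected suggestions never reach the master list
        self.words[word] = dict(metadata or {})
        self.change_log.append({
            "time": datetime.now(timezone.utc),
            "reviewer": reviewer,
            "word": word,
        })
        return True

    def changes_by_others(self, reviewer):
        """Log entries made by other people, for the informal second look."""
        return [e for e in self.change_log if e["reviewer"] != reviewer]
```

The second round of review is then just each contributor reading
changes_by_others() for their own name now and then; nothing in the
code forces it, which matches how we work.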

I think that for a small team like ours this simplified review works
just fine. We do not have any professional linguists in this project
anyway. I suppose this is the case for many other languages too. So if
possible, it would be nice to be able to merge the non-linguist and
linguist reviews in case some teams cannot afford to have both.

- The role of the technician at the end of the process is more or less
similar in our process and your draft. The only problem we have is that
our spell checker implementation does not allow merging dictionaries at
runtime. This is why there is currently no easy way for the users to add
medical etc. dictionaries, which in turn discourages people from
contributing to them. This is a technical problem that we must solve
later. I believe that Hunspell does not have this problem.


Of course the code of our web application is available to any teams who
wish to use it, since it is under the GPL. The core code has been
designed to be language independent and the application itself can be
localised using po files. But it does have a major limitation in that
the same database cannot be used simultaneously for multiple languages,
and most of the technical documentation has simply not been written.
Also, it is written in Python, not PHP, and there is no export
capability for the Hunspell format (yet). So I think that your proposed
workflow, macros and PHP scripts will offer a better initial design for
solving the dictionary update and maintenance problem for many
languages.
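
For what it is worth, a Hunspell export would not need to be large. A
Hunspell .dic file starts with a line giving the (approximate) number
of entries, followed by one word per line with optional affix flags
after a slash. A minimal sketch, assuming the words are already
available as a mapping from word to flag string; the function name and
the example flag are hypothetical, and the flags would have to be
defined in a matching .aff file, which is not shown here:

```python
def to_hunspell_dic(entries):
    """Render a word -> affix-flags mapping as the text of a .dic file.

    An empty flag string means the word is written without a slash.
    """
    lines = [str(len(entries))]  # first line: number of entries
    for word in sorted(entries):
        flags = entries[word]
        lines.append(word + "/" + flags if flags else word)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "A" is a made-up flag used only for illustration.
    print(to_hunspell_dic({"kissa": "A", "ja": ""}), end="")
```

The hard part is not the file format but mapping our inflection classes
to sensible affix rules, which this sketch does not attempt.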

Harri

[1] http://joukahainen.lokalisointi.org/ehdotasanoja
[2] http://joukahainen.lokalisointi.org/query/listchanges





-- 
Nicolas Mailhot




