Pseudo-locales for i18n testing by English speakers

Sean Flanigan sflaniga at redhat.com
Thu Oct 2 08:01:55 UTC 2008


Ding-Yi Chen wrote:
> The pseudo locale is intriguing, and I assume it helps to some degree.
> However, this approach does have its own limitation:

Of course, pseudo-localisation testing is not the same as localisation
testing in every Fedora language, but it's something!

> 1. Lack of font support: as the attachment "lack_of_font.png" shows, the
> pseudo locale might be rendered useless if all the developers can see is
> Unicode boxes. :-P

That tells me the developers should install better fonts - how else can
they test an internationalised application?  But to be honest, I
probably shouldn't have used
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols since
they're only guaranteed to be available in certain mathematical fonts
such as Code2001.  I really need to find some latinesque characters
that come from neither the BMP nor the maths section!
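
For what it's worth, here's a rough Python sketch (not my actual
script, just the idea) of that kind of substitution: map ASCII letters
onto the Mathematical Bold letters, which all sit outside the BMP:

    def pseudo(text):
        out = []
        for ch in text:
            if "A" <= ch <= "Z":
                # U+1D400 is MATHEMATICAL BOLD CAPITAL A
                out.append(chr(0x1D400 + ord(ch) - ord("A")))
            elif "a" <= ch <= "z":
                # U+1D41A is MATHEMATICAL BOLD SMALL A
                out.append(chr(0x1D41A + ord(ch) - ord("a")))
            else:
                out.append(ch)
        return "".join(out)

    s = pseudo("Open File")
    print(s)
    # every substituted letter is outside the BMP:
    print(all(ord(c) > 0xFFFF for c in s if c != " "))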

Apparently Zimbra loses (without trace) the 'e' characters in my
pseudotranslation.  Bad Zimbra!

As long as it's only a couple of characters, I think having some unusual
characters is okay: you can still work out what's going on, at least
well enough to resolve the problem by installing more fonts.

> Perhaps we should specify the minimal font set as
> a remedy.

Before running pseudo-localised apps, you mean?  Good idea.  I found a
webapp that gives the names of Unicode characters -
<http://rishida.net/scripts/uniview/uniview.php>.  Just paste text into
the "cut & paste" field and hit enter.

But how can I find the name of the font which provides a given
character?  I can tell you that all my pseudo-characters are readable on
my computer, but I can't tell you where they come from.

Once I work out what fonts my pseudo-locale requires, I'd be happy to
share the info as a dependency list.

Perhaps it would make sense to define a small Fedora package which
specifies certain Unicode fonts as dependencies and enables the
hypothetical pseudo-locale support in glibc.

> 2. It doesn't really solve the language-specific problem. Take Chinese
> character sorting for example: characters can be sorted by
> Pinyin, Zhuyin, radical, number of strokes, or "natural" order such as
> numeral characters. The sorting is impossible to verify without that
> knowledge.

True, but a pseudo-locale which uses reverse sorting can at least show
whether an app is using internationalised sorting or plain old ASCII
ordering.
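
To illustrate the idea (this is only a sketch, not a real locale
definition): if the pseudo-collation simply reversed the usual code
point order, an app that honours the locale's collation would show the
reversed order, while one hard-wired to byte comparison would not:

    # A "reverse" pseudo-collation key.  A correctly internationalised
    # app would sort with the locale's collation (simulated here); an
    # app doing plain ASCII/byte comparison keeps the ordinary order.
    def reverse_collation_key(s):
        return [-ord(c) for c in s]

    words = ["apple", "Banana", "cherry"]
    print(sorted(words))                             # plain code point order
    print(sorted(words, key=reverse_collation_key))  # what the pseudo-locale would show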

And we're not limited to what Microsoft did - I don't know much about
Chinese character sorting, but we could probably come up with a couple
of alternative sorts that could be understood by an English-speaking
developer.  But I don't want to tackle that just yet!

> Still, the idea itself is good. And surely it filters out some of the
> bugs without the help of translators.

I expect a lot of i18n/L10n bugs are not picked up until someone tests
one of the affected languages.  Some of those bugs could show up in a
pseudo-locale much earlier, which has to be an improvement.

For instance, I've already found bugs where Eclipse and joe mess up the
cursor position when editing SMP characters, without personally knowing
any SMP languages.

As an English-only developer I think it's also pretty cool to see
whether my code is at least partly internationalised, something I
otherwise can't check for myself at all, except in a foreign language.
I think some English-only developers might take more interest in i18n
issues if they could easily see the results for themselves.

And for those i18n issues which can be demonstrated with a
pseudo-locale, it can be easier for multiple developers to talk about
something which is in "English", since most developers speak English,
even if they have differing native languages.

> Since the main purpose of pseudo locale is for testing, shall we agree
> on a list of pseudo locales which have their own specified behaviour?

I think it would be good if we could fit in with Vista's chosen
pseudo-locale IDs, as listed here:
http://blogs.msdn.com/shawnste/archive/2006/06/27/647915.aspx

As I said, we certainly don't have to emulate MS completely, but I think
we should use qps for the language code.  See
http://blogs.msdn.com/michkap/archive/2007/02/04/1596987.aspx

As for the behaviours, I expect that they will change as we learn more
from testing feedback, but here are some ideas:

a. simple character substitution, keeping rendered text about the same size
b. character substitution with expansion (eg "[--- original text ---]")
to make strings longer - there's a rough sketch of (a) and (b) after
this list
c. maybe swapping upper and lower case.  It's sometimes handy to have
more than one pseudo-locale, eg to make sure a web client is not just
seeing the server's locale, so a spare locale or two wouldn't hurt.
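
Here's that sketch in Python (the exact character mappings and markers
are only placeholders):

    # (a) simple substitution: swap some ASCII letters for accented look-alikes
    SUBST = str.maketrans("AEIOUaeiou", "ÀÉÎÕÜàéîõü")

    def pseudo_a(msg):
        return msg.translate(SUBST)

    # (b) substitution plus expansion, to flush out truncated layouts
    def pseudo_b(msg):
        return "[--- " + msg.translate(SUBST) + " ---]"

    print(pseudo_a("Open File"))  # Õpén Fîlé
    print(pseudo_b("Open File"))  # [--- Õpén Fîlé ---]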

And we could have options like different sort orders.  But I'd be happy
to start with (a) or (b) and leave sort orders until a bit later.  At
least with (a) and (b) it's easy to see whether someone forgot to call
gettext(), because the plain English strings will stick out.
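
For example (just a sketch - "myapp" and the locale directory are made
up), with Python's gettext module a wrapped string comes back
pseudo-translated while an unwrapped one stays in plain English:

    import gettext

    # Assumes a compiled catalogue for the hypothetical qps pseudo-locale
    # at locale/qps/LC_MESSAGES/myapp.mo; fallback=True keeps this
    # runnable even without it.
    t = gettext.translation("myapp", localedir="locale",
                            languages=["qps"], fallback=True)
    _ = t.gettext

    print(_("Save file"))          # wrapped: shows up pseudo-translated
    print("Quit without saving")   # forgot _(): stays plain English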

-- 
Sean Flanigan

Senior Software Engineer
Engineering - Internationalisation
Red Hat
