Automatic extraction of translations files


Jean-Baptiste Holcroft

Sunday, 9 August 2020

4:22 p.m.

Hello internationalization team,

I would like to provide translation memories for translators and measure localization progress across versions, and I need to learn how to interact with Fedora package builds.

This would basically look like:

0. for each new version of a Fedora package
1. identify and extract the srpm content
2. identify the localization files
3. download every existing po file
4. produce translation memories and statistics

Steps 1 to 4 are "easy". But for step 0, I have no idea how to do it. How can I get some kind of notification when a new package build is created (whether it is a new package or an update of an existing one)?

I feel like transtats does something similar, and I think I could take some hints from this project, but how do I get started?

All of this should probably run in one or multiple openshift scripts, but I know how to get help for this.

Thanks for your help,
Jean-Baptiste


Sundeep Anand

Monday, 10 August

12:29 a.m.

On Mon, Aug 10, 2020 at 2:52 AM Jean-Baptiste Holcroft <jean-baptiste(a)holcroft.fr> wrote:

...

The "Sync Package Build System" job at https://transtats.fedoraproject.org/jobs/yml-based does the same! And now its results seem quite accurate. Kindly refer to https://www.youtube.com/watch?v=RHPtsIHNIgg [at 4:20] for a demo.

In fact this feature is available through:
- API: https://transtats.fedoraproject.org/api-docs/#job
- CLI: http://docs.transtats.org/en/latest/client.html

I really feel that if we extend this functionality a bit, it may cover this requirement. I mean, if we get an option to append "Translation Memory" along with "Transtats Statistics" for each job, I guess that would be the answer? How do you feel?

...

All of this should probably run in one or multiple openshift scripts, but I know how to get help for this.

Thanks for your help,
Jean-Baptiste
_______________________________________________
i18n mailing list -- i18n(a)lists.fedoraproject.org
To unsubscribe send an email to i18n-leave(a)lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/i18n@lists.fedoraproject.org

jean-baptiste＠holcroft.fr

1:03 a.m.

Thanks for your answer, but I don't want to be tied to transtats. This tool requires much more technical knowledge to contribute to than I have.

In addition, it requires per-project configuration, which means it covers 80 projects, while Fedora has thousands more srpms.

-- Jean-Baptiste

Sundeep Anand

2:07 a.m.

On Mon, Aug 10, 2020 at 11:34 AM <jean-baptiste(a)holcroft.fr> wrote:

...

Thanks for your answer, but I don't want to be tied to transtats. This tool requires much more technical knowledge to contribute to than I have.

I guess it's just about consuming a couple of APIs for you! Obviously I'll do the 'extend' part in Transtats. Moreover, if that helps, we can do a demo as well.

...

In addition, it requires per-project configuration, which means it covers 80 projects, while Fedora has thousands more srpms.

The "Sync Package Build System" job works for any package which is built in Fedora, irrespective of its presence in Transtats! It's just that the job is currently limited to `PO` files [1].

The idea behind developing the "Transtats Jobs Framework" is to have a centralized processor through which we can solve multiple sets of problems in a flexible way, because it has information about the Fedora Release (and its Schedule), Package Source Repo, Package Translation Platform, and Package Build System; and, interestingly, we can expand them [2]. Multiple sets of problems can be captured and executed in the form of Job Templates. Jobs are YAML based and hence flexible: we can edit them, so jobs can be unique even though they originate from the same template.

I agree, the development pace of Transtats is slow; however, it can deliver better results if tweaked in the required way!

Thanks, Jean-Baptiste!

[1] https://github.com/transtats/transtats/issues/149
[2] https://speakerdeck.com/sundeep/use-cases-for-transtats-in-the-fedora-com...


jean-baptiste＠holcroft.fr

3 a.m.

OK, I understand, but I'm using https://pypi.org/project/translation-finder/ to find files. It can detect any kind of file supported by Weblate, which makes it quite efficient for getting translations. And the dev is really responsive (see the issues I opened last year). Any bug reported also improves our translation platform.

In addition, with this, it is easy for me to run a test on a single package and then extract files.

What would be the output of the API? Information on where to find the package? Or will it tell where to find the translation files? Or provide the translation files directly?

What information would this API require to get results? The package name?

What's the benefit of using an API to talk with transtats instead of subscribing to Fedora tooling to know when a new package goes into the pipeline, and downloading it?

In the future, I would like to reuse the translation files to produce other packages such as system-wide langpacks, eventually with new translations coming from translation memory, machine translation, or even manual modification.

-- Jean-Baptiste

Sundeep Anand

4:58 a.m.

On Mon, Aug 10, 2020 at 1:31 PM <jean-baptiste(a)holcroft.fr> wrote:

...

That's awesome - we're also thinking of evaluating the same for https://github.com/transtats/transtats/issues/149

...

In addition, with this, it is easy for me to run a test on a single package and then extract files.

What kind of tests? We can add a step in YAML for those, I guess! Currently, we execute all those written in the .spec file before we extract translation files.

...

What would be the output of the API? Information on where to find the package? Or will it tell where to find the translation files? Or provide the translation files directly?

There are two job-related APIs:

(1) To run a job: POST /api/job/run

For example:

    $ curl -d '{"job_type": "syncdownstream", "package_name": "anaconda", "build_system": "koji", "build_tag": "f33"}' -H 'Content-Type: application/json' -H 'Authorization: Token <your-transtats-api-token>' -X POST http://localhost:8080/api/job/run
    {"Success":"Job created and logged. URL: http://localhost:8080/jobs/log/2a5966a9-3e5e-4ad1-b89e-1ee0e3b1651b/detail ","job_id":"2a5966a9-3e5e-4ad1-b89e-1ee0e3b1651b"}

Here, the job is executed and logged. You can get your API token after logging in with your Fedora ID (see the drop-down under the user's email). The latest build for "f33" will be picked.

(2) To fetch job details: GET /api/job/{job_id}/log (you can get the job_id from the response of the API above or from the https://transtats.fedoraproject.org/jobs/logs page)

For example:

    $ curl https://transtats.fedoraproject.org/api/job/f483b6fd-824e-4e5b-96ae-d29fd...

The output would be JSON containing all the info you see in "Job Log (output)" on the details page of this job ID: https://transtats.fedoraproject.org/jobs/log/f483b6fd-824e-4e5b-96ae-d29f...

Please try that on https://transtats.fedoraproject.org/api-docs/

As this output already contains translation stats, we can add translation memory (in the API response) as well. In the CLI demo at https://www.youtube.com/watch?v=jgXJZRj43M0 [at 21:20] you may see that working.
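For readers who would rather script the first call than type the curl command, here is a minimal Python sketch of the same POST request. The endpoint path, payload keys, and header format are taken from the curl example in this message; the helper names `build_job_request` and `run_job` are hypothetical, and nothing here has been checked against the Transtats API documentation.

```python
import json
import urllib.request


def build_job_request(base_url, token, package_name, build_tag,
                      build_system="koji", job_type="syncdownstream"):
    """Build (but do not send) the POST /api/job/run request from the curl example."""
    payload = {
        "job_type": job_type,
        "package_name": package_name,
        "build_system": build_system,
        "build_tag": build_tag,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/job/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {token}",
        },
        method="POST",
    )


def run_job(request, timeout=30):
    """Send the request and decode the JSON reply (requires a reachable server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints what would be sent; call run_job(req) to actually submit it.
    req = build_job_request("http://localhost:8080",
                            "<your-transtats-api-token>", "anaconda", "f33")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the payload before a token is used against a real instance.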

...

What information would this API require to get results? The package name?

Please see curl command above!

...

What's the benefit of using an API to talk with transtats instead of subscribing to Fedora tooling to know when a new package goes into the pipeline, and downloading it?

Well, it's a matter of choice. Transtats runs the syncdownstream job in the background every time a new package build is detected, to keep package stats up to date.

    GET /api/package/<package-name> HTTP/1.1

So, through APIs it can be integrated into CI and test pipelines easily. Could you explain more about what API response you expect? We can work on a new template as well!

...

In the future, I would like to reuse the translation files to produce other packages such as system-wide langpacks, eventually with new translations coming from translation memory, machine translation, or even manual modification.

-- Jean-Baptiste

Ben Cotton

7:32 a.m.

On Sun, Aug 9, 2020 at 5:22 PM Jean-Baptiste Holcroft <jean-baptiste(a)holcroft.fr> wrote:

...

0. for each new version of a Fedora package
1. identify and extract the srpm content
2. identify the localization files
3. download every existing po file
4. produce translation memories and statistics

Steps 1 to 4 are "easy". But for step 0, I have no idea how to do it. How can I get some kind of notification when a new package build is created (whether it is a new package or an update of an existing one)?

Could you use datagrepper[1] to watch the buildsys category? You'd want to narrow the filter down to only include completed builds, and I'm not sure what the best approach would be, but the Infra or RelEng teams can help with that. That could be incorporated into some kind of automated process.

You could also watch for Bodhi updates instead (again using datagrepper), but that would mean missing out on Rawhide package updates. Koji builds don't necessarily get released, but you'd avoid missing anything. It depends which kind of error you prefer.

There's also the notifications[2] app, but that would require you to manually act on each notification, which probably would not scale very well. :-)

[1] https://apps.fedoraproject.org/datagrepper/
[2] https://apps.fedoraproject.org/notifications/

--
Ben Cotton
He / Him / His
Senior Program Manager, Fedora & CentOS Stream
Red Hat
TZ=America/Indiana/Indianapolis
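As a rough illustration of the datagrepper suggestion, the Python sketch below builds a query for the buildsys category and keeps only completed builds. Several details are assumptions not stated in this thread and should be confirmed with the Infra or RelEng teams: the `/raw` endpoint and its `category`, `delta`, `rows_per_page`, and `page` parameters, the `raw_messages` key in the reply, the `build.state.change` topic suffix, and the convention that `new == 1` means a Koji build is complete.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the thread only gives the datagrepper base URL.
DATAGREPPER_RAW = "https://apps.fedoraproject.org/datagrepper/raw"
# Assumed Koji state code for a completed build.
KOJI_STATE_COMPLETE = 1


def build_query_url(delta_seconds=3600, rows_per_page=100, page=1):
    """Return a URL asking for recent messages in the buildsys category."""
    params = {
        "category": "buildsys",
        "delta": delta_seconds,
        "rows_per_page": rows_per_page,
        "page": page,
    }
    return f"{DATAGREPPER_RAW}?{urllib.parse.urlencode(params)}"


def completed_builds(raw_messages):
    """Keep (name, version, release) for build messages whose new state is complete."""
    builds = []
    for message in raw_messages:
        if not message.get("topic", "").endswith("buildsys.build.state.change"):
            continue  # tag, repo, and task messages are ignored
        body = message.get("msg", {})
        if body.get("new") != KOJI_STATE_COMPLETE:
            continue  # building, failed, or cancelled builds are ignored
        builds.append((body.get("name"), body.get("version"), body.get("release")))
    return builds


def fetch_completed_builds(delta_seconds=3600):
    """Query datagrepper over the network and filter the first page of results."""
    with urllib.request.urlopen(build_query_url(delta_seconds), timeout=30) as response:
        payload = json.load(response)
    return completed_builds(payload.get("raw_messages", []))
```

The filtering is kept separate from the network call so it can be tried on saved messages first; a real job would also need to walk through every result page.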

Jean-Baptiste Holcroft

Wednesday, 12 August

12:40 a.m.

On 2020-08-10 14:32, Ben Cotton wrote:

...

Could you use datagrepper[1] to watch the buildsys category?

Thank you for this alternative, Ben. I decided to write a wiki page about my project [1].

I realized an alternative implementation could be to get the list of all available RPMs. This would avoid subscribing to messages, which would make development easier and reduce computation costs.

Can we achieve this easily with our infrastructure?

[1] https://fedoraproject.org/wiki/User:Jibecfed/LinuxLocalizationMeasurement

Ben Cotton

7:20 a.m.

On Wed, Aug 12, 2020 at 1:46 AM Jean-Baptiste Holcroft <jean-baptiste(a)holcroft.fr> wrote:

...

I realized an alternative implementation could be to get the list of all available RPMs. This would avoid subscribing to messages, which would make development easier and reduce computation costs. Can we achieve this easily with our infrastructure?

The simple way is to run `dnf list --all` against all supported versions. You'd have to store the output somewhere and diff it on your next run. There might be a better way, but the Infrastructure team would be better equipped to answer this.

The advantage of the dnf way is that you can run it in batches once a week (for example) on your own machine and, like you said, it's much easier to code against. The advantage of the datagrepper way is that you can have the actions happen automatically as a package changes.

--
Ben Cotton
He / Him / His
Senior Program Manager, Fedora & CentOS Stream
Red Hat
TZ=America/Indiana/Indianapolis
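The "store the output and diff it on your next run" idea could look roughly like the Python sketch below. It assumes each package line of the saved `dnf list --all` output has three whitespace-separated columns (name.arch, version-release, repository) and that header lines do not. The function names are hypothetical, and names long enough for dnf to wrap them onto a second line would be skipped.

```python
def parse_dnf_list(output):
    """Map 'name.arch' to 'version-release' from saved `dnf list --all` output."""
    packages = {}
    for line in output.splitlines():
        fields = line.split()
        # Assumption: package rows have exactly three columns; headers such as
        # "Available Packages" and wrapped rows do not, so they are skipped.
        if len(fields) != 3 or "." not in fields[0]:
            continue
        packages[fields[0]] = fields[1]
    return packages


def diff_snapshots(previous, current):
    """Report packages added, removed, or changed in version between two runs."""
    previous_names, current_names = set(previous), set(current)
    return {
        "added": sorted(current_names - previous_names),
        "removed": sorted(previous_names - current_names),
        "updated": sorted(
            name for name in previous_names & current_names
            if previous[name] != current[name]
        ),
    }
```

Only the packages listed under "added" and "updated" would need their srpm fetched again, which keeps a weekly batch run small.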

Jean-Baptiste Holcroft

Monday, 12 October

7:21 a.m.

New subject: Fedora 32 translation memories

TL;DR: here are the translation memories for the 318 languages, built from all software available in Fedora 32: https://jibecfed.fedorapeople.org/partage/compendium-full/

Last August, I talked about my project "to provide translation memories for translators and measure localization progress over version" [1].

Thanks to darknao's help with automation, I'm now able to analyze the whole Fedora Linux distribution.

I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The constraint is to use Hugo, to allow this website to be localized.

For Fedora 32, it means:

* 21 000 srpm extracted (source of rpm packages)
* 121 000 po files detected (other formats exist, but I'm starting with this one), which represents 7 GiB of data

From that, I deduced:

* 318 languages. For each of them, it produces:
** a compendium [2]
** a terminology [3]
** a translation memory (tmx file)

Please do not hesitate to suggest improvements to the generation of these files. I used very basic commands and did no cleaning.

Next steps (by priority):

* measure each language's status for a Fedora release
* display results in a static website for a Fedora release
* allow comparing releases
* all other ideas we may have

-- Jean-Baptiste

[1] https://lists.fedoraproject.org/archives/list/i18n@lists.fedoraproject.or...
[2] http://docs.translatehouse.org/projects/translate-toolkit/en/latest/comma...
[3] http://docs.translatehouse.org/projects/translate-toolkit/en/latest/comma...
[4] http://docs.translatehouse.org/projects/translate-toolkit/en/latest/comma...
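For anyone unfamiliar with the tmx format mentioned above, the Python sketch below shows what a translation memory file contains by building a tiny one from (source, translation) pairs. It is not how the published files were generated: the references above point to Translate Toolkit commands, whereas this uses only the standard library. The function name and the header values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, source_lang, target_lang):
    """Return a TMX 1.4 document holding each distinct (source, target) pair once."""
    root = ET.Element("tmx", version="1.4")
    # Header values are placeholders chosen for this sketch.
    ET.SubElement(root, "header", {
        "creationtool": "sketch",
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "po",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "PlainText",
    })
    body = ET.SubElement(root, "body")
    seen = set()
    for source, target in pairs:
        # Skip untranslated entries and exact duplicates.
        if not source or not target or (source, target) in seen:
            continue
        seen.add((source, target))
        unit = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, source), (target_lang, target)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("Open", "Ouvrir"), ("Close", "Fermer")], "en", "fr"))
```

Dropping empty and duplicate pairs is the only cleaning done here; the encoding and duplicate-key problems raised later in the thread would need more than that.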

Benson Muite

7:31 a.m.

New subject: Fedora 32 translation memories

On 10/12/20 3:21 PM, Jean-Baptiste Holcroft wrote:

...

I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The constraint is to use Hugo, to allow this website to be localized.

This is of interest.

...

Please do not hesitate to suggest improvements in the generation of these files. I used very basic commands and did no cleaning.

Do you have these steps in a set of scripts/programs?

...

Next steps (by priority):

* measure each language's status for a Fedora release
* display results in a static website for a Fedora release
* allow comparing releases

An automated task can probably be scheduled to do this every few days.

* all other ideas we may have

Jean-Baptiste Holcroft

Tuesday, 13 October

4:15 a.m.

New subject: Fedora 32 translation memories

On 2020-10-12 14:31, Benson Muite wrote:

...

On 10/12/20 3:21 PM, Jean-Baptiste Holcroft wrote:
> Please do not hesitate to suggest improvements in the generation of these files.
> I used very basic commands and did no cleaning.

Do you have these steps in a set of scripts/programs?

Yes, look at "compute_lang" function in this PR: https://pagure.io/fedora-localization-statistics/pull-request/8#request_diff

Ben Cotton

Monday, 12 October

8:48 a.m.

New subject: Fedora 32 translation memories

On Mon, Oct 12, 2020 at 8:22 AM Jean-Baptiste Holcroft <jean-baptiste(a)holcroft.fr> wrote:

...

I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The constraint is to use Hugo, to allow this website to be localized.

The easiest route, if you don't already have something in mind, is to have a "group" site on fedorapeople. You can control access through a FAS group and then deploy to the site via SFTP. That's what we do for the schedule at https://fedorapeople.org/groups/schedule/ . The Infrastructure team can help with that.

--
Ben Cotton
He / Him / His
Senior Program Manager, Fedora & CentOS Stream
Red Hat
TZ=America/Indiana/Indianapolis

Jean-Baptiste Holcroft

9:28 a.m.

New subject: Fedora 32 translation memories

On 2020-10-12 15:48, Ben Cotton wrote:

...

On Mon, Oct 12, 2020 at 8:22 AM Jean-Baptiste Holcroft <jean-baptiste(a)holcroft.fr> wrote:
> I would like to make it a Fedora initiative and publish these files in an official Fedora website.
> Would someone be willing to help? Constraints is to use Hugo to allow this website to be localized.

The easiest route, if you don't already have something in mind, is to have a "group" site on fedorapeople. You can control access through a FAS group and then deploy to the site via SFTP. That's what we do for the schedule at https://fedorapeople.org/groups/schedule/ . The Infrastructure team can help with that.

Interesting, thank you. Unfortunately, the URL isn't nice enough; I'm thinking of something.fedoraproject.org. Most important for me is to have access to statistics, so that we know whether someone downloads the files or not :p

Jens-Ulrik Petersen

Tuesday, 13 October

5:34 a.m.

New subject: Fedora 32 translation memories

On Mon, Oct 12, 2020 at 8:22 PM Jean-Baptiste Holcroft <jean-baptiste(a)holcroft.fr> wrote:

...

Thanks Jean-Baptiste, this is very interesting indeed.

...

I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The constraint is to use Hugo, to allow this website to be localized.

I can't help wondering if there is any way to integrate this with Transtats in the future.

...

For Fedora 32, it means:

* 21 000 srpm extracted (source of rpm packages)
* 121 000 po files detected (other formats exist, but I'm starting with this one), which represents 7 GiB of data

From that, I deduced:

* 318 languages. For each of them, it produces:
** a compendium [2]
** a terminology [3]
** a translation memory (tmx file)

Sundeep Anand

Wednesday, 14 October

1:51 a.m.

New subject: Fedora 32 translation memories

On Tue, Oct 13, 2020 at 4:20 PM Jens-Ulrik Petersen <petersen(a)redhat.com> wrote:

...

On Mon, Oct 12, 2020 at 8:22 PM Jean-Baptiste Holcroft <jean-baptiste(a)holcroft.fr> wrote:
> TL;DR: here are the translation memories for the 318 languages, built from all software available in Fedora 32:
> https://jibecfed.fedorapeople.org/partage/compendium-full/
>
> Last august, I talked about my project "to provide translation memories for translators and measure localization progress over version" [1].
>
> Thanks to darknao's help with automation, I'm now able to analyze the whole Fedora Linux distribution.

Thanks Jean-Baptiste, this is very interesting indeed.

> I would like to make it a Fedora initiative and publish these files in an official Fedora website.
> Would someone be willing to help? Constraints is to use Hugo to allow this website to be localized.

I can't help wondering if there is any way to integrate this with Transtats in the future.

this may align with https://github.com/transtats/transtats/issues/178

...

For Fedora 32, it means:

* 21 000 srpm extracted (source of rpm packages)
* 121 000 po files detected (other formats exist, but I'm starting with this one), which represents 7 GiB of data

From that, I deduced:

* 318 languages. For each of them, it produces:
** a compendium [2]
** a terminology [3]
** a translation memory (tmx file)

Jean-Baptiste Holcroft

2:27 a.m.

New subject: Fedora 32 translation memories

On 2020-10-14 08:51, Sundeep Anand wrote:

...

On Tue, Oct 13, 2020 at 4:20 PM Jens-Ulrik Petersen <petersen(a)redhat.com> wrote:
> On Mon, Oct 12, 2020 at 8:22 PM Jean-Baptiste Holcroft
>> I would like to make it a Fedora initiative and publish these files in an official Fedora website.
>> Would someone be willing to help? Constraints is to use Hugo to allow this website to be localized.
>
> I can't help wondering if there is any way to integrate this with Transtats in the future.

this may align with https://github.com/transtats/transtats/issues/178

As said on IRC:

* transtats is focused on packages being out of sync, which isn't something I really worry about
* the technology used by transtats is too complex for me to easily step in
* I'm also unsure about the use cases transtats covers, and it probably requires more promotion and measurement of the impact it has
* I'll be happy to have a transtats front-end and a dev to develop contribution features. For example, I have files with missing encoding, duplicated unique keys, obvious errors to fix, etc.
* merging these two initiatives probably means rewriting transtats, which is a hard decision to make

But I'm talking as an individual here; if transtats doesn't answer my use case, it can still be useful for other users/personas.

How can we seriously discuss this and arrive at realistic options? Next Flock? A dedicated event?

Jean-Baptiste

Sundeep Anand

3:30 a.m.

New subject: Fedora 32 translation memories

Hi,

Thank you and congratulations, Jean-Baptiste, for collating the data, that's huge! The next big thing would be to make it easy to consume.

Talking about Transtats: it tries to bring a lot of things together, which is its drawback and its strength at the same time, and which makes it complex!

On Wed, Oct 14, 2020 at 12:57 PM Jean-Baptiste Holcroft <jean-baptiste(a)holcroft.fr> wrote:

...

On 2020-10-14 08:51, Sundeep Anand wrote:
> On Tue, Oct 13, 2020 at 4:20 PM Jens-Ulrik Petersen <petersen(a)redhat.com> wrote:
>> On Mon, Oct 12, 2020 at 8:22 PM Jean-Baptiste Holcroft
>>> I would like to make it a Fedora initiative and publish these files in an official Fedora website.
>>> Would someone be willing to help? Constraints is to use Hugo to allow this website to be localized.
>>
>> I can't help wondering if there is any way to integrate this with Transtats in the future.
>
> this may align with https://github.com/transtats/transtats/issues/178

As said on IRC:

* transtats is focused on package out of sync, which isn't something I really worry about

As Transtats is evolving, package out of sync is one of its focus areas.

...

* the technology used by transtats is too complex for me to easily step in

It's a plain Django application. https://github.com/transtats/transtats/blob/devel/CONTRIBUTING.md could be a good starting point.

...

* I'm also unsure about the use cases transtats covers, and it probably requires more promotion and measurement of the impact it has

An effort to refine those use cases and tweak them is underway!

...

* I'll be happy to have a transtats front-end and a dev to develop contribution features. For example, I have files with missing encoding, duplicated unique keys, obvious errors to fix, etc.

Perhaps Transtats can treat https://jibecfed.fedorapeople.org/partage/compendium-full/ as one big source of translations? In the past, I did something similar with map-filter-reduce (in Hadoop) for a comparable scenario and the same desired results.

...

* merging these two initiatives probably means rewriting transtats, which is a hard decision to make

I guess rewrite is not required. We may just need a job consuming your datasets.

...

But I'm talking as an individual here; if transtats doesn't answer my use case, it can still be useful for other users/personas.

Somehow, Transtats deals with multi-product / multi-tenancy environments and may be developed for multiple teams / stakeholders, hence defining development priority in one direction sometimes looks challenging.

...

How can we seriously discuss this and arrive at realistic options? Next Flock? A dedicated event? Jean-Baptiste
