Hello internationalization team,
I would like to provide translation memories for translators and measure localization progress over versions, and I need to learn how to interact with Fedora package builds.
This would basically look like:
0. for each new version of a Fedora package
1. identify and extract the srpm content
2. identify the localization files
3. download every existing po file
4. produce translation memories and statistics
Steps 1 to 4 are "easy". But for step 0, I have no idea how to do it. How can I get some kind of notification when a new package is created (whether it is a brand-new one or an update of an existing one)?
I feel like transtats does something similar, and I think I could take some hints from this project, but how do I get started?
All of this should probably run in one or multiple openshift scripts, but I know how to get help for this.
Thanks for your help, Jean-Baptiste
On Mon, Aug 10, 2020 at 2:52 AM Jean-Baptiste Holcroft < jean-baptiste@holcroft.fr> wrote:
Hello internationalization team,
I would like to provide translation memories for translators and measure localization progress over versions, and I need to learn how to interact with Fedora package builds.
This would basically look like:
0. for each new version of a Fedora package
1. identify and extract the srpm content
2. identify the localization files
3. download every existing po file
4. produce translation memories and statistics
Steps 1 to 4 are "easy". But for step 0, I have no idea how to do it. How can I get some kind of notification when a new package is created (whether it is a brand-new one or an update of an existing one)?
I feel like transtats does something similar, and I think I could take some hints from this project, but how do I get started?
The "Sync Package Build System" job at https://transtats.fedoraproject.org/jobs/yml-based does the same, and its results now seem quite accurate. Please see https://www.youtube.com/watch?v=RHPtsIHNIgg [at 4:20] for a demo. This feature is also available through:
- API: https://transtats.fedoraproject.org/api-docs/#job
- CLI: http://docs.transtats.org/en/latest/client.html
I really feel that if we extend this functionality a bit, it could meet this requirement. I mean, if we add an option to attach a "Translation Memory" alongside the "Transtats Statistics" for each job, I guess that would be the answer? How do you feel?
All of this should probably run in one or multiple openshift scripts, but I know how to get help for this.
Thanks for your help, Jean-Baptiste
_______________________________________________
i18n mailing list -- i18n@lists.fedoraproject.org
To unsubscribe send an email to i18n-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/i18n@lists.fedoraproject.org
Thanks for your answer, but I don't want to be linked to transtats.
Contributing to this tool requires much more technical knowledge than I have.
In addition, it requires per-project configuration, which is why it covers 80 projects while Fedora has thousands more SRPMs.
-- Jean-Baptiste
On Mon, Aug 10, 2020 at 11:34 AM jean-baptiste@holcroft.fr wrote:
Thanks for your answer, but I don't want to be linked to transtats.
Contributing to this tool requires much more technical knowledge than I have.
I guess for you it's just a matter of consuming a couple of APIs! Obviously, I'll do the 'extend' part in Transtats. Moreover, if it helps, we can do a demo as well.
In addition, it requires per-project configuration, which is why it covers 80 projects while Fedora has thousands more SRPMs.
The "Sync Package Build System" job works for any package built in Fedora, irrespective of its presence in Transtats! It's just that the job is currently limited to `PO` files [1].
The idea behind developing the "Transtats Jobs Framework" is to have a centralized processor through which we can solve multiple sets of problems in a flexible way. This is possible because it has information about the Fedora Release (and its Schedule), Package Source Repo, Package Translation Platform, and Package Build System; and, interestingly, we can expand them [2].
Multiple sets of problems can be captured and executed in the form of Job Templates. Jobs are YAML based and therefore flexible: we can edit them. Hence, jobs can be unique even though they originate from the same template.
I agree that the development pace of Transtats is slow; however, it can deliver better results if tweaked in the required way!
Thanks, Jean-Baptiste!
[1] https://github.com/transtats/transtats/issues/149
[2] https://speakerdeck.com/sundeep/use-cases-for-transtats-in-the-fedora-commun...
--
Jean-Baptiste
OK, I understand, but I'm using https://pypi.org/project/translation-finder/ to find files.
It can detect any kind of file supported by Weblate, which makes it quite efficient for getting translations. The developer is also really responsive (see the issues I opened last year), and any bug reported also improves our translation platform.
In addition, with this, it is easy for me to run a test on a single package and then extract the files.
What would be the output of the API? Information on where to find the package? Or will it tell me where to find the translation files? Or provide the translation files directly?
What information would this API require to return results? The package name?
What's the benefit of using an API to talk with transtats instead of subscribing to Fedora tooling to know when a new package goes through the pipeline and then downloading it?
In the future, I would like to reuse the translation files to produce other packages, such as system-wide langpacks, possibly with new translations coming from translation memory, machine translation, or even manual modification.
-- Jean-Baptiste
On Mon, Aug 10, 2020 at 1:31 PM jean-baptiste@holcroft.fr wrote:
OK, I understand, but I'm using https://pypi.org/project/translation-finder/ to find files.
It can detect any kind of file supported by Weblate, which makes it quite efficient for getting translations. The developer is also really responsive (see the issues I opened last year), and any bug reported also improves our translation platform.
That's awesome. We're also thinking of evaluating it for https://github.com/transtats/transtats/issues/149
In addition, with this, it is easy for me to run a test on a single package and then extract the files.
What kind of tests? We can add a step in the YAML for those, I guess! Currently, we run everything written in the *.spec* file before we extract translation files.
What would be the output of the API? Information on where to find the package? Or will it tell me where to find the translation files? Or provide the translation files directly?
There are two job-related APIs:
(1) to run a job: POST /api/job/run
for example:
    $ curl -d '{"job_type": "syncdownstream", "package_name": "anaconda", "build_system": "koji", "build_tag": "f33"}' \
        -H 'Content-Type: application/json' \
        -H 'Authorization: Token <your-transtats-api-token>' \
        -X POST http://localhost:8080/api/job/run
    {"Success":"Job created and logged. URL: http://localhost:8080/jobs/log/2a5966a9-3e5e-4ad1-b89e-1ee0e3b1651b/detail ","job_id":"2a5966a9-3e5e-4ad1-b89e-1ee0e3b1651b"}
Here, the job is executed and logged. You can get your API token after logging in with your Fedora ID (see the drop-down under the user's email). The latest build for "f33" will be picked.
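For anyone who prefers scripting this over curl, here is a minimal Python sketch of the same POST /api/job/run call. The endpoint, payload keys and headers are copied from the curl example; the function names `build_job_request` and `run_job` are hypothetical, and the base URL and token are placeholders you would supply yourself.

```python
import json
import urllib.request


def build_job_request(base_url, token, package_name, build_tag,
                      job_type="syncdownstream", build_system="koji"):
    """Build (but do not send) the POST request for /api/job/run."""
    payload = {
        "job_type": job_type,
        "package_name": package_name,
        "build_system": build_system,
        "build_tag": build_tag,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/job/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Token " + token,
        },
        method="POST",
    )


def run_job(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values: replace the token with your own before sending.
request = build_job_request(
    "http://localhost:8080", "<your-transtats-api-token>", "anaconda", "f33"
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it possible to inspect the payload before anything touches the server; calling `run_job(request)` performs the actual call.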
(2) to fetch job details: GET /api/job/{job_id}/log
(you can get the job_id from the response of the API above or from the https://transtats.fedoraproject.org/jobs/logs page)
for example:
    $ curl https://transtats.fedoraproject.org/api/job/f483b6fd-824e-4e5b-96ae-d29fd2df...
The output is JSON containing all the info you see in "Job Log (output)" on the details page of this job ID: https://transtats.fedoraproject.org/jobs/log/f483b6fd-824e-4e5b-96ae-d29fd2d...
Please try that on https://transtats.fedoraproject.org/api-docs/
As this output already contains translation stats, we can add translation memory (in the API response) as well. In the CLI demo at https://www.youtube.com/watch?v=jgXJZRj43M0 [at 21:20] you can see that working.
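To show how the two endpoints chain together, here is a small Python sketch that takes the job_id from a "run a job" response and builds the matching GET /api/job/{job_id}/log URL. The sample response is the one printed in the curl example for POST /api/job/run; `job_log_url` and `fetch_job_log` are hypothetical helper names, and the sketch assumes the log endpoint can be read without a token, as in the curl example.

```python
import json
import urllib.request


def job_log_url(base_url, job_id):
    """Return the URL of the log endpoint for one job."""
    return "{}/api/job/{}/log".format(base_url.rstrip("/"), job_id)


def fetch_job_log(base_url, job_id):
    """Download and decode the JSON log of one job."""
    with urllib.request.urlopen(job_log_url(base_url, job_id)) as response:
        return json.load(response)


# Response body copied from the POST /api/job/run example.
run_response = json.loads(
    '{"Success":"Job created and logged. URL: '
    'http://localhost:8080/jobs/log/2a5966a9-3e5e-4ad1-b89e-1ee0e3b1651b/detail ",'
    '"job_id":"2a5966a9-3e5e-4ad1-b89e-1ee0e3b1651b"}'
)
print(job_log_url("http://localhost:8080", run_response["job_id"]))
```

Calling `fetch_job_log` with the same arguments would return the JSON described above; its exact fields are not listed in this thread, so the sketch does not assume any.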
What information would this API require to return results? The package name?
Please see the curl command above!
What's the benefit of using an API to talk with transtats instead of subscribing to Fedora tooling to know when a new package goes through the pipeline and then downloading it?
Well, it's a matter of choice. Transtats runs the syncdownstream job in the background every time a new package build is detected, to keep package stats up to date.
    GET /api/package/<package-name> HTTP/1.1
So, through the APIs, it can be integrated into CI and test pipelines easily. Could you explain more about what API response you expect? We can work on a new template as well!
In the future, I would like to reuse the translation files to produce other packages, such as system-wide langpacks, possibly with new translations coming from translation memory, machine translation, or even manual modification.
-- Jean-Baptiste
On Sun, Aug 9, 2020 at 5:22 PM Jean-Baptiste Holcroft jean-baptiste@holcroft.fr wrote:
0. for each new version of a Fedora package
1. identify and extract the srpm content
2. identify the localization files
3. download every existing po file
4. produce translation memories and statistics
Steps 1 to 4 are "easy". But for step 0, I have no idea how to do it. How can I get some kind of notification when a new package is created (whether it is a brand-new one or an update of an existing one)?
Could you use datagrepper[1] to watch the buildsys category? You'd want to narrow the filter down to only include completed builds, and I'm not sure what the best approach would be, but the Infra or RelEng teams can help with that. That could be incorporated into some kind of automated process. You could also watch for Bodhi updates instead (again using datagrepper), but that would mean missing out on Rawhide package updates. Koji builds don't necessarily get released, but you'd avoid missing anything. It depends which kind of error you prefer.
There's also the notifications[2] app, but that would require you to manually act on each notification, which probably would not scale very well. :-)
[1] https://apps.fedoraproject.org/datagrepper/
[2] https://apps.fedoraproject.org/notifications/
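As a rough illustration of the datagrepper idea, here is a Python sketch that builds a query for the buildsys category and keeps only completed builds. Several details are assumptions rather than facts from this thread: the `/raw` endpoint with `category`, `delta`, `rows_per_page` and `page` parameters, the `buildsys.build.state.change` topic suffix, and a message body carrying `new`, `name`, `version` and `release`, with `new == 1` meaning COMPLETE. Please check them against the datagrepper and Koji documentation before relying on this.

```python
from urllib.parse import urlencode

# Assumption: datagrepper exposes its search under /raw.
DATAGREPPER_RAW = "https://apps.fedoraproject.org/datagrepper/raw"
# Assumption: Koji's numeric code for a COMPLETE build.
KOJI_COMPLETE = 1


def buildsys_query_url(delta_seconds, rows_per_page=100, page=1):
    """Build a query URL for buildsys messages from the last delta_seconds."""
    params = {
        "category": "buildsys",
        "delta": delta_seconds,
        "rows_per_page": rows_per_page,
        "page": page,
    }
    return DATAGREPPER_RAW + "?" + urlencode(params)


def completed_builds(raw_messages):
    """Keep only build state changes whose new state is COMPLETE."""
    done = []
    for message in raw_messages:
        body = message.get("msg", {})
        is_state_change = message.get("topic", "").endswith(
            "buildsys.build.state.change"
        )
        if is_state_change and body.get("new") == KOJI_COMPLETE:
            done.append(
                (body.get("name"), body.get("version"), body.get("release"))
            )
    return done


print(buildsys_query_url(3600))
```

The filter runs on already-decoded messages, so it can be tried on saved samples before any polling is automated.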
On 2020-08-10 14:32, Ben Cotton wrote:
Could you use datagrepper[1] to watch the buildsys category?
Thank you for this alternative, Ben. I decided to write a wiki page about my project [1].
I realized an alternative implementation could be to get the list of all available RPMs. This would avoid subscribing to messages, making it easier to develop and reducing computation costs.
Can we achieve this easily with our infrastructure?
[1] https://fedoraproject.org/wiki/User:Jibecfed/LinuxLocalizationMeasurement
On Wed, Aug 12, 2020 at 1:46 AM Jean-Baptiste Holcroft jean-baptiste@holcroft.fr wrote:
I realized an alternative implementation could be to get the list of all available RPMs. This would avoid subscribing to messages, making it easier to develop and reducing computation costs.
Can we achieve this easily with our infrastructure?
The simple way is to run `dnf list --all` against all supported versions. You'd have to store the output somewhere and diff it on your next run. There might be a better way, but the Infrastructure team would be better equipped to answer this. The advantage of the dnf way is that you can run it in batches once a week (for example) on your own machine and, like you said, it's much easier to code against. The advantage of the datagrepper way is that you can have the actions happen automatically as a package changes.
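The "store the output and diff it on your next run" part could look like the Python sketch below. It assumes each package line of `dnf list --all` has three whitespace-separated columns (name.arch, version-release, repository) and simply skips anything else, such as section headers; long names that dnf wraps over two lines would be missed by this simplification. The function names are hypothetical.

```python
def parse_dnf_list(text):
    """Map 'name.arch' to 'version-release' for every package line."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        # Package lines have exactly three columns; headers do not.
        if len(fields) == 3 and "." in fields[0]:
            packages[fields[0]] = fields[1]
    return packages


def diff_snapshots(old, new):
    """Return (added, updated, removed) package names between two runs."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    updated = sorted(
        name for name in set(old) & set(new) if old[name] != new[name]
    )
    return added, updated, removed


last_week = parse_dnf_list(
    "Available Packages\n"
    "foo.noarch 1.0-1.fc32 fedora\n"
    "bar.x86_64 2.0-1.fc32 fedora\n"
)
this_week = parse_dnf_list(
    "Available Packages\n"
    "foo.noarch 1.1-1.fc32 updates\n"
    "baz.x86_64 0.1-1.fc32 fedora\n"
)
added, updated, removed = diff_snapshots(last_week, this_week)
print(added, updated, removed)
```

Only the added and updated packages would need their SRPMs downloaded and analyzed again on each run.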
TL;DR: here are the translation memories for the 318 languages, built from all software available in Fedora 32: https://jibecfed.fedorapeople.org/partage/compendium-full/
Last August, I talked about my project "to provide translation memories for translators and measure localization progress over versions" [1].
Thanks to darknao's help with automation, I'm now able to analyze the whole Fedora Linux distribution.
I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The one constraint is to use Hugo, so that the website can be localized.
For Fedora 32, it means:
* 21 000 srpm extracted (source of rpm packages)
* 121 000 po files detected (other formats exist, but I'm starting with this one), which represents 7 GiB of data
From that, I derived:
* 318 languages. For each of them, it produces:
** a compendium [2]
** a terminology [3]
** a translation memory (tmx file) [4]
Please do not hesitate to suggest improvements in the generation of these files. I used very basic commands and did no cleaning.
Next steps (by priority):
* measure each language status for a Fedora release
* display results in a static website for a Fedora release
* allow comparing releases
* all other ideas we may have
-- Jean-Baptiste
[1] https://lists.fedoraproject.org/archives/list/i18n@lists.fedoraproject.org/t...
[2] http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands...
[3] http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands...
[4] http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands...
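For readers unfamiliar with the terms, here is a toy Python sketch of what a per-language compendium and a progress figure amount to. It is not the translate-toolkit commands referenced above: it works on plain (msgid, msgstr) pairs rather than real PO files, and it assumes a "first translation wins" rule for conflicts, which is a simplification chosen only for this illustration. Both function names are hypothetical.

```python
def build_compendium(catalogs):
    """Merge catalogs of (msgid, msgstr) pairs, keeping the first translation."""
    compendium = {}
    for catalog in catalogs:
        for msgid, msgstr in catalog:
            # Skip untranslated entries and keep the first translation seen.
            if msgstr and msgid not in compendium:
                compendium[msgid] = msgstr
    return compendium


def translation_progress(catalog):
    """Return the fraction of entries that have a non-empty translation."""
    entries = list(catalog)
    if not entries:
        return 0.0
    translated = sum(1 for _, msgstr in entries if msgstr)
    return translated / len(entries)


# Two made-up French catalogs standing in for po files of two packages.
package_a = [("File", "Fichier"), ("Edit", "")]
package_b = [("File", "Dossier"), ("Edit", "Édition"), ("Quit", "Quitter")]

print(build_compendium([package_a, package_b]))
print(translation_progress(package_a))
```

A real run would read the entries from the detected po files, one compendium per language, and would need a deliberate policy for conflicting translations of the same string.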
On 10/12/20 3:21 PM, Jean-Baptiste Holcroft wrote:
I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The one constraint is to use Hugo, so that the website can be localized.
This is of interest.
Please do not hesitate to suggest improvements in the generation of these files. I used very basic commands and did no cleaning.
Do you have these steps in a set of scripts/programs?
Next steps (by priority):
- measure each language status for a Fedora release
- display results in a static website for a Fedora release
- allow comparing releases
We can probably schedule an automated task to do this every few days.
- all other ideas we may have
On 2020-10-12 14:31, Benson Muite wrote:
On 10/12/20 3:21 PM, Jean-Baptiste Holcroft wrote:
Please do not hesitate to suggest improvements in the generation of these files. I used very basic commands and did no cleaning.
Do you have these steps in a set of scripts/programs?
Yes, look at the "compute_lang" function in this PR: https://pagure.io/fedora-localization-statistics/pull-request/8#request_diff
On Mon, Oct 12, 2020 at 8:22 AM Jean-Baptiste Holcroft jean-baptiste@holcroft.fr wrote:
I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The one constraint is to use Hugo, so that the website can be localized.
The easiest route, if you don't already have something in mind, is to have a "group" site on fedorapeople. You can control access through a FAS group and then deploy to the site via SFTP. That's what we do for the schedule at https://fedorapeople.org/groups/schedule/ . The Infrastructure team can help with that.
On 2020-10-12 15:48, Ben Cotton wrote:
On Mon, Oct 12, 2020 at 8:22 AM Jean-Baptiste Holcroft jean-baptiste@holcroft.fr wrote:
I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The one constraint is to use Hugo, so that the website can be localized.
The easiest route, if you don't already have something in mind, is to have a "group" site on fedorapeople. You can control access through a FAS group and then deploy to the site via SFTP. That's what we do for the schedule at https://fedorapeople.org/groups/schedule/ . The Infrastructure team can help with that.
Interesting, thank you. Unfortunately, the URL isn't nice enough; I'm thinking of something.fedoraproject.org. Most important for me is to have access to statistics, so that we know whether anyone downloads the files or not :p
On Mon, Oct 12, 2020 at 8:22 PM Jean-Baptiste Holcroft < jean-baptiste@holcroft.fr> wrote:
TL;DR: here are the translation memories for the 318 languages, built from all software available in Fedora 32: https://jibecfed.fedorapeople.org/partage/compendium-full/
Last August, I talked about my project "to provide translation memories for translators and measure localization progress over versions" [1].
Thanks to darknao's help with automation, I'm now able to analyze the whole Fedora Linux distribution.
Thanks Jean-Baptiste, this is very interesting indeed.
I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The one constraint is to use Hugo, so that the website can be localized.
I can't help wondering if there is any way to integrate this with Transtats in the future.
On Tue, Oct 13, 2020 at 4:20 PM Jens-Ulrik Petersen petersen@redhat.com wrote:
On Mon, Oct 12, 2020 at 8:22 PM Jean-Baptiste Holcroft < jean-baptiste@holcroft.fr> wrote:
TL;DR: here are the translation memories for the 318 languages, built from all software available in Fedora 32: https://jibecfed.fedorapeople.org/partage/compendium-full/
Last August, I talked about my project "to provide translation memories for translators and measure localization progress over versions" [1].
Thanks to darknao's help with automation, I'm now able to analyze the whole Fedora Linux distribution.
Thanks Jean-Baptiste, this is very interesting indeed.
I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The one constraint is to use Hugo, so that the website can be localized.
I can't help wondering if there is any way to integrate this with Transtats in the future.
this may align with https://github.com/transtats/transtats/issues/178
On 2020-10-14 08:51, Sundeep Anand wrote:
On Tue, Oct 13, 2020 at 4:20 PM Jens-Ulrik Petersen petersen@redhat.com wrote:
On Mon, Oct 12, 2020 at 8:22 PM Jean-Baptiste Holcroft
I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The one constraint is to use Hugo, so that the website can be localized.
I can't help wondering if there is any way to integrate this with Transtats in the future.
this may align with https://github.com/transtats/transtats/issues/178
As said on IRC:
* transtats is focused on packages being out of sync, which isn't something I really worry about
* the technology used by transtats is too complex for me to easily step in
* I'm also unsure about the use cases transtats covers, and it probably requires more promotion and measurement of the impact it has
* I'll be happy to have a transtats front-end and have a dev to develop contribution features. For example, I have files with missing encoding, duplicated unique keys, obvious errors to fix, etc.
* merging these two initiatives probably means rewriting transtats, which is a hard decision to take
But I'm talking as an individual here: even if transtats doesn't answer my use case, it can still be useful for other users/personas.
How can we seriously discuss this and arrive at realistic options? Next Flock? A dedicated event?
Jean-Baptiste
Hi,
Thank you and congratulations, Jean-Baptiste, for collating the data. That's huge! The next big thing would be to make the data easy to consume.
Talking about Transtats: it tries to bring a lot of things together, which is its drawback and its strength at the same time, and which makes it complex!
On Wed, Oct 14, 2020 at 12:57 PM Jean-Baptiste Holcroft < jean-baptiste@holcroft.fr> wrote:
On 2020-10-14 08:51, Sundeep Anand wrote:
On Tue, Oct 13, 2020 at 4:20 PM Jens-Ulrik Petersen petersen@redhat.com wrote:
On Mon, Oct 12, 2020 at 8:22 PM Jean-Baptiste Holcroft
I would like to make it a Fedora initiative and publish these files on an official Fedora website. Would someone be willing to help? The one constraint is to use Hugo, so that the website can be localized.
I can't help wondering if there is any way to integrate this with Transtats in the future.
this may align with https://github.com/transtats/transtats/issues/178
As said on IRC:
- transtats is focused on packages being out of sync, which isn't something I really worry about
As Transtats is evolving, packages being out of sync is one of its focus areas.
- the technology used by transtats is too complex for me to easily step in
It's a plain Django application. https://github.com/transtats/transtats/blob/devel/CONTRIBUTING.md could be a good starting point.
- I'm also unsure about the use cases transtats covers, and it probably requires more promotion and measurement of the impact it has
An effort to refine those use cases and tweak them is underway!
- I'll be happy to have a transtats front-end and have a dev to develop contribution features. For example, I have files with missing encoding, duplicated unique keys, obvious errors to fix, etc.
Perhaps Transtats could treat https://jibecfed.fedorapeople.org/partage/compendium-full/ as one big source of translations? In the past, I did something similar with map-filter-reduce (in Hadoop) in the same scenario, for the same desired results.
- merging these two initiatives probably means rewriting transtats, which is a hard decision to take
I guess a rewrite is not required. We may just need a job consuming your datasets.
But I'm talking as an individual here: even if transtats doesn't answer my use case, it can still be useful for other users/personas.
Transtats deals with multi-product / multi-tenancy environments and may be developed for multiple teams / stakeholders, so setting development priorities in a single direction sometimes looks challenging.
How can we seriously discuss this and arrive at realistic options?
Next Flock? A dedicated event?
Jean-Baptiste