Hello Fedora developers,
I'd like to show you a proposal for a new XML format of modular metadata which reside in YUM repositories.
In short, I propose replacing the YAML syntax with an XML syntax, removing features which were never implemented or used, and providing a detailed specification that leaves little room for implementers' invention. The proposed specification is the "reduced" variant under https://github.com/fedora-modularity/libmodulemd/tree/main/xml_specs, for instance https://github.com/fedora-modularity/libmodulemd/blob/main/xml_specs/reduced/overview.xml.
Bear in mind that this change is only about how the modules are stored in YUM repositories which are fetched by DNF. It does not change how modules are defined by module maintainers (YAML modulemd-packager-v3 or modulemd-v2 format), nor how they are built by MBS and handled by Bodhi.
Those who should be concerned most are DNF5 developers and relengs producing composes.
Long story:
The original modulemd format had a noble property: the input format for MBS was the same as the output format. This is not true anymore because of the modulemd-packager-v3 format. That also makes validation difficult, as fields optional in the input format are mandatory in the output format, or vice versa.
The original modulemd format drags YAML into a YUM repository which is otherwise XML-only. That requires a YAML parser.
The original modulemd format is not handled by DNF directly. Instead, DNF uses the libmodulemd library. That library is heavily based on glib. In fact, it embeds glib types into its API. Why do I mention it? Because the new DNF5 aims to eradicate glib, mostly to shrink container installations. librepo and libmodulemd are the last pieces with glib. Because it's impossible to remove glib from libmodulemd, there has to be a new library for parsing modular metadata. If there has to be a new library, there could be a transition from YAML to XML, which would shrink the minimal installation further by removing libyaml.
The original modulemd format possesses some features which nobody uses, or nobody implements, or which are implemented only partially. Do you remember the deprecation of intents from modularity https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/RXDP2WMPR3HHBRTQAKPSTRU6KABTJSMA/#RXDP2WMPR3HHBRTQAKPSTRU6KABTJSMA? There are more things that can be removed to make the format and its parser simpler.
The original format is not well specified. DNF and Satellite people complained a lot when they were implementing it. The specification looks more like an example. E.g. a module stream name is probably a string. An arbitrary string. With spaces, with new lines. I think you do not want to see a stream named " :\n". Well, DNF does not even allow you to identify a module like that. There is definitely room for tightening the format. But each change like that is technically an incompatible change. To materialize the change we need at least a new modulemd format version. But if we need a new format version, we can actually come up with a completely new format.
As you can see, there are good reasons to come up with a new in-repository format. Hence here it is https://github.com/fedora-modularity/libmodulemd/tree/main/xml_specs.
I originally developed the XML format to be able to encode all features we have in the old YAML format. That's kept for your reference in "complete" subdirectory https://github.com/fedora-modularity/libmodulemd/tree/main/xml_specs/complete.
Then I removed all unnecessary features and put it into "reduced" subdirectory https://github.com/fedora-modularity/libmodulemd/tree/main/xml_specs/reduced.
If you are interested in it, I recommend starting with overview.xml file. It shows a skeleton of the format. It's so small I can quote it here:
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="" revision="">
  <module name="">
    <stream name="">
      <!-- DNF wants versions and contexts to differ in @summary etc. -->
      <build version="" context="" static="" arch="" summary="" description="">
        <!-- @static defaults to false. -->
        <dependency name="">
          <requires></requires> <!-- Only one for modulemd-packager-v3 -->
          <conflicts></conflicts> <!-- Not supported by modulemd-packager-v3 -->
        </dependency>
        <dependency name=""/> <!-- An unspecified stream. Not supported by modulemd-packager-v3. -->
        <license>
          <module></module>
          <content></content>
        </license>
        <references comunity="" documentation="" tracker=""/>
        <profile name="" description="">
          <package></package>
        </profile>
        <api></api>
        <demodularized></demodularized>
        <nevra name="" epoch="" version="" release="" arch=""/>
      </build>
      <default-profile modified=""> <!-- @modified could be renamed to version -->
        <profile></profile> <!-- With a value replaces, missing unsets. -->
      </default-profile>
      <obsolete modified="" context="">
        <!-- @modified in seconds since the epoch. Missing or empty @context means all contexts. -->
        <eol when="" message=""> <!-- Missing element means unsetting. -->
          <!-- @when in seconds since the epoch, missing means now. -->
          <replacement module="" stream=""/>
        </eol>
      </obsolete>
      <translation modified=""> <!-- @modified could be renamed to version -->
        <locale name=""> <!-- Each of the child is optional, but there must be at least one. -->
          <build summary="" description=""/> <!-- missing @summary, @description unsets -->
          <profile name="" description=""/> <!-- missing @description unsets -->
          <obsolete context="" message=""/> <!-- missing or empty @context means all contexts, missing @message unsets, unsupported in YAML. -->
        </locale>
      </translation>
    </stream>
    <default-stream modified="" stream=""/> <!-- @modified could be renamed to version -->
    <!-- Existing @stream sets a default, missing or empty unsets. -->
  </module>
</index>
As you can see, there are no separate documents for modules and default streams. Everything is kept inside one document. That enables properties (e.g. obsoletes or default profiles) pertaining to the same entity (e.g. a stream) to be placed together. That avoids repeating the identifiers (e.g. stream names) and makes the format more succinct and easier to query. That's especially important for DNF, which needs to quickly learn the list of modules and the streams of each module, find out the latest build, etc.
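To illustrate the querying point, here is a minimal Python sketch that walks one index document and reports the highest build version per module stream. The element and attribute names come from the skeleton; the sample data and the helper name `latest_builds` are made up for illustration, and comparing @version as an integer is an assumption rather than something taken from the specification.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://fedoraproject.org/metadata/moduleindex"}

# Hypothetical sample data shaped like the skeleton; not taken from example.xml.
SAMPLE = """\
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="1" revision="0">
  <module name="perl">
    <stream name="5.34">
      <build version="20220101" context="aaaa" arch="x86_64" summary="" description=""/>
      <build version="20220503" context="aaaa" arch="x86_64" summary="" description=""/>
    </stream>
    <stream name="5.36">
      <build version="20220601" context="bbbb" arch="x86_64" summary="" description=""/>
    </stream>
  </module>
</index>
"""


def latest_builds(xml_text):
    """Map (module, stream) to the highest build @version, compared as an integer."""
    root = ET.fromstring(xml_text)
    result = {}
    for module in root.findall("m:module", NS):
        for stream in module.findall("m:stream", NS):
            versions = [int(b.get("version")) for b in stream.findall("m:build", NS)]
            if versions:
                result[(module.get("name"), stream.get("name"))] = max(versions)
    return result


print(latest_builds(SAMPLE))
```

Because builds are nested under their stream and streams under their module, a single pass over the tree is enough; no identifiers have to be matched across separate documents.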
An example.xml file shows what real data would look like https://github.com/fedora-modularity/libmodulemd/blob/main/xml_specs/reduced/example.xml. You can see, e.g., that time stamps are encoded as a number of seconds since the Unix epoch. That will save DNF from parsing e-mail date notations, handling time zones, etc.
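As a small illustration of why the epoch encoding is convenient, the following Python sketch turns such an attribute value into a UTC time stamp with a single integer conversion. The value 1651651200 and the helper name `epoch_to_utc` are hypothetical and not taken from example.xml.

```python
from datetime import datetime, timezone


def epoch_to_utc(value):
    """Convert a seconds-since-epoch attribute value to an aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# A hypothetical @modified or @when value.
print(epoch_to_utc("1651651200").isoformat())
```

Comparing two such values needs no date parsing at all; comparing the integers is sufficient.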
There is also a formal specification in the form of an XML Schema https://github.com/fedora-modularity/libmodulemd/blob/main/xml_specs/reduced/schema.xsd, and a tests subdirectory with a preliminary set of good and bad examples that respectively pass and fail validation.
I'd be glad to hear any comments on the format.
A grand plan for how to implement and deploy this format is outlined in the top-level README.md https://github.com/fedora-modularity/libmodulemd/blob/main/xml_specs/README.md. Basically, support will be added to the createrepo_c tool so that it produces the XML data in YUM repositories. Then the format will be consumed by DNF5. (Just to clarify, the currently missing support for modules in DNF5 is not caused by this new XML format. DNF5 will support modules in the old YAML format soon through the libmodulemd library.) According to my consultation with the DNF team, DNF5 plans to prefer the XML format if both XML and YAML exist in a repository.
-- Petr
On Wed, Dec 7, 2022 at 8:23 AM Petr Pisar ppisar@redhat.com wrote:
At first glance, this looks great! I'll try to spend some time to dig into it more when I get time, but I'm really happy to finally see this!
On Wed, Dec 07, 2022 at 02:23:18PM +0100, Petr Pisar wrote:
Those who should be concerned most are DNF5 developers and relengs producing composes.
There are third party repositories which also publish modular metadata. I know this because in yum.theforeman.org we do this. Do we fall under relengs producing composes?
If you are interested in it, I recommend starting with overview.xml file. It shows a skeleton of the format. It's so small I can quote it here:
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="" revision=""> <module name=""> <stream name=""> <!-- DNF wants versions and contexts to differ in @summary etc. --> <build version="" context="" static="" arch="" summary="" description=""> <!-- @static defaults to false. --> ... <references comunity="" documentation="" tracker=""/>
Is comunity a typo for community?
A grand plan how to implement and deploy this format is outlined in top-level README.md https://github.com/fedora-modularity/libmodulemd/blob/main/xml_specs/README.md. Basically it will be injected into createrepo_c tool to produce the XML data in YUM repositories.
I really like integration into the existing tooling like createrepo_c. That was a massive gap in functionality. The plan is rather light on these details so I'd be interested to see how you plan to expose this functionality.
On Wed, Dec 07, 2022 at 02:43:32PM +0100, Ewoud Kohl van Wijngaarden wrote:
On Wed, Dec 07, 2022 at 02:23:18PM +0100, Petr Pisar wrote:
Those who should be concerned most are DNF5 developers and relengs producing composes.
There are third party repositories which also publish modular metadata. I know this because in yum.theforeman.org we do this. Do we fall under relengs producing composes?
Yes. If you produce a repository with modular metadata, then you are the target audience.
I believe that DNF will retain support for YAML format for some (rather long) period.
If you are interested in it, I recommend starting with overview.xml file. It shows a skeleton of the format. It's so small I can quote it here:
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="" revision=""> <module name=""> <stream name=""> <!-- DNF wants versions and contexts to differ in @summary etc. --> <build version="" context="" static="" arch="" summary="" description=""> <!-- @static defaults to false. --> ... <references comunity="" documentation="" tracker=""/>
Is comunity a typo for community?
Yes. A typo. Thanks for the review. I fixed it now.
A grand plan how to implement and deploy this format is outlined in top-level README.md https://github.com/fedora-modularity/libmodulemd/blob/main/xml_specs/README.md. Basically it will be injected into createrepo_c tool to produce the XML data in YUM repositories.
I really like integration into the existing tooling like createrepo_c. That was a massive gap in functionality. The plan is rather light on these details so I'd be interested to see how you plan to expose this functionality.
Yes, this has not yet been planned in detail. It's indeed handy that createrepo_c automatically recognizes YAML files.
My plan is to enhance createrepo_c so that when it sees the YAML files, it will put them into repodata as now and in addition convert them into XML and put the XML files into repodata.
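To give a rough idea of what such a conversion step involves, here is a minimal Python sketch. It is not createrepo_c code and it does not parse YAML; it assumes the YAML has already been loaded into a dict holding a few modulemd-v2 `data` fields, and the mapping of those fields onto the attributes of the skeleton is an assumption made for illustration. The function name `stream_to_index` and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://fedoraproject.org/metadata/moduleindex"


def stream_to_index(data):
    """Build an <index> tree from a dict holding a few modulemd-v2 'data' fields."""
    ET.register_namespace("", NS)  # serialize with a default namespace, no prefix
    index = ET.Element(f"{{{NS}}}index", {"version": "1", "revision": "0"})
    module = ET.SubElement(index, f"{{{NS}}}module", {"name": data["name"]})
    stream = ET.SubElement(module, f"{{{NS}}}stream", {"name": str(data["stream"])})
    # Assumed one-to-one mapping of YAML fields to <build> attributes.
    ET.SubElement(stream, f"{{{NS}}}build", {
        "version": str(data["version"]),
        "context": data["context"],
        "arch": data["arch"],
        "summary": data["summary"],
        "description": data["description"],
    })
    return index


# Hypothetical input, as if loaded from a modulemd-v2 YAML document.
parsed = {
    "name": "perl", "stream": "5.34", "version": 3720220503131308,
    "context": "a1b2c3d4", "arch": "x86_64",
    "summary": "Practical Extraction and Report Language",
    "description": "Perl interpreter and core modules.",
}
print(ET.tostring(stream_to_index(parsed), encoding="unicode"))
```

A real converter would also have to merge several YAML documents (builds, defaults, obsoletes, translations) of the same stream into one element, which this sketch leaves out.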
Naturally there will be a new option to disable this feature and to specify a version of the XML format. That's to maintain a repository format for older distributions. Whether producing XML will be the default mode of operation, or it will be on demand will depend on createrepo_c maintainers. I guess they have better insight on what the default behaviour should look like than me.
-- Petr
I will always applaud any attempt at standardizing & documenting the metadata format, and I was never thrilled with glib, so this sounds great to me - I only wish that it had been this way from the beginning :)
In practice I am not certain that Satellite (and similar tools) can prefer the XML metadata precisely because it is cut down, so in repos which contain both Yaml and XML metadata it will not be possible to recreate the original Yaml from the metadata in the XML without losing the "packager" specific bits, should they exist. Perhaps that is actually fine, but it makes me uncomfortable, and we have to support the Yaml parsing anyway due to the distros which will only ever support Yaml, which unfortunately makes it the greatest common denominator.
I presume there are no plans to remove the Yaml metadata from repos entirely?
On Wed, Dec 07, 2022 at 02:45:28PM -0000, Daniel Alley wrote:
In practice I am not certain that Satellite (and similar tools) can prefer the XML metadata precisely because it is cut down, so in repos which contain both Yaml and XML metadata it will not be possible to recreate the original Yaml from the metadata in the XML without losing the "packager" specific bits, should they exist. Perhaps that is actually fine, but it makes me uncomfortable, and we have to support the Yaml parsing anyway due to the distros which will only ever support Yaml, which unfortunately makes it the greatest common denominator.
I presume there are no plans to remove the Yaml metadata from repos entirely?
There are no plans to remove YAML from repositories. The modular metadata are relatively small compared to other data in the repository, so there is no pressure to remove the YAML files.
-- Petr
On Wed, Dec 7, 2022 at 8:40 AM Petr Pisar ppisar@redhat.com wrote: ...
As you can see, there are no separate documents for modules and default streams. Everything is kept inside one document. That enables properties (e.g. obsoletes or default profiles) pertaining the same entity (e.g. a stream) to be placed together. That prevents from repeating the identifiers (e.g. stream names) and makes the format more succinct and easier for querying.
To provide a bit of context here: the output format containing all of the modules, streams and defaults together makes perfect sense. Please make sure to keep in mind that the input format still needs to recognize at least some of these differences. The reason is that the default stream must be specified on a per-distribution/release basis. Its input file has to therefore be independent from the module stream definition. Initially, the modulemd design had both default streams and default profiles specified as content that was to be managed by the distribution, rather than the module maintainer. We later realized that the default profile selection should be left up to that stream's maintainer. Unfortunately, our output format still maintained it as part of the modulemd-defaults document. This is part of why we created the modulemd-packager format. This format enabled maintainers to specify their preferred stream defaults in the packager document and the result would be output that translated that into the modulemd-defaults format.
If we're going the route of entirely replacing the output format, then this is definitely a place we can improve upon. But please keep the default stream selection independent from the stream definition.
Regarding glib: it was chosen entirely because DNF (at the time) was already using it, so it would theoretically simplify the consumption of libmodulemd. If DNF5 has moved away from glib, I don't see any reason why libmodulemd couldn't do the same. However, since DNF isn't the only consumer of libmodulemd, I'd very much like to see this new parser implementation made available as an external library with a public API that DNF5 consumes, rather than as an internal detail of DNF. While I understand (and even like) that XML gives you syntax validation capabilities, libmodulemd was also capable of recognizing logical errors (such as specifying a default profile that doesn't exist in the stream). A library that can provide such hand-holding would be very valuable to anyone who intends to consume the new format. The API can also guide the consumer to the data they care about, rather than forcing them to parse the XML directly.
On Wed, Dec 07, 2022 at 04:31:36PM -0500, Stephen Gallagher wrote:
On Wed, Dec 7, 2022 at 8:40 AM Petr Pisar ppisar@redhat.com wrote: ...
As you can see, there are no separate documents for modules and default streams. Everything is kept inside one document. That enables properties (e.g. obsoletes or default profiles) pertaining the same entity (e.g. a stream) to be placed together. That prevents from repeating the identifiers (e.g. stream names) and makes the format more succinct and easier for querying.
To provide a bit of context here: the output format containing all of the modules, streams and defaults together makes perfect sense. Please make sure to keep in mind that the input format still needs to recognize at least some of these differences. The reason is that the default stream must be specified on a per-distribution/release basis. Its input file has to therefore be independent from the module stream definition. Initially, the modulemd design had both default streams and default profiles specified as content that was to be managed by the distribution, rather than the module maintainer. We later realized that the default profile selection should be left up to that stream's maintainer. Unfortunately, our output format still maintained it as part of the modulemd-defaults document. This is part of why we created the modulemd-packager format. This format enabled maintainers to specify their preferred stream defaults in the packager document and the result would be output that translated that into the modulemd-defaults format.
If we're going the route of entirely replacing the output format, then this is definitely a place we can improve upon. But please keep the default stream selection independent from the stream definition.
Thanks for explaining the reasons behind the YAML format design.
For now, I do not plan to change the input YAML format. The output XML is capable of delivering definitions of default streams independently of module build definitions:
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="1" revision="0">
  <module name="perl">
    <default-stream modified="202205040810" stream="5.34"/>
  </module>
</index>
When DNF sees a document like that, it will know that the default perl stream is 5.34. It does not imply that there is a perl module build available for installation or even for presentation in the "dnf module list" command.
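A consumer could read such stand-alone defaults roughly as in the following Python sketch. The set/unset rule follows the comment in the skeleton (an existing @stream sets a default, a missing or empty one unsets it); the helper name `default_streams` and the second module in the sample are hypothetical, and how documents with different @modified values are merged is deliberately left out.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://fedoraproject.org/metadata/moduleindex"}

# The perl entry mirrors the example above; the nodejs entry is made up.
DOC = """\
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="1" revision="0">
  <module name="perl">
    <default-stream modified="202205040810" stream="5.34"/>
  </module>
  <module name="nodejs">
    <default-stream modified="202205040810"/>
  </module>
</index>
"""


def default_streams(xml_text):
    """Return {module name: default stream, or None when the default is unset}."""
    root = ET.fromstring(xml_text)
    defaults = {}
    for module in root.findall("m:module", NS):
        element = module.find("m:default-stream", NS)
        if element is None:
            continue  # this document says nothing about the module's default
        # A missing or empty @stream unsets the default.
        defaults[module.get("name")] = element.get("stream") or None
    return defaults


print(default_streams(DOC))
```

Note that no <stream> or <build> element is needed for either module; the default is carried on the module itself.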
The default profile can be similarly encoded like that:
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="1" revision="0">
  <module name="perl">
    <stream name="5.26">
      <default-profile modified="202205040810">
        <profile>default</profile>
      </default-profile>
    </stream>
  </module>
</index>
If a module maintainer specified a default profile in modulemd-packager document, it would be encoded in XML like this:
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="1" revision="0">
  <module name="perl">
    <stream name="5.26">
      <build version="3720220503131308" ...>
        ...
      </build>
      <default-profile modified="20220503131308">
        <profile>default</profile>
      </default-profile>
    </stream>
  </module>
</index>
This is similar to how MBS is supposed to produce YAML output documents. (I write "supposed" because MBS does not implement this feature yet.)
Regarding glib: it was chosen entirely because DNF (at the time) was already using it, so it would theoretically simplify the consumption of libmodulemd. If DNF5 has moved away from glib, I don't see any reason why libmodulemd couldn't do the same. However, since DNF isn't the only consumer of libmodulemd, I'd very much like to see this new parser implementation made available as an external library with a public API that DNF5 consumes, rather than as an internal detail of DNF. While I understand (and even like) that XML gives you syntax validation capabilities, libmodulemd was also capable of recognizing logical errors (such as specifying a default profile that doesn't exist in the stream). A library that can provide such hand-holding would be very valuable to anyone who intends to consume the new format. The API can also guide the consumer to the data they care about, rather than forcing them to parse the XML directly.
Yes, the XML parser will be a library separate from libmodulemd. Regarding a validator, there will probably be a validating tool for the XML format. I'm aware that XML Schema is unable to capture all constraints and recommendations, and that for users it's easier to execute a dedicated tool than to invoke xmllint with a path to a schema buried somewhere deep in the file system.
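As an illustration of a logical check that goes beyond schema validation, the following Python sketch reports default profiles that no build of the same stream defines, which is the example of a logical error mentioned above. It is only a sketch of what such a tool could do, not the planned validator; the function name `undefined_default_profiles` and the sample document are hypothetical. Streams without any build in the document are skipped, since defaults may be delivered independently of builds.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://fedoraproject.org/metadata/moduleindex"}

# Hypothetical document: the build defines "minimal", the default names "default".
BAD = """\
<index xmlns="http://fedoraproject.org/metadata/moduleindex" version="1" revision="0">
  <module name="perl">
    <stream name="5.26">
      <build version="1" context="aaaa" arch="x86_64" summary="" description="">
        <profile name="minimal" description=""/>
      </build>
      <default-profile modified="1">
        <profile>default</profile>
      </default-profile>
    </stream>
  </module>
</index>
"""


def undefined_default_profiles(xml_text):
    """List (module, stream, profile) for default profiles no build of the stream defines."""
    root = ET.fromstring(xml_text)
    problems = []
    for module in root.findall("m:module", NS):
        for stream in module.findall("m:stream", NS):
            if stream.find("m:build", NS) is None:
                continue  # defaults delivered without builds cannot be checked here
            defined = {p.get("name") for p in stream.findall("m:build/m:profile", NS)}
            for default in stream.findall("m:default-profile/m:profile", NS):
                name = (default.text or "").strip()
                if name and name not in defined:
                    problems.append((module.get("name"), stream.get("name"), name))
    return problems


print(undefined_default_profiles(BAD))
```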
-- Petr