https://fedoraproject.org/wiki/Changes/RPMCoW
== Summary ==
RPM Copy on Write provides a better experience for Fedora users, as it reduces the amount of I/O and offsets the CPU cost of package decompression. RPM Copy on Write uses the reflinking capabilities in btrfs, which is the default filesystem in Fedora 33.
== Owners ==
* Name: [[User:malmond|Matthew Almond]], [[User:dcavalca|Davide Cavalca]]
* Email: malmond@fb.com, dcavalca@fb.com
== Detailed description ==
Installing and upgrading software packages is a standard part of managing the lifecycle of any operating system. For the entire lifecycle of Fedora, all software has been packaged and distributed using the RPM file format. This proposal changes how software is downloaded and installed, leaving the distribution process unmodified.
=== Current process ===
# Resolve packaging request into a list of packages and operations
# Download and verify new packages
# Install and/or upgrade packages sequentially using RPM files, decompressing, and writing a copy of the new files to storage.
=== New process ===
# Resolve packaging request into a list of packages and operations
# Download and '''decompress''' packages into a '''locally optimized''' rpm file
# Install and/or upgrade packages sequentially using RPM files, using '''reference linking''' (reflinking) to reuse data already on disk.
The outcome is intended to be the same, but the order of operations is different.
# Decompression happens inline with download. This has a positive effect on resource usage: downloads are typically limited by bandwidth, so decompressing and writing the full data into a single file per rpm is essentially free. Additionally, if there is more than one download at a time, a multi-CPU system can be better utilized. All compression types supported in RPM work because this uses the rpm I/O functions.
# RPMs are cached on local storage between download and installation time as normal. This allows DNF to defer actual RPM installation until all the RPMs are available. This is unchanged.
# The file format for RPMs is different with Copy on Write. The headers are identical, but the payload is different. There is also a footer.
## Files are converted (“transcoded”) locally during download using <code>/usr/bin/rpm2extents</code> (part of the rpm codebase). The format is not intended to be “portable” - i.e. copying the files from the cache is not supported.
## Regular RPMs use a compressed .cpio based payload. In contrast, extent based RPMs contain uncompressed data aligned to the fundamental page size of the architecture, e.g. 4KiB on x86_64. This alignment is required for <code>FICLONERANGE</code> to work. Only files are represented in the payload; other directory entries like symlinks, device nodes etc. are constructed entirely from rpm header information. Files are referenced by their digest, so identical files are de-duplicated.
## The footer currently has three sections:
### A table of original (rpm) file digests, used to validate the integrity of the download in dnf.
### A table of digest → offset entries, used when actually installing files.
### An 8-byte signature at the end of the file, used to differentiate between traditional and extent based RPMs (a detection sketch follows this list).
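To make the trailing signature concrete, here is a minimal sketch in C of how a consumer could distinguish a transcoded file. The magic value and function name are illustrative assumptions; the actual signature bytes are defined by the rpm2extents patches.
<pre>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Placeholder magic -- the real 8-byte signature comes from rpm2extents. */
static const unsigned char EXTENTS_MAGIC[8] = "EXTSRPM0";

/* Returns 1 for an extent based rpm, 0 for a traditional one, -1 on error. */
int is_extent_rpm(const char *path)
{
    unsigned char tail[8];
    int ret = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    /* The signature occupies the last 8 bytes of the file. */
    if (lseek(fd, -8, SEEK_END) >= 0 && read(fd, tail, 8) == 8)
        ret = (memcmp(tail, EXTENTS_MAGIC, 8) == 0);
    close(fd);
    return ret;
}
</pre>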
=== Notes ===
# The headers are preserved bit for bit during transcoding. This preserves signatures. The signatures cover the main header blob, and the main header blob ensures the integrity of data in two ways:
## Each file with content has a digest. Originally this was md5, but today it’s usually sha256. In normal RPM this is only used to verify the integrity of files, e.g. <code>rpm -V</code>. With CoW we use this as a content key.
## There are one or two digests (<code>PAYLOADDIGEST</code> and <code>PAYLOADDIGESTALT</code>) covering the payload archive (compressed cpio). The header value is preserved, but transcoded RPMs do not preserve the original payload structure, so RPM’s pre-installation verification (controlled by <code>%_pkgverify_level</code>) will fail. <code>dnf-plugin-cow</code> disables this check in dnf because dnf verifies the whole-file digest, which is captured during download/transcoding. The second digest is likely used for delta rpm.
# This is untested, and possibly incompatible with delta RPM (drpm). The process for reconstructing an rpm to install from a delta is expensive from both a CPU and I/O perspective, while only providing marginal benefits on download size. It is expected that having delta rpm enabled (which is the default) will be handled gracefully.
# Disk space requirements are expected to be marginally higher than before: all new packages or updates will consume their installed size before installation instead of about half their size (regular rpms with payloads still cost space).
# <code>rpm-plugin-reflink</code> will fall back to simple file copying when the destination path is not on the same filesystem/subvolume. A common example is <code>/boot</code> and/or <code>/boot/efi</code>. A sketch of this fallback follows the list below.
# The system will still work on other filesystem types, but will ''always'' fall back to simple copying. This is expected to be slightly slower than not enabling CoW because the source for copying will be the decompressed data.
# For systems that enable transparent filesystem compression: every file will continue to be decompressed from the original rpm, and then transparently re-compressed by the filesystem. There is no effective change here. There is a future project to investigate alternate distribution mechanics to provide parallel versions of file content pre-compressed in a filesystem specific format, reducing both CPU costs and I/O. It is expected that this will result in slightly higher network utilization because filesystem compression is purposely restricted to allow random I/O.
# The current implementation of <code>dnf-plugin-cow</code> is in Python, but it looks possible to implement this in <code>libdnf</code> instead, which would make it work in <code>packagekit</code>.
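A minimal sketch in C of that reflink-or-copy fallback, assuming a page-aligned source range inside the transcoded rpm. The function name and error handling are illustrative assumptions, not the actual <code>rpm-plugin-reflink</code> code.
<pre>
#include <errno.h>
#include <linux/fs.h>       /* FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Clone one file's extent out of the transcoded rpm into dest_fd,
 * degrading to a plain copy when reflinks are unavailable. */
int place_file(int src_fd, off_t src_off, off_t len, int dest_fd)
{
    struct file_clone_range range = {
        .src_fd      = src_fd,
        .src_offset  = src_off,   /* page aligned, hence the payload format */
        .src_length  = len,
        .dest_offset = 0,
    };

    if (ioctl(dest_fd, FICLONERANGE, &range) == 0)
        return 0;                 /* data is now shared, copy-on-write */
    if (errno != EXDEV && errno != EOPNOTSUPP && errno != EINVAL)
        return -1;
    /* e.g. /boot on another filesystem: fall back to simple copying */
    while (len > 0) {
        ssize_t n = sendfile(dest_fd, src_fd, &src_off, (size_t)len);
        if (n <= 0)
            return -1;
        len -= n;
    }
    return 0;
}
</pre>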
=== Performance Metrics ===
The ballpark performance difference is about half the duration for file download+install time. A lot of rpms are very small, so the difference is difficult to see or measure; larger RPMs give a much clearer signal.
(Actual numbers/charts will be supplied in Jan 2021)
=== Terminology ===
* '''Copy on Write (CoW)''' is a broad description of any technology that reduces or eliminates data duplication by sharing the data behind the scenes until one of the references makes changes. This has been a cornerstone technology in memory management in Unix systems. Here we use it specifically to reference Copy on Write as supported in modern filesystems, e.g. btrfs, xfs and potentially others.
* '''Reflink''' is the verb for duplicating stored data on a filesystem. See [https://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html ioctl_ficlonerange(2)] for the specific call we use on Linux.
* '''Extent''' (based RPMs) refers to how payload file data is stored within an RPM. Normal RPMs simply contain a compressed CPIO archive. Extent based RPMs contain the raw data uncompressed, which can be referenced with reflink.
== Benefit to Fedora ==
Faster package installs and upgrades
== Scope ==
* Proposal owners:
** Merge changes to rpm, librepo to enable capabilities
** Add dnf-plugin-cow to available packages
** Test days
** Aid with documentation
* Other developers:
** rpm, librepo: review PRs as needed
* Release engineering: https://pagure.io/releng/issue/9914
* Policies and guidelines: N/A
* Trademark approval: N/A
== Upgrade/compatibility impact ==
None; RPM with CoW is not enabled by default.
Upgrades with <code>keepcache</code> set in <code>dnf.conf</code> will be able to use existing cached packages, but those will not be converted: transcoding only happens at download time.
If a system is configured to keep packages in the cache (<code>keepcache</code> in <code>dnf.conf</code>) and <code>dnf-plugin-cow</code> is later removed, the transcoded packages will be unusable. Running <code>dnf clean packages</code> resolves this.
== How to test ==
Enable RPM with CoW with:
<pre>
$ sudo dnf install dnf-plugin-cow
...
$ sudo dnf install hello
...
$ hello
Hello, world!
</pre>
There should be no end user visible changes, except timing.
== User experience ==
No user-visible changes are anticipated from this change proposal. It makes the feature available, but does not enable it by default.
== Dependencies ==
# A copy-on-write filesystem; this Change is primarily targeting btrfs, but RPM with CoW should work with XFS as well (untested)
# Most package install paths and the dnf package cache on the same filesystem / subvolume
# <code>rpm</code> with Copy on Write patch set: https://github.com/malmond77/rpm/tree/cow
# <code>librepo</code> with transcoding support: https://github.com/malmond77/librepo/tree/transcode_cow
# dnf-plugin-reflink (a new package): https://github.com/facebookincubator/dnf-plugin-cow/
== Contingency plan ==
* Contingency mechanism: will not include PR patches if not merged upstream; skip <code>dnf-plugin-cow</code>
* Contingency deadline: Final freeze
* Blocks release? No
* Blocks product? No
== Documentation ==
Documentation will be available at https://github.com/facebookincubator/dnf-plugin-cow in the coming weeks.
== Release Notes ==
RPM with CoW is not enabled by default. To enable it:
<pre>$ sudo dnf install dnf-plugin-cow</pre>
On Mon, Dec 21, 2020 at 11:29 AM Ben Cotton bcotton@redhat.com wrote:
This is very exciting! There is one thing, though: we need a libdnf plugin for PackageKit to use too. "DNF plugins" are at the Python layer, and libdnf has its own plugin system that C/C++ consumers can use. So if both a libdnf and a dnf plugin exist, then the experience is consistent between PK and DNF.
But that leads to my other question: why not just integrate this into libdnf and turn it into an option that can be activated in /etc/dnf/dnf.conf? That seems to be the most straightforward way to do this.
On Mon, Dec 21, 2020 at 11:39 am, Neal Gompa ngompa13@gmail.com wrote:
From the change proposal:
# The current implementation of <code>dnf-plugin-cow</code> is in Python, but it looks possible to implement this in <code>libdnf</code> instead, which would make it work in <code>packagekit</code>
On Mon, Dec 21, 2020 at 11:58 AM Michael Catanzaro mcatanzaro@gnome.org wrote:
Gah! I missed that. :)
On 21. 12. 2020 at 17:39, Neal Gompa wrote:
On Mon, Dec 21, 2020 at 11:29 AM Ben Cotton bcotton@redhat.com wrote:
## Files are converted (“transcoded”) locally during download using <code>/usr/bin/rpm2extents</code> (part of rpm codebase). The format
I cannot find it anywhere in the rpm codebase.
# Disk space requirements are expected to be marginally higher than before: all new packages or updates will consume their installed size before installation instead of about half their size (regular rpms with payloads still cost space).
The size is already an issue (for me) on small cloud images. But I do not use BTRFS there, so in the end I do not care :)
The ballpark performance difference is about half the duration for file download+install time. A lot of rpms are very small, so the difference is difficult to see or measure; larger RPMs give a much clearer signal.
Hmm, I personally see much better performance (and storage) improvements in enabling %_minimize_writes; however, there is still https://bugzilla.redhat.com/show_bug.cgi?id=1872141 to be resolved before this gets enabled by default.
I cannot find it anywhere in the rpm codebase.
The current status section of the proposal describes this as pending two PRs, and they're enumerated in the dependencies list. Most of the code is in https://github.com/malmond77/rpm/tree/cow and enabled through work in https://github.com/malmond77/librepo/tree/transcode_cow
Hmm, I personally see much better performance (and storage) improvements in enabling %_minimize_writes; however, there is still https://bugzilla.redhat.com/show_bug.cgi?id=1872141 to be resolved before this gets enabled by default.
I'm curious about this so I'll look at it, but at first glance it seems tangential to this proposal.
Thanks, Matthew.
On 12/21/20 12:28 PM, Ben Cotton wrote:
...
=== New process ===
# Resolve packaging request into a list of packages and operations
# Download and '''decompress''' packages into a '''locally optimized''' rpm file
# Install and/or upgrade packages sequentially using RPM files, using '''reference linking''' (reflinking) to reuse data already on disk.
This sounds great because free space requirements can be reduced, especially when installing new packages.
I have experimented with building very small appliances using btrfs compression on things like /usr/share. I think this could disrupt that, because if I am correct the extents will first be downloaded to a temporary directory without compression enabled.
I would be happy with an option to disable this behavior.
On Mon, 2020-12-21 at 12:54 -0400, Robert Marcano via devel wrote:
This sounds great because free space requirements can be reduced, especially when installing new packages.
I have experimented with building very small appliances using btrfs compression on things like /usr/share. I think this could disrupt that, because if I am correct the extents will first be downloaded to a temporary directory without compression enabled.
For CoW to be beneficial, the package cache should be on the same filesystem used for the bulk of the system. In this scenario, compression should work just fine, as long as it's enabled on the appropriate subvolumes.
I would be happy with an option to disable this behavior.
To be clear, for this Change we do not plan to enable CoW by default. It would be a user opt-in via the dnf-plugin-cow package.
Cheers Davide
On Mon, Dec 21, 2020, 8:19 PM Davide Cavalca via devel <devel@lists.fedoraproject.org> wrote:
For CoW to be beneficial, the package cache should be on the same filesystem used for the bulk of the system. In this scenario, compression should work just fine, as long as it's enabled on the appropriate subvolumes.
On btrfs there is a per-file compression flag, so you can set compression on a directory without having compression on the DNF cache directory on the same volume.
To be clear, for this Change we do not plan to enable CoW by default. It would be a user opt-in via the dnf-plugin-cow package.
Good, thanks
=== New process ===
# Resolve packaging request into a list of packages and operations
# Download and '''decompress''' packages into a '''locally optimized''' rpm file
# Install and/or upgrade packages sequentially using RPM files, using '''reference linking''' (reflinking) to reuse data already on disk.
This sounds great because free space requirements can be reduced, especially when installing new packages.
I need to re-word this: the "reuse" of data is between the locally downloaded rpm and the installed destination. I do have a plan to investigate making rpm2extents enumerate the dnf/rpm cache (if you enable it) and reflink any shared data between rpms, saving writes.
Today this proposal explains that disk space requirements during updates are expected to be higher. See https://fedoraproject.org/wiki/Changes/RPMCoW#Notes item 3.
I have experimented with building very small appliances using btrfs compression on things like /usr/share. I think this could disrupt that, because if I am correct the extents will first be downloaded to a temporary directory without compression enabled.
There is also some confusion between compressed data in the rpm and the transcoded one, and filesystem level compression. This proposal affects the former, but not the latter. I'd caution against using btrfs specific attributes to disable compression on the dnf/rpm cache directory tree, because then the extents written/shared to the installed file locations will also not be compressed. (This is my interpretation of what I expect to see with the FICLONERANGE ioctl etc.: it'd be slower if it honored filesystem level compression because it'd need to re-write the data.)
I would be happy with an option to disable this behavior.
I'm unclear on which behavior you're referring to. This proposal adds support for Copy on Write in Fedora, but does not make it the default at this time.
Thanks, Matthew.
On Tue, Dec 22, 2020 at 12:58 PM Matthew Almond via devel devel@lists.fedoraproject.org wrote:
There is also some confusion between compressed data in the rpm and the transcoded one, and filesystem level compression. This proposal affects the former, but not the latter. I'd caution against using btrfs specific attributes to disable compression on the dnf/rpm cache directory tree, because then the extents written/shared to the installed file locations will also not be compressed. (This is my interpretation of what I expect to see with the FICLONERANGE ioctl etc.: it'd be slower if it honored filesystem level compression because it'd need to re-write the data.)
It shouldn't need to rewrite the data. The ficlonerange offset and length are based on the Btrfs logical address space, and this is uncompressed. That behind the scenes it happens to be compressed is a sort of "last mile" detail, similar to where the file is actually located. A Btrfs logical address for a file suggests there is exactly one copy of the file and one copy of its metadata, but via chunk tree lookup it may be that this file has two copies (raid1), or it may be located on any one of a number of devices. Yet ficlonerange still works as expected regardless of those details.
On Mon, Dec 21, 2020 at 11:28:51AM -0500, Ben Cotton wrote:
# dnf-plugin-reflink (a new package): https://github.com/facebookincubator/dnf-plugin-cow/
It does not exist, but I've just noticed it mentioned in Current Status on the wiki: 3.2 GitHub repo needs to be published
On Mon, 2020-12-21 at 18:00 +0100, Tomasz Torcz wrote:
Yeah, apologies for that, we wanted to get the Change proposal out asap to start the discussion and gather feedback, but a few of the pieces are still in the works. Specifically, the repo is currently pending internal review and should be out soon.
Cheers Davide
On Mon, Dec 21, 2020, at 11:28 AM, Ben Cotton wrote:
== Summary ==
RPM Copy on Write provides a better experience for Fedora users, as it reduces the amount of I/O and offsets the CPU cost of package decompression. RPM Copy on Write uses the reflinking capabilities in btrfs, which is the default filesystem in Fedora 33.
A bunch of points here:
- No, it's the default for one Edition. Others don't default to it. And even for Workstation we can't *require* it because it's definitely supported to use other filesystems and storage layouts.
- Orthogonal to this, I'd also note that xfs supports reflinks too.
Combining those I'd say instead e.g.: "Most Fedora Editions default to a filesystem that supports reflinks, e.g. btrfs or xfs" (actually I think IoT defaults to ext4 for...probably they didn't consider it?)
- When talking about RPMs we need to think about container images, which use overlayfs by default, which defers to the underlying filesystem for reflinks - so should be fine, but should be explicitly written down (and tested)
- Generally incompatible RPM payload changes cause pain proportional to how far they're "not backported", e.g. if support for this isn't in Fedora N-1 (e.g. Fedora 32) it will be harder for current Koji/mock model. Nowadays many more people use podman than mock, which e.g. if using a RHEL8 host will naturally avoid the dependency on an updated RPM. But
# Decompression happens inline with download.
rpm-ostree does this by default today BTW (rpms are unpacked into local ostree commits in parallel even).
## Regular RPMs use a compressed .cpio based payload. In contrast, extent based RPMs contain uncompressed data aligned to the fundamental page size of the architecture, e.g. 4KiB on x86_64. This alignment is required for <code>FICLONERANGE</code> to work. Only files are represented in the payload; other directory entries like symlinks, device nodes etc. are constructed entirely from rpm header information.
This is the core change; some interesting tradeoffs here. Python projects in particular ship a lot of files smaller than 4k (the classic example is `__init__.py`, which is zero sized). And ppc64le uses 64KiB pages, right? So there will be "zero space" to align, right? Would need some math to see how much this would add up to, although I guess the implementation could instead use holes?
Files are referenced by their digest, so identical files are de-duplicated.
But just inside a single RPM, right? It's interesting to compare with ostree which does this by default; conceptually this is using reflinks inside a single RPM to do what ostree does system wide with hardlinks.
BTW we learned a few things, notably zero sized files are tricky because there can be a *lot* of them - see e.g. https://github.com/ostreedev/ostree/pull/2197 That one was too many hardlinks, but how well do filesystems like btrfs/xfs handle thousands of reflinks instead? The Python __init__.py thing is such a pathological case...
# Disk space requirements are expected to be marginally higher than before: all new packages or updates will consume their installed size before installation instead of about half their size (regular rpms with payloads still cost space).
This won't matter much for small updates but could be quite noticeable for larger system upgrades.
This all said, the more I think about this, wouldn't it be way simpler to change rpm to support a "temporary root directory", e.g. `/usr/.rpmtemp` or whatever. Then dnf/zypper/etc can do the unpack-and-download model without any format changes to RPM - instead of reflinking it'd just be rename() into place. This is effectively what rpm-ostree is doing today, except with ostree commits instead of a temporary directory.
On Mon, Dec 21, 2020 at 12:49 PM Colin Walters walters@verbum.org wrote:
No, it's the default for one Edition. Others don't default to it. And even for Workstation we can't *require* it because it's definitely supported to use other filesystems and storage layouts.
Orthogonal to this, I'd also note that xfs supports reflinks too.
Combining those I'd say instead e.g.: "Most Fedora Editions default to a filesystem that supports reflinks, e.g. btrfs or xfs" (actually I think IoT defaults to ext4 for...probably they didn't consider it?)
It'd be more accurate to say most Fedora variants default to Btrfs. The only exceptions right now are Cloud, Server, and CoreOS. But yes, Fedora Server's current default of XFS on LVM means it also supports reflinks.
As an aside, I *really* hate this split of terminology we have among Editions, Spins, and Labs. It's confusing to everyone. :(
When talking about RPMs we need to think about container images, which use overlayfs by default, which defers to the underlying filesystem for reflinks - so should be fine, but should be explicitly written down (and tested)
Generally incompatible RPM payload changes cause pain proportional to how far they're "not backported", e.g. if support for this isn't in Fedora N-1 (e.g. Fedora 32) it will be harder for current Koji/mock model. Nowadays many more people use podman than mock, which e.g. if using a RHEL8 host will naturally avoid the dependency on an updated RPM. But
Incomplete statement here?
That said, we don't have a problem in the Koji/Mock model anymore, as bootstrap mode is now activated. Additionally, Mock uses systemd-nspawn by default for all cases except Koji (which overrides this because it can't handle nspawn mode at the moment).
This all said, the more I think about this, wouldn't it be way simpler to change rpm to support a "temporary root directory", e.g. `/usr/.rpmtemp` or whatever. Then dnf/zypper/etc can do the unpack-and-download model without any format changes to RPM - instead of reflinking it'd just be rename() into place. This is effectively what rpm-ostree is doing today, except with ostree commits instead of a temporary directory.
Sure, this makes some degree of sense, but it doesn't reduce the IOPS for actually *doing* the installation. My understanding is that this Change is intended to reduce the thrashing when doing package transactions.
This is also a flaw with RPM-OSTree, since you have to fetch everything individually and construct the root by shifting hardlinks or reflinks around.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Mon, Dec 21, 2020 at 01:07:42PM -0500, Neal Gompa wrote:
As an aside, I *really* hate this split of terminology we have among Editions, Spins, and Labs. It's confusing to everyone. :(
The website hasn't been changed, but officially all of these are Fedora Solutions, with only Editions being a special case. Other outputs can call themselves Spin, Lab, Image, or whatever, as they like for their own marketing.
https://docs.fedoraproject.org/en-US/council/policy/guiding-policy/#_what_do...
On Mon, Dec 21, 2020, at 1:07 PM, Neal Gompa wrote:
Sure, this makes some degree of sense, but it doesn't reduce the IOPS for actually *doing* the installation.
Yes it does. It avoids writing the compressed data and then copying it back out uncompressed, which is the same amount of savings as the reflink approach.
(It's also equally incompatible with deltarpm)
This is also a flaw with RPM-OSTree, since you have to fetch everything individually
No - static deltas exist, plus layered RPMs work on the wire the same. But this isn't really relevant here.
and construct the root by shifting hardlinks or reflinks around.
Adding a hardlink indeed requires updating inodes proportional to the number of files, but that's more an implementation of the transactional update approach, not of the "download and unpack in parallel" part which is more what we're discussing here. (Though they are entangled a bit)
Anyways, I'd still stand by my summary that the much lower tech "files in temporary directory that get rename()d" approach would be all of *more* efficient on disk, simpler to implement and much less disruptive than an RPM format change. (The main cost would be a new temporary directory path that would need cleanup as part of e.g. `yum clean` etc.)
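A minimal sketch in C of this alternative, using the hypothetical `/usr/.rpmtemp` staging root named above; the function and path handling are illustrative, not an actual rpm or dnf API.
<pre>
#include <stdio.h>

/* Stage-then-activate: the file was already unpacked under the staging
 * root at download time, so installation is a metadata-only rename(). */
int activate_file(const char *relpath)
{
    char staged[4096], dest[4096];

    snprintf(staged, sizeof(staged), "/usr/.rpmtemp/%s", relpath);
    snprintf(dest, sizeof(dest), "/%s", relpath);
    /* rename() is atomic and rewrites no file data, but it only works
     * within a single filesystem -- the same constraint reflinking has. */
    return rename(staged, dest);
}
</pre>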
I'm replying to a bunch of topics in the same thread (via the web ui because I wasn't subscribed to the mailing list until today, yikes)
On editions: I wrote fedora-workstation because that's the one that has btrfs as root by default.
Zero byte files: I think reflinking is specifically fine here because reflinking is about contents, not inodes. A zero byte reflink should be a no-op at the filesystem level (but I should check; if it's not, I can special-case it easily enough). The process of installing files based on reflinks involves actually opening new files, then reflinking content.
On small files and alignment/waste: I believe most mutable filesystems do "waste some space". I call it out here because it's explicitly in the file format, the same as in .tar (without compression), and it's because FICLONERANGE and the filesystems demand it. I account for it as (number of files) x (native block size) / 2, i.e. I assume 50% usage of the tail block of every file. For example, a payload set of 100,000 files on 4KiB blocks would waste roughly 100,000 x 2KiB ≈ 200MiB. The block size of ppc64 is unfortunate, but I expect the same level of waste happens whether you're using reflinking or not.
Talking about the topic more broadly:
The hardlinking approach in rpm-ostree depends on either a completely read-only system, or the use of a layered filesystem like overlayfs. I think it's a completely valid approach, and to my understanding, is the technology that underpins Fedora CoreOS and Project Atomic. These are different distro builds and have specific use cases in mind. As I understand it, they also have very different management policies: they are intended to be managed in a specific way, and updates seem to require a reboot.
My hope for CoW for RPM is to bring a similar set of capabilities and benefits to Fedora, and eventually CentOS and RHEL, without requiring any changes to how the system works or is managed. The new requirements are fairly simple: one filesystem for the rootfs and dnf cache, and that this filesystem supports reflinking.
Today data deduplication is within a given rpm. Looking forwards, I would like to extend the rpm2extents processor to read and re-use other blocks from the dnf/rpm cache and then we get full system level de-duplication.
I am really grateful for all this feedback; hopefully what I write makes sense. - Matthew
On Tue, Dec 22, 2020 at 09:41:35PM -0000, Matthew Almond via devel wrote:
On Mon, Dec 21, 2020, at 1:07 PM, Neal Gompa wrote:
Yes it does. It avoids writing the compressed data and then copying it back out uncompressed, which is the same amount of savings as the reflink approach.
(It's also equally incompatible with deltarpm)
This part doesn't seem to have been answered...
I'll restate what Colin said (please chime in if I misunderstand the proposal):
During the download, packages are unpacked into a temporary root (/usr/.rpmtemp...), and the rpm headers are stored to disk in the normal download location. During the installation, files are rename()d from this temporary location to the final destination.
I fail to see why this would be significantly better... The logic to handle the split rpm contents would seem to be more complicated than the rewrite with /usr/bin/rpm2extents. Other comments?
Zbyszek
On Sat, Jan 2, 2021, at 10:03 AM, Zbigniew Jędrzejewski-Szmek wrote:
I fail to see why this would be significantly better...
I don't claim that the "separate temporary directory of unpacked content" is *better* - just that it's as easy to implement *and* doesn't require an RPM format change (with all the consequent pain) or support for reflinks from the underlying filesystem.
The logic to handle the split rpm contents would seem to be more complicated than the rewrite with /usr/bin/rpm2extents. Other comments?
Hard to really say for sure I guess without trying to write both. Probably the biggest impediment is that changes like that would end up needing to be split across the librpm + zypper/rpm-ostree/dnf tools. It wasn't an accident really that for rpm-ostree /usr/bin/rpm is read-only - we effectively squash those layers together and can thus make deep changes as a single unit.
Anyways, none of this really *requires* reflinks in any way, and so calling the Change "RPMCoW" is misleading from that perspective. "DnfParallelUnpack" would probably be a better title, with a dependency on "RPMFormatCowReady" or something. And then my point is that one could do "DnfParallelUnpack" without changing the RPM format and without much more complexity, if any.
On Sun, 2021-01-03 at 16:16 -0500, Colin Walters wrote:
Anyways, none of this really *requires* reflinks in any way, and so calling the Change "RPMCoW" is misleading from that perspective. "DnfParallelUnpack" would probably be a better title, with a dependency on "RPMFormatCowReady" or something. And then my point is that one could do "DnfParallelUnpack" without changing the RPM format and without much more complexity, if any.
Early on in this project I looked at creating all the files during download in a temporary directory. It would work, and it is more filesystem type agnostic. If moving the decompression to an earlier step were the sole goal, it would be reasonable.
The goal of RPMCoW is to write once, and re-use data multiple times. This comes up in a number of circumstances for this proposal:
1. Reflinking allows for de-duplication of file content. Today this is only within a single RPM. I am looking at changing rpm2extents to reuse data across (cached) rpms to achieve something kind of like delta rpm. That is: if you already have file X, you don't write it, you clone it from any other rpm.
2. Reflinking allows sharing of file contents, without side effects from the installed copy. Each copy is a real, distinct file, and can be deleted and/or modified. Only the differences cost something, and 99% of rpm files don't get modified. The net result is that the rpm cache costs very little.
3. If you can keep an rpm cache, you can reuse the data very quickly, either to build a new rootfs in a subdir/subvolume with the same or different packages, or to use those files for containers. This sounds similar to using snapshots, but with snapshots you're operating on a filesystem at a time, and you can only go backwards. Here you can decide what you want, and you get maximum reuse automatically.
By contrast "DnfParallelUnpack" by itself, without CoW, is less useful because you will need to re-fetch and re-decompress data.
Lastly, I'd like to emphasize that I'm not trying to change the "normal rpm format". Doing so would orphan every previously built and signed rpm, and would present a serious backward compatibility problem. I aim only to change how rpms are downloaded, stored locally in the cache, and consumed by rpm itself, within the confines of hosts that (can) enable this.
- Matthew
On Mon, 2020-12-21 at 12:48 -0500, Colin Walters wrote:
On Mon, Dec 21, 2020, at 11:28 AM, Ben Cotton wrote:
== Summary ==
RPM Copy on Write provides a better experience for Fedora Users as it reduces the amount of I/O and offsets CPU cost of package decompression. RPM Copy on Write uses reflinking capabilities in btrfs, which is the default filesystem in Fedora 33.
A bunch of points here:
- No, it's the default for one Edition. Others don't default to it. And even for Workstation we can't *require* it because it's definitely supported to use other filesystems and storage layouts.
- Orthogonal to this, I'd also note that xfs supports reflinks too.
Combining those I'd say instead e.g.: "Most Fedora Editions default to a filesystem that supports reflinks, e.g. btrfs or xfs" (actually I think IoT defaults to ext4 for...probably they didn't consider it?)
Thanks for surfacing this, we'll make the language clearer. About XFS: it should work, but we haven't tested it extensively, and this work has been developed primarily with btrfs in mind.
- When talking about RPMs we need to think about container images, which use overlayfs by default, which defers to the underlying filesystem for reflinks - so it should be fine, but it should be explicitly written down (and tested)
If reflinking isn't possible (which can also happen if e.g. the package cache and the system are on different filesystems) things work as normal, albeit with a performance penalty (because more I/O is required to install the package).
I'll let Matthew weigh in on the other points you raised. Thanks for the feedback!
Cheers Davide
On Mon, Dec 21, 2020 at 10:49 AM Colin Walters walters@verbum.org wrote:
On Mon, Dec 21, 2020, at 11:28 AM, Ben Cotton wrote:
## Regular RPMs use a compressed .cpio based payload. In contrast, extent based RPMs contain uncompressed data aligned to the fundamental page size of the architecture, e.g. 4KiB on x86_64. This alignment is required for <code>FICLONERANGE</code> to work. Only files are represented in the payload, other directory entries like symlinks, device nodes etc are constructed entirely from rpm header information.
This is the core change; some interesting tradeoffs here. Python projects in particular ship a lot of files smaller than 4k (the classic example is `__init__.py`, which is zero-sized). And ppc64le uses 64KiB pages, right? So there will be "zero space" padding to align, right? Would need some math to see how much this would add up to, although I guess the implementation could instead use holes?
I'm not sure about XFS or ext4 zero length file handling.
On Btrfs, it's a few hundred bytes. The file has no EXTENT_DATA item, therefore it's the same whether you write a new zero length file or reflink copy it.
Files bigger than 0 bytes but less than 2KiB will tend to result in inline extents, i.e. EXTENT_DATA item contains the data in the same metadata leaf as the inode rather than referencing some 4KiB data block elsewhere.
Hardlinks take around 100 bytes, so they are slightly more efficient space-wise. But they can't have separate selinux labels, acls, or permissions, can't be located in different subvolumes, and are limited to 65536 hardlinks per file. Reflinks don't have those limitations.
Files are referenced by their digest, so identical files are de-duplicated.
But just inside a single RPM, right? It's interesting to compare with ostree which does this by default; conceptually this is using reflinks inside a single RPM to do what ostree does system wide with hardlinks.
BTW we learned a few things, notably zero sized files are tricky because there can be a *lot* of them - see e.g. https://github.com/ostreedev/ostree/pull/2197 That one was too many hardlinks, but how well do filesystems like btrfs/xfs handle thousands of reflinks instead? The Python __init__.py thing is such a pathological case...
Thousands aren't a problem, nor are tens of thousands. A reflink is a normal file that just so happens to have extents shared with another file. It's the shared extent part that makes them sorta special, but there's nothing in the structure of the file that says it's a reflink. Whereas for a symlink or hard link, there is.
Shared extents are also produced by snapshots and dedup. It's the same on-disk manifestation in all three cases. And at least on Btrfs there are examples of millions of shared extents. But the workload will dictate the extent layout, to what degree extents are shared, become unshared, result in COW for modifications, and how much file and free space fragmentation ensues. Those can be much bigger issues than the number of reflinks.
-- Chris Murphy
Cool. A few questions inline...
On Mon, Dec 21, 2020 at 11:28:51AM -0500, Ben Cotton wrote:
https://fedoraproject.org/wiki/Changes/RPMCoW
== Summary ==
RPM Copy on Write provides a better experience for Fedora Users as it reduces the amount of I/O and offsets CPU cost of package decompression. RPM Copy on Write uses reflinking capabilities in btrfs, which is the default filesystem in Fedora 33.
What happens if you enable this on non btrfs installs? Does it just not work gracefully? Does it fail somehow? I think we need to be sure it doesn't actually do anything bad for other non CoW filesystems.
...snip...
### Signature 8 bytes at the end of the file, used to differentiate between traditional RPMs and extent based.
So, there's no change to rpm building or signing, as all that's done in transcoding them on download?
=== Notes ===
# The headers are preserved bit for bit during transcoding. This preserves signatures. The signatures cover the main header blob, and the main header blob ensures the integrity of data in two ways: ## Each file with content has a digest. Originally this was md5, but today it’s usually sha256. In normal RPM this is only used to verify the integrity of files, e.g. <code>rpm -V</code>. With CoW we use this as a content key. ## There is/are one or two digests (<code>PAYLOADDIGEST</code> and <code>PAYLOADDIGESTALT</code>) covering the payload archive (compressed cpio). The header value is preserved, but transcoded RPMs do not preserve the original structure so RPM’s pre-installation verification (controlled by <code>%_pkgverify_level</code>) will fail. <code>dnf-plugin-cow</code> disables this check in dnf because it verifies the whole file digest which is captured during
Could rpm learn about this and still do its verify in this case?
download/transcoding. The second one is likely used for delta rpm. # This is untested, and possibly incompatible with delta RPM (drpm). The process for reconstructing an rpm to install from a delta is expensive from both a CPU and I/O perspective, while only providing marginal benefits on download size. It is expected that having delta rpm enabled (which is the default) will be handled gracefully.
I imagine drpms could still be used: once you have constructed the final rpm, you transcode it as if you had just downloaded it?
But in general perhaps we should decide how much value drpms provide these days and either make sure we are making more of them, or drop them.
...snip...
=== Performance Metrics ===
Ballpark performance difference is about half the duration for file download+install time. A lot of rpms are very small, so it’s difficult to see/measure. Larger RPMs give a much clearer signal.
(Actual numbers/charts will be supplied in Jan 2021)
Nice!
kevin
Am 21.12.20 um 18:53 schrieb Kevin Fenzi:
But in general perhaps we should decide how much value drpms provide these days and either make sure we are making more of them, or drop them.
delta rpms save so much time in the form of bandwidth on the client side.
If something really needs to change, it is the 50+ MB repo database that gets downloaded. It takes ages on slow connections to download, and then you want to increase the size of the rpms too... That doesn't sound like a good idea.
best regards, Marius Schwarz
On Mon, Dec 21, 2020 at 1:14 PM Marius Schwarz fedoradev@cloud-foo.de wrote:
Am 21.12.20 um 18:53 schrieb Kevin Fenzi:
But in general perhaps we should decide how much value drpms provide these days and either make sure we are making more of them, or drop them.
delta rpms save so much time in the form of bandwidth on the client side.
If something really needs to change, it is the 50+ MB repo database that gets downloaded. It takes ages on slow connections to download, and then you want to increase the size of the rpms too... That doesn't sound like a good idea.
You should be getting delta fetching of repository metadata with zchunk metadata, which we've had enabled since Fedora 30: https://fedoraproject.org/wiki/Changes/Zchunk_Metadata
Is this not working for you or something?
Neal Gompa writes:
On Mon, Dec 21, 2020 at 1:14 PM Marius Schwarz fedoradev@cloud-foo.de wrote:
If something really needs to change, it is the 50+ MB repo database that gets downloaded. It takes ages on slow connections to download, and then you want to increase the size of the rpms too... That doesn't sound like a good idea.
You should be getting delta fetching of repository metadata with zchunk metadata, which we've had enabled since Fedora 30: https://fedoraproject.org/wiki/Changes/Zchunk_Metadata
Is this not working for you or something?
Well, I don't know what's working for me, or not working for me. All I know is that:
1) I'm rsyncing the updates repo to my LAN, and other machines in my LAN have the default updates repo disabled and a replacement repo pointing at my local copy.
2) Even on the LAN, an update downloads something from the local repo, giving me a zippy progress indication of the download. After the download it sits for a noticeable amount of time before it decides exactly what it's going to update and then gives me the list. This is especially noticeable for a Fedora guest that I'm running in a VM that's emulating an aarch64 platform. In the emulated aarch64 VM, downloading of RPMs goes a bit slow, but the subsequent pause after download is quite noticeable.
Except for rsyncing a mirror of the updates repo locally and then pointing everyone to my local mirror, I am not doing any other customization and that's the behavior I've seen.
Having said all that, I don't find the update process to be that much of a pain point right now, or in any dire need of improvement. It works. It is fairly reliable. A bit slow, but who cares. The important thing is that, except for a burst of segfaults downloading rpms earlier this year (haven't had any in a while), it's been rock stable and hiccups are very, very rare. I don't exactly see the big value-add from the described feature enhancement; I'd only want to make sure it's just as stable.
On Mon, Dec 21, 2020 at 07:14:08PM +0100, Marius Schwarz wrote:
delta rpms save so much time in the form of bandwidth on the client side. If something really needs to change, it is the 50+ MB repo database that gets downloaded. It takes ages on slow connections to download
This needs a followup. I didn't push on it because the DNF team was super-busy with modularity, but if someone wants to pick this up, it'd be a significant improvement:
https://pagure.io/packaging-committee/issue/714
In short, 95% of the dependency data is full filename paths. That's not hyperbole. It's literally 95% by count. Actually probably even more by _space_ since they tend to be long.
Only a tiny fraction of packages use these at all, and almost all of the packages using file deps outside of /usr/bin, /usr/sbin, or /etc could use something else — and of the few using something else, many are actually doing so only in error.
It remains convenient to be able to do
dnf install /usr/share/fonts/jetbrains-mono-fonts/JetBrainsMono-Regular.ttf
or whatever, but that seems like it could be covered by a DNF plugin.
Previously, there was a chicken-and-egg scenario where the DNF folks didn't want to touch this while people were still making packages relying on this feature, but since 2018 that's a "SHOULD NOT" in the guidelines. So, I think there's room to move forward, should anyone like to take this on.
https://docs.fedoraproject.org/en-US/packaging-guidelines/#_file_and_directo...
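As a rough illustration of the plugin idea (this is not an existing plugin): the public dnf Python API can already resolve a path to the packages that carry it, so such a plugin would mostly be glue. A hedged sketch, assuming the stock dnf API and reusing the font path from above:

    import dnf

    path = '/usr/share/fonts/jetbrains-mono-fonts/JetBrainsMono-Regular.ttf'

    base = dnf.Base()
    base.read_all_repos()
    base.fill_sack(load_system_repo=False)  # note: this pulls filelists too

    # Ask the sack which available packages own this file, newest only.
    for pkg in base.sack.query().available().filter(file=path).latest():
        print(pkg.name, pkg.evr)

The point being that "dnf install <path>" convenience doesn't strictly require file paths in the dependency metadata everyone downloads; it requires a lookup like this at the moment someone actually asks for a path.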
On Mon, Dec 21, 2020 at 1:42 PM Matthew Miller mattdm@fedoraproject.org wrote:
On Mon, Dec 21, 2020 at 07:14:08PM +0100, Marius Schwarz wrote:
delta rpms save so much time in the form of bandwidth on the client side. If something really needs to change, it is the 50+ MB repo database that gets downloaded. It takes ages on slow connections to download
This needs a followup. I didn't push on it because the DNF team was super-busy with modularity, but if someone wants to pick this up, it'd be a significant improvement:
https://pagure.io/packaging-committee/issue/714
In short, 95% of the dependency data is full filename paths. That's not hyperbole. It's literally 95% by count. Actually probably even more by _space_ since they tend to be long.
Only a tiny fraction of packages use these at all, and almost all of the packages using file deps outside of /usr/bin, /usr/sbin, or /etc could use something else — and of the few using something else, many are actually doing so only in error.
It remains convenient to be able to do
dnf install /usr/share/fonts/jetbrains-mono-fonts/JetBrainsMono-Regular.ttf
or whatever, but that seems like it could be covered by a DNF plugin.
Previously, there was a chicken-and-egg scenario where the DNF folks didn't want to touch this while people were still making packages relying on this feature, but since 2018 that's a "SHOULD NOT" in the guidelines. So, I think there's room to move forward, should anyone like to take this on.
https://docs.fedoraproject.org/en-US/packaging-guidelines/#_file_and_directo...
The main problem is that wiring libsolv to call back to opportunistically fetch and repopulate the solver cache has not been figured out for libdnf. Once we do that, we don't need to do any more work: in most cases only primary.xml will be fetched automatically, and filelists.xml will only be fetched as needed. This is the behavior that YUM v3 had, and it wasn't ported to DNF because we lacked a mechanism to do this. In *theory*, such a mechanism exists now in libsolv, though the API is sufficiently confusing that I'm not sure how to do it exactly.
As someone who has to package for multiple distributions, I would oppose any attempt to cripple DNF to stop supporting file dependencies properly. I *aggressively* use file dependencies to avoid having to litter my spec files with package name dependencies across RH/Fedora, SUSE, Mandriva/Mageia, and others.
On Mon, Dec 21, 2020 at 01:47:19PM -0500, Neal Gompa wrote:
As someone who has to package for multiple distributions, I would oppose any attempt to cripple DNF to stop supporting file dependencies properly. I *aggressively* use file dependencies to avoid having to litter my spec files with package name dependencies across RH/Fedora, SUSE, Mandriva/Mageia, and others.
Do you have examples outside of /etc, /usr/bin, /usr/sbin?
Also, if you _are_ using arbitrary file dependencies, that renders the other part about opportunistic download of these deps kind of moot, since they'll have to be downloaded frequently, right?
Again, I'm not kidding about 95% of the dep points being filenames. It's huge! I don't think that's a good price at all to make everyone pay constantly for packaging convenience. Better to convince packagers to put in cross-distro "Provides" or something.
On Mon, Dec 21, 2020 at 2:14 PM Matthew Miller mattdm@fedoraproject.org wrote:
On Mon, Dec 21, 2020 at 01:47:19PM -0500, Neal Gompa wrote:
As someone who has to package for multiple distributions, I would oppose any attempt to cripple DNF to stop supporting file dependencies properly. I *aggressively* use file dependencies to avoid having to litter my spec files with package name dependencies across RH/Fedora, SUSE, Mandriva/Mageia, and others.
Do you have examples outside of /etc, /usr/bin, /usr/sbin?
Mostly stuff in /usr/libexec and /usr/lib(64).
Also, if you _are_ using arbitrary file dependencies, that renders the other part about opportunistic download of these deps kind of moot, since they'll have to be downloaded frequently, right?
For packages I maintain in Fedora *itself*, I don't need to do this, but for packages I maintain *outside* of Fedora, I *must*.
Again, I'm not kidding about 95% of the dep points being filenames. It's huge! I don't think that's a good price at all to make everyone pay constantly for packaging convenience. Better to convince packagers to put in cross-distro "Provides" or something.
Yes, I know. I've looked at the metadata myself before...
The fact that I can't get openSUSE to properly fully enable the Python module dependency generator (that I maintain upstream in rpm!) after almost two years of trying should be indication enough of how difficult what you're asking really is.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Mon, Dec 21, 2020 at 07:14:08PM +0100, Marius Schwarz wrote:
Am 21.12.20 um 18:53 schrieb Kevin Fenzi:
But in general perhaps we should decide how much value drpms provide these days and either make sure we are making more of them, or drop them.
delta rpms save so much time in the form of bandwidth on the client side.
Well, it's tradeoffs. They save bandwidth and download time on one side, but use lots of cpu cycles and disk space on the other. It just depends on what each person wants based on their situation and hardware.
Right now we are not making very many drpms at all, due to the way the compose process has changed over the years. If we keep drpms around we should really look into making more of them... as they are now, they seldom matter.
kevin
On Tue, Dec 22, 2020 at 02:02:13PM -0800, Kevin Fenzi wrote:
delta rpms save so much time in the form of bandwidth on the client side.
Well, it's tradeoffs. They save bandwidth and download time on one side, but use lots of cpu cycles and disk space on the other. It just depends on what each person wants based on their situation and hardware.
They actually use a lot of cpu cycles on _both_ sides, really.
On Tue, Dec 22, 2020 at 05:09:08PM -0500, Matthew Miller wrote:
On Tue, Dec 22, 2020 at 02:02:13PM -0800, Kevin Fenzi wrote:
delta rpms save so much time in the form of bandwidth on the client side.
Well, it's tradeoffs. They save bandwidth and download time on one side, but use lots of cpu cycles and disk space on the other. It just depends on what each person wants based on their situation and hardware.
They actually use a lot of cpu cycles on _both_ sides, really.
I thought that zchunk would obsolete drpm. What's the story here?
Also, in recent times, any dnf upgrade I did reported "savings" from drpm at the level of <1% [*]. Am I doing something wrong or is this expected? Is there some usage pattern where drpm provides real gain with current Fedora?
Maybe the time has come to just disable DRPM entirely for F34.
Zbyszek
[*] Today on F33:
Delta RPMs reduced 836.8 MB of updates to 836.7 MB (0.1% saved)
On Wed, Dec 30, 2020 at 01:18:38PM +0000, Zbigniew Jędrzejewski-Szmek wrote:
On Tue, Dec 22, 2020 at 05:09:08PM -0500, Matthew Miller wrote:
On Tue, Dec 22, 2020 at 02:02:13PM -0800, Kevin Fenzi wrote:
delta rpms save so much time in the form of bandwidth on the client side.
Well, it's tradeoffs. They save bandwidth and download time on one side, but use lots of cpu cycles and disk space on the other. It just depends on what each person wants based on their situation and hardware.
They actually use a lot of cpu cycles on _both_ sides, really.
I thought that zchunk would obsolete drpm. What's the story here?
Nope, they are different things.
zchunk = a way to only download changed chunks of repodata.
drpms = a way to only download changed chunks of rpms.
Also, in recent times, any dnf upgrade I did reported "savings" from drpm at the level of <1% [*]. Am I doing something wrong or is this expected? Is there some usage pattern where drpm provides real gain with current Fedora?
This is most likely because we are only making drpms against the most recent updates. So, we are making very few drpms and only against things that recently updated.
For example: https://dl.fedoraproject.org/pub/fedora/linux/updates/33/Everything/x86_64/d... (126 drpms for all of f33 updates).
Maybe the time has come to just disable DRPM entirely for F34.
We could. Or try and make them more useful again.
kevin
On Wed, Dec 30, 2020 at 10:10:27AM -0800, Kevin Fenzi wrote:
On Wed, Dec 30, 2020 at 01:18:38PM +0000, Zbigniew Jędrzejewski-Szmek wrote:
On Tue, Dec 22, 2020 at 05:09:08PM -0500, Matthew Miller wrote:
On Tue, Dec 22, 2020 at 02:02:13PM -0800, Kevin Fenzi wrote:
delta rpms save so much time in the form of bandwidth on the client side.
Well, it's tradeoffs. They save bandwidth and download time on one side, but use lots of cpu cycles and disk space on the other. It just depends on what each person wants based on their situation and hardware.
They actually use a lot of cpu cycles on _both_ sides, really.
I thought that zchunk would obsolete drpm. What's the story here?
Nope, they are different things.
zchunk = a way to only download changed chunks of repodata.
drpms = a way to only download changed chunks of rpms.
Right, it did feel a bit like I was missing some important chunk of the picture ;)
Also, in recent times, any dnf upgrade I did reported "savings" from drpm at the level of <1% [*]. Am I doing something wrong or is this expected? Is there some usage pattern where drpm provides real gain with current Fedora?
This is most likely because we are only making drpms against the most recent updates. So, we are making very few drpms and only against things that recently updated.
So... people who actually care about the total download are likely not to update all the time, which also means that drpms will not work for them.
For example: https://dl.fedoraproject.org/pub/fedora/linux/updates/33/Everything/x86_64/d... (126 drpms for all of f33 updates).
So... that means that drpms wouldn't even make a difference for people who update often.
...and the proposed Change would require additional contortions to allow drpms to work. It sounds like drpms are not worth the trouble anymore. The effort to make them work properly would be large. I think the crucial bit is that we have more packages and updates than ever, and at the same time more people update at custom schedules, so any reasonable subset of drpms will cover a shrinking subset of upgrades.
Zbyszek
Maybe the time has come to just disable DRPM entirely for F34.
We could. Or try and make them more useful again.
On Sat, 2021-01-02 at 13:42 +0000, Zbigniew Jędrzejewski-Szmek wrote:
On Wed, Dec 30, 2020 at 10:10:27AM -0800, Kevin Fenzi wrote:
This is most likely because we are only making drpms against the most recent updates. So, we are making very few drpms and only against things that recently updated.
So... people who actually care about the total download are likely not to update all the time, which also means that drpms will not work for them.
For example: https://dl.fedoraproject.org/pub/fedora/linux/updates/33/Everything/x86_64/d... (126 drpms for all of f33 updates).
So... that means that drpms wouldn't even make a difference for people who update often.
...and the proposed Change would require additional contortions to allow drpms to work. It sounds like drpms are not worth the trouble anymore. The effort to make them work properly would be large. I think the crucial bit is that we have more packages and updates than ever, and at the same time more people update at custom schedules, so any reasonable subset of drpms will cover a shrinking subset of upgrades.
Zbyszek
Maybe the time has come to just disable DRPM entirely for F34.
We could. Or try and make them more useful again.
FWIW, I also think it's time for drpms to go. Aside from any potential issues with the proposed change, they haven't been useful in Fedora for three years (see https://pagure.io/releng/issue/7215), and nobody's been able to put in the time to fix it yet. If that changed and someone was willing to step up and commit to fixing this, I'd feel very differently.
In addition, drpms aren't even working at the moment. Something has changed during the last week or so that's broken them (see https://bugzilla.redhat.com/show_bug.cgi?id=1911828). I'll take a look, but, being honest, there's not much motivation to investigate this when drpms are of such marginal use in Fedora at the moment.
Jonathan
On Sat, 2021-01-02 at 18:12 +0000, Jonathan Dieter wrote:
FWIW, I also think it's time for drpms to go. Aside from any potential issues with the proposed change, they haven't been useful in Fedora for three years (see https://pagure.io/releng/issue/7215), and nobody's been able to put in the time to fix it yet. If that changed and someone was willing to step up and commit to fixing this, I'd feel very differently.
In addition, drpms aren't even working at the moment. Something has changed during the last week or so that's broken them (see https://bugzilla.redhat.com/show_bug.cgi?id=1911828). I'll take a look, but, being honest, there's not much motivation to investigate this when drpms are of such marginal use in Fedora at the moment.
Jonathan
Apologies for the odd quoting in the previous email; Evolution decided that what you see isn't what you get. :) I've trimmed out everything but my response here.
Jonathan
On Sat, Jan 02, 2021 at 06:17:03PM +0000, Jonathan Dieter wrote:
On Sat, 2021-01-02 at 18:12 +0000, Jonathan Dieter wrote:
FWIW, I also think it's time for drpms to go. Aside from any potential issues with the proposed change, they haven't been useful in Fedora for three years (see https://pagure.io/releng/issue/7215), and nobody's been able to put in the time to fix it yet. If that changed and someone was willing to step up and commit to fixing this, I'd feel very differently.
It's not been something that's a priority. ;(
I think the way to do it would be to drop making drpms from the bodhi pungi run and set up a script to manage them: create them, make the repos, keep N days of old ones from the last repos, etc. I'd be happy to help interested folks with requirements and such, but I don't think I can commit to fixing it.
I remember when drpms landed I heard people say they chose Fedora because of them. That may have changed over the years I guess. :) And there have been only 2 or 3 reports about how few drpms exist in the last few years (i.e., most people didn't really notice).
In addition, drpms aren't even working at the moment. Something has changed during the last week or so that's broken them (see https://bugzilla.redhat.com/show_bug.cgi?id=1911828). I'll take a look, but, being honest, there's not much motivation to investigate this when drpms are of such marginal use in Fedora at the moment.
Yeah, understand...
kevin
On Sun, Jan 03, 2021 at 03:25:29PM -0800, Kevin Fenzi wrote:
I remember when drpms landed I heard people say they chose Fedora because of them. That may have changed over the years I guess. :) And there have been only 2 or 3 reports about how few drpms exist in the last few years (i.e., most people didn't really notice).
Hmmm, here's an idea: what if instead of nightly drpms, we made them only every two weeks, but always exactly two weeks, so that people updating on a specific cadence would get them?
On Mon, 2021-01-04 at 11:25 -0500, Matthew Miller wrote:
On Sun, Jan 03, 2021 at 03:25:29PM -0800, Kevin Fenzi wrote:
I remember when drpms landed I heard people say they chose Fedora because of them. That may have changed over the years I guess. :) And there have been only 2 or 3 reports about how few drpms exist in the last few years (i.e., most people didn't really notice).
Hmmm, here's an idea: what if instead of nightly drpms, we made them only every two weeks, but always exactly two weeks, so that people updating on a specific cadence would get them?
There's been a lot of interesting talk about the state and future of drpm. I'd like to propose we continue the conversation about that with a different subject line :)
- Matthew
On Mon, Jan 04, 2021 at 10:21:15PM +0000, Matthew Almond via devel wrote:
There's been a lot of interesting talk about the state and future of drpm. I'd like to propose we continue the conversation about that with a different subject line :)
Okay, fair. I have a proposal.
Right now, the problem is that making delta rpms is expensive, and therefore we aren't making very many, which makes them even less useful. Plus, we're only making them between updates and for packages where those updates are frequent, which means you need to keep on top of things; that may be best practice, but it is most difficult for the low-bandwidth users who might benefit the most in the first place.
So, the first thing we need to do to fix this is move deltarpm creation out of the updates process. Kevin Fenzi tells me this would mean we'd need a separate delta RPMs repo, which doesn't sound like a bad thing to me, but we're not sure offhand if DNF can handle that without modification.
This would let us make the delta RPMs asynchronously and not block updates. And, it would also give us the ability to roughly see how important they are to users, because we could see how popular that repository is compared to the updates repo.
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off. But that's definitely plan B. We can point people who are in low-bandwidth situations at Silverblue, CoreOS, and Kinoite as the preferred approach.
On Mon, Jan 04, 2021 at 06:29:13PM -0500, Matthew Miller wrote:
On Mon, Jan 04, 2021 at 10:21:15PM +0000, Matthew Almond via devel wrote:
There's been a lot of interesting talk about the state and future of drpm. I'd like to propose we continue the conversation about that with a different subject line :)
Okay, fair. I have a proposal.
Right now, the problem is that making delta rpms is expensive, and therefore we aren't making very many, which makes them even less useful. Plus, we're only making them between updates and for packages where those updates are frequent, which means you need to keep on top of things; that may be best practice, but it is most difficult for the low-bandwidth users who might benefit the most in the first place.
So, the first thing we need to do to fix this is move deltarpm creation out of the updates process. Kevin Fenzi tells me this would mean we'd need a separate delta RPMs repo, which doesn't sound like a bad thing to me, but we're not sure offhand if DNF can handle that without modification.
Yeah, I don't recall how dnf looks for drpms. Right now they are in the same repo, using the same repodata.
If we moved them to a new repo would they get found correctly?
This would let us make the delta RPMs asynchronously and not block updates. And, it would also give us the ability to roughly see how important they are to users, because we could see how popular that repository is compared to the updates repo.
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off. But that's definitely plan B. We can point people who are in low-bandwidth situations at Silverblue, CoreOS, and Kinoite as the preferred approach.
Yeah, I came up with one more possible way we could get more drpms with our current setup, but need to talk to pungi maintainers and see if it's doable. :) After that, it's either split things out or drop drpms I think.
kevin
On 05. 01. 21 at 0:50, Kevin Fenzi wrote:
On Mon, Jan 04, 2021 at 06:29:13PM -0500, Matthew Miller wrote:
On Mon, Jan 04, 2021 at 10:21:15PM +0000, Matthew Almond via devel wrote:
There's been a lot of interesting talk about the state and future of drpm. I'd like to propose we continue the conversation about that with a different subject line :)
Okay, fair. I have a proposal.
Right now, the problem is that making delta rpms is expensive, and therefore we aren't making very many, which makes them even less useful. Plus, we're only making them between updates and for packages where those updates are frequent, which means you need to keep on top of things; that may be best practice, but it is most difficult for the low-bandwidth users who might benefit the most in the first place.
So, the first thing we need to do to fix this is move deltarpm creation out of the updates process. Kevin Fenzi tells me this would mean we'd need a separate delta RPMs repo, which doesn't sound like a bad thing to me, but we're not sure offhand if DNF can handle that without modification.
Yeah, I don't recall how dnf looks for drpms. Right now they are in the same repo, using the same repodata.
If we moved them to a new repo would they get found correctly?
This would let us make the delta RPMs asynchronously and not block updates. And, it would also give us the ability to roughly see how important they are to users, because we could see how popular that repository is compared to the updates repo.
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off. But that's definitely plan B. We can point people who are in low-bandwidth situations at Silverblue, CoreOS, and Kinoite as the preferred approach.
Yeah, I came up with one more possible way we could get more drpms with our current setup, but need to talk to pungi maintainers and see if it's doable. :) After that, it's either split things out or drop drpms I think.
To be honest, I don't understand why drpms are related to Pungi at all.
Deltas are optional: if they're not available, a normal RPM is used. They can be processed asynchronously (as mentioned earlier in this thread) and injected into repos once they're ready.
Please note that we're talking about 74 drpms in the F33 x86_64 updates repo: http://ftp.fi.muni.cz/pub/linux/fedora/linux/updates/33/Everything/x86_64/dr...
Sometimes I wonder if it's worth it and if Fedora shouldn't move away from drpms.
On 05. 01. 21 at 0:29, Matthew Miller wrote:
So, the first thing we need to do to fix this is move deltarpm creation out of the updates process.
Right.
Kevin Fenzi tells me this would mean we'd need a separate delta RPMs repo,
Why? You can do that in the same repo. You just need to run `createrepo_c --deltas --num-deltas X` once every X days/hours.
On Tue, 5 Jan 2021 at 03:50, Miroslav Suchý msuchy@redhat.com wrote:
On 05. 01. 21 at 0:29, Matthew Miller wrote:
So, the first thing we need to do to fix this is move deltarpm creation out of the updates process.
Right.
Kevin Fenzi tells me this would mean we'd need a separate delta RPMs repo,
Why? You can do that in the same repo. You just need to run `createrepo_c --deltas --num-deltas X` once every X days/hours.
<rant target="the Fedora build system in general" non-target="msuchy for giving a useful idea">
Get pungi and all the other tools in the build system which touch the repos and expect things to be done in a certain way to not break, corrupt or make releng's life a daily nightmare and you are golden.
Remember the Fedora build system is a Rube Goldberg machine[1] where every group who has a new idea about how to make Linux easier to compose/consume/etc has stuck something in. None of them have really been designed to work with each other, and various parts are completely running on luck and mad patching (hi PDC). Every time someone says 'just do this one thing' it turns into a cascade of broken bits where everyone spends a month or 2 blaming (1) releng for adding in one more thing and (2) every other group for messing with their perfect tool. Since we usually have 1-2 months to get something in place before the next release.. that means we have whatever time is left over from the poop-throwing festival to monkey-patch it for another release and then live with the system breaking daily for a couple of months. [Then watch Kevin and Mohan grow more Stockholm syndrome to say that the system is fine.. just like the previous release engineers who have departed from Fedora.]
Patches, ideas and fixes are indeed helpful, but what would be more helpful is getting everyone with a say on the build system and their one thing that they want from Release Engineering in a room to work out what the entire developer and build system experience should be with an idea of how to make it more manageable and more able to slot in things in and out versus 'aaaaah {'mass rebuild','beta release','final release','2 week holiday'} is in 2 days get that tool working' </rant>
[1] https://en.wikipedia.org/wiki/Rube_Goldberg_machine
On Tue, Jan 05, 2021 at 09:49:59AM +0100, Miroslav Suchý wrote:
On 05. 01. 21 at 0:29, Matthew Miller wrote:
So, the first thing we need to do to fix this is move deltarpm creation out of the updates process.
Right.
Kevin Fenzi tells me this would mean we'd need a separate delta RPMs repo,
Why? You can do that in the same repo. You just need to run `createrepo_c --deltas --num-deltas X` once every X days/hours.
On Tue, Jan 05, 2021 at 06:30:37PM +0100, Daniel Mach wrote:
To be honest, I don't understand why drpms are related to Pungi at all.
Deltas are optional: if they're not available, a normal RPM is used. They can be processed asynchronously (as mentioned earlier in this thread) and injected into repos once they're ready.
Please note that we're talking about 74 drpms in the F33 x86_64 updates repo: http://ftp.fi.muni.cz/pub/linux/fedora/linux/updates/33/Everything/x86_64/dr...
Sometimes I wonder if it's worth it and if Fedora shouldn't move away from drpms.
so, ok then. I guess people are still confused about this. Here's my attempt to explain in detail how it currently works:
When bodhi does an updates push (say for f33-updates), it does a lot of things. It checks the updates that are pending f33 stable, it locks them so no one can mess with them in the UI until it's done, it makes sure they are signed, it tells koji to move the packages into the f33-updates tag, then it calls pungi to actually do the heavy lifting.
This pungi process then talks to koji and says "hey, give me the latest tagged packages for the 'f33-updates' tag signed with key xyz". It puts them in directories for arch and type and such and runs createrepo_c to make the repodata. This is the point where it makes drpms. In order to make a drpm, createrepo_c needs to know what you want to make drpms for. It also has to have both the OLD and NEW versions available to make the drpm. createrepo_c also makes the normal repodata.
In our above case pungi has the current repos it's making and the f33-updates repo. Thus all the drpms it can make are ones where a new package version is being added to the repos and there is an older version available in f33-updates. It doesn't have access to all those versions before or after the ones it has. It only has those two.
Once pungi is done, bodhi then does more things (emailing people, updating notes, etc) and... importantly, updates the repodata with the security information (so you can know what are security updates, etc).
Then, that entire tree is synced to the master mirrors.
On the next f33-updates push the entire process runs again. It never _updates_ existing repos, it always creates them. This means that if you have foo-1.0-1 in f33-updates and foo-1.1-1 comes along, it will make a drpm between them, and that drpm will exist _only_ on the day it added foo-1.1-1. The next day it will be gone. This is why there are so few drpms: it's only generating them for the things that it could at the time of the last push. So if you happen to update on a day when things you have installed were updated, you would see the drpms. If you happen to update the next day, you would not.
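In other words, the candidate set on any given day is just the packages present in both repos with differing versions. A toy sketch of that pairing logic, with invented package names and versions:

    # What yesterday's push published vs. what today's push is about to.
    prev_repo = {'foo': '1.0-1', 'bar': '2.3-1', 'baz': '5-1'}
    new_repo  = {'foo': '1.1-1', 'bar': '2.3-1', 'baz': '5-2', 'qux': '1-1'}

    # createrepo_c can only pair what it sees: these two trees, nothing older.
    candidates = [(name, prev_repo[name], ver)
                  for name, ver in new_repo.items()
                  if name in prev_repo and prev_repo[name] != ver]
    print(candidates)  # [('foo', '1.0-1', '1.1-1'), ('baz', '5-1', '5-2')]
    # A user whose installed foo is older than 1.0-1 gets no usable delta,
    # and tomorrow's push forgets these pairs entirely.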
So, my last thought was to teach pungi about all the old updates trees (which are in the same directory as it makes the new one) and have it gather all the old drpms from those and expire them at some configurable time. This would not use more cycles to make them, and would make the chances of a user being able to use them much higher. But I am not sure this is possible/if pungi maintainers are willing to implement this. It would mean that createrepo_c would need to know about those old drpms to add them to metadata.
Failing that we could move the drpm creation to another process/repo, but... drpms have to be in the same repodata as the repo they are for, right? Or can they be in another one?
Hope that clarifies more than it confuses...
kevin
On 05. 01. 21 at 19:44, Kevin Fenzi wrote:
On the next f33-updates push the entire process runs again. It never _updates_ existing repos, it always creates them.
Ahh. So this all worked when we ran the process once per week. But because we run it every day now, the deltas are minimal. It just took us several months to notice.
On Tue, Jan 05, 2021 at 08:01:04PM +0100, Miroslav Suchý wrote:
On 05. 01. 21 at 19:44, Kevin Fenzi wrote:
On the next f33-updates push the entire process runs again. It never _updates_ existing repos, it always creates them.
Ahh. So this all worked when we ran the process once per week.
It worked differently back when we used mash to create updates, I think because we also did drpms from the GA/base repo. Which we could do now, but it would only help users on their initial 'dnf update'.
But because we run it every day now, the deltas are minimal. It just took us several months to notice.
It's been this way since we moved to pungi. 3+ years.
kevin
* Matthew Miller:
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off.
Is it really saving bandwidth, though? The reported savings are generally very small for me. Downloading the metadata costs something as well.
Thanks, Florian
Is it really saving bandwidth, though? The reported savings are generally very small for me. Downloading the metadata costs something as well.
In F33, mostly so. I generally keep up to date (update once a week), but available deltarpms have been fewer compared to earlier versions. I used deltarpm from the day it was introduced in Fedora, and found it quite useful due to a limited internet connection (speed/bandwidth/data limit) — for instance TeXLive package updates (which, afaict, don't generate deltarpms in f33) etc.
On Tue, Jan 05, 2021 at 11:30:10AM +0100, Florian Weimer wrote:
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off.
Is it really saving bandwidth, though? The reported savings are generally very small for me. Downloading the metadata costs something as well.
If we made a lot more of them, they could save significant bandwidth.
* Matthew Miller:
On Tue, Jan 05, 2021 at 11:30:10AM +0100, Florian Weimer wrote:
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off.
Is it really saving bandwidth, though? The reported savings are generally very small for me. Downloading the metadata costs something as well.
If we made a lot more of them, they could save significant bandwidth.
The metadata would also be much larger, and so would be the battery usage to recompress the payload. 8-(
Thanks, Florian
On Tue, Jan 5, 2021 at 3:46 PM Florian Weimer fweimer@redhat.com wrote:
The metadata would also be much larger, and so would be the battery usage to recompress the payload. 8-(
And while the bandwidth reduction has value, cpu and wallclock time to rebuild the rpm is substantially increased for low-end devices such as ARM SoCs with slow (compared to recent gen x86) cpus and slow (compared to recent nvme) storage devices such as SD cards, compared to just downloading the entire rpm.
On Mon, Jan 4, 2021, at 6:29 PM, Matthew Miller wrote:
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off. But that's definitely plan B. We can point people who are in low-bandwidth situations at Silverblue, CoreOS, and Kinoite as the preferred approach.
Please don't use phrasing like this that implies e.g. CoreOS is distinct from "Fedora".
A much technically clearer way to say this would be "traditional dnf Fedora" versus rpm-ostree.
But even then it's not fully distinct because rpm-ostree also links to libdnf and whenever you use package layering, it's all the same RPM tools on the wire. Though...ah right, the deltarpm implementation lives in the dnf Python code, not the libdnf library, so rpm-ostree doesn't do that.
Second - and this should be emphasized - a common case at least on Silverblue is that you run dnf inside a toolbox-style container (or more than one!). So all bandwidth improvements apply there too. In other words this (implicit) contrast between the two is false because in both cases there are hybrids.
Now speaking of deltas - really, delta implementations are going to benefit from a stronger "cadence" to releases, much like what we do for CoreOS (but not Silverblue/IoT) today. The relationship of such a system and Bodhi is...messy. ostree deltas are also *much* better than deltarpm in various ways (most notably the CPU-intensive part is bsdiff, which we only use selectively instead of on the whole thing).
On the other hand, we really want deltas too for containers; that's https://github.com/containers/image/pull/902
A very tricky case is the intersection of all of these; for my "dev container"/toolbox on my Silverblue workstation I use a custom container built on a server with all of my tools, but I do often `yum update` inside it since that works incrementally and online. (But I do periodically flush and re-pull) If we implemented container deltas I'd be a lot more likely to use `podman` to update it instead.
But anyways, please either explicitly spell out "Fedora CoreOS" to avoid an implicit contrast and making it seem like a separate thing from "Fedora", or go more technical and say "rpm-ostree variant" or so. Thanks!
On Tue, Jan 5, 2021 at 8:34 AM Colin Walters walters@verbum.org wrote:
On Mon, Jan 4, 2021, at 6:29 PM, Matthew Miller wrote:
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off. But that's definitely plan B. We can point people who are in low-bandwidth situations at Silverblue, CoreOS, and Kinoite as the preferred approach.
Please don't use phrasing like this that implies e.g. CoreOS is distinct from "Fedora".
But as of right now, it *is*. Perhaps that will change if FCOS realigns with Fedora as part of being promoted to Edition status, but right now, it's different enough content-wise that it's less like Fedora than most of us would want it to be.
A much technically clearer way to say this would be "traditional dnf Fedora" versus rpm-ostree.
But even then it's not fully distinct because rpm-ostree also links to libdnf and whenever you use package layering, it's all the same RPM tools on the wire. Though...ah right, the deltarpm implementation lives in the dnf Python code, not the libdnf library, so rpm-ostree doesn't do that.
Second - and this should be emphasized - a common case at least on Silverblue is that you run dnf inside a toolbox-style container (or more than one!). So all bandwidth improvements apply there too. In other words this (implicit) contrast between the two is false because in both cases there are hybrids.
Now speaking of deltas - really, delta implementations are going to benefit from a stronger "cadence" to releases, much like what we do for CoreOS (but not Silverblue/IoT) today. The relationship of such a system and Bodhi is...messy. ostree deltas are also *much* better than deltarpm in various ways (most notably the CPU-intensive part is bsdiff, which we only use selectively instead of on the whole thing).
Cadence only matters if the infrastructure around delta fetching requires it to care. In the case of RPM-OSTree and Flatpak, this is not a problem as long as you're using native OSTree remotes. If you're using OCI image remotes instead, then you *do* have to care about cadence because you have to maintain images and generate deltas based on possible options. The latter option is how we deliver Flatpaks, and so we have the same problem we have with DeltaRPMs.
On the other hand, we really want deltas too for containers; that's https://github.com/containers/image/pull/902
A very tricky case is the intersection of all of these; for my "dev container"/toolbox on my Silverblue workstation I use a custom container built on a server with all of my tools, but I do often `yum update` inside it since that works incrementally and online. (But I do periodically flush and re-pull) If we implemented container deltas I'd be a lot more likely to use `podman` to update it instead.
Addressing the underlying issue here: container deltas and OSTree deltas are considerably worse for constrained bandwidth than RPM deltas. Outside of the USA (in the general case) and within the USA (in several parts of the country), it is extremely common to have extremely limited bandwidth availability and even more common to have low throughput. As that will basically never change, we have to work with that framework.
Regular Fedora variants offer users the ability to pick and choose updates based on their situation. If DeltaRPMs were implemented in our infrastructure correctly, those users would be very well-served: we could publish a sliding window of the last 30 days of delta RPM content. What's particularly galling here is that we have all the necessary inputs to do it, since Koji keeps everything and Bodhi knows everything that's ever been pushed. It's pretty much a Pungi restriction that we've never been able to do this properly.
To be blunt, I would have never done Zchunk metadata if it was going to be used as a tool to kill DeltaRPMs. I firmly believe we need both to have a comprehensive offering that accommodates the needs of Fedora users across the world.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Tue, Jan 05, 2021 at 08:49:20AM -0500, Neal Gompa wrote:
On Tue, Jan 5, 2021 at 8:34 AM Colin Walters walters@verbum.org wrote:
Now speaking of deltas - really, delta implementations are going to benefit from a stronger "cadence" to releases, much like what we do for CoreOS (but not Silverblue/IoT) today. The relationship of such a system and Bodhi is...messy. ostree deltas are also *much* better than deltarpm in various ways (most notably the CPU-intensive part is bsdiff, which we only use selectively instead of on the whole thing).
Cadence only matters if the infrastructure around delta fetching requires it to care. In the case of RPM-OSTree and Flatpak, this is not a problem as long as you're using native OSTree remotes. If you're using OCI image remotes instead, then you *do* have to care about cadence because you have to maintain images and generate deltas based on possible options. The latter option is how we deliver Flatpaks, and so we have the same problem we have with DeltaRPMs.
On the other hand, we really want deltas too for containers; that's https://github.com/containers/image/pull/902
A very tricky case is the intersection of all of these; for my "dev container"/toolbox on my Silverblue workstation I use a custom container built on a server with all of my tools, but I do often `yum update` inside it since that works incrementally and online. (But I do periodically flush and re-pull) If we implemented container deltas I'd be a lot more likely to use `podman` to update it instead.
Addressing the underlying issue here: container deltas and OSTree deltas are considerably worse for constrained bandwidth than RPM deltas. Outside of the USA (in the general case) and within the USA (in several parts of the country), it is extremely common to have extremely limited bandwidth availability and even more common to have low throughput. As that will basically never change, we have to work with that framework.
Regular Fedora variants offer users the ability to pick and choose updates based on their situation. If DeltaRPMs were implemented in our infrastructure correctly, those users would be very well-served: we could publish a sliding window of the last 30 days of delta RPM content. What's particularly galling here is that we have all the necessary inputs to do it, since Koji keeps everything and Bodhi knows everything that's ever been pushed. It's pretty much a Pungi restriction that we've never been able to do this properly.
Another thought: we could use popcon-like information to generate delta rpms only for the N% most popular packages (10%?). This would significantly cut down on the cost of generation, without really affecting average user savings.
Yet another reason why popcon would be useful.
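A sketch of what that selection could look like, assuming popcon-style install counts existed (package names and numbers are invented for the example):

    def delta_worthy(updated_pkgs, install_counts, top_fraction=0.10):
        # Keep only the most-installed fraction of the updated packages.
        ranked = sorted(updated_pkgs,
                        key=lambda p: install_counts.get(p, 0), reverse=True)
        keep = max(1, int(len(ranked) * top_fraction))
        return set(ranked[:keep])

    counts = {'glibc': 950_000, 'kernel': 940_000, 'obscure-lib': 1_200}
    print(delta_worthy(['glibc', 'kernel', 'obscure-lib'], counts))
    # -> {'glibc'}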
To be blunt, I would have never done Zchunk metadata if it was going to be used as a tool to kill DeltaRPMs. I firmly believe we need both to have a comprehensive offering that accommodates the needs of Fedora users across the world.
Zbyszek
On 05. 01. 21 at 15:31, Zbigniew Jędrzejewski-Szmek wrote:
Yet another reason why popcon would be useful.
https://github.com/xsuchy/popcon-for-fedora-old Feel free to take it :)
On Tue, 2021-01-05 at 08:49 -0500, Neal Gompa wrote:
To be blunt, I would have never done Zchunk metadata if it was going to be used as a tool to kill DeltaRPMs. I firmly believe we need both to have a comprehensive offering that accommodates the needs of Fedora users across the world.
Hey Neal, I'm not sure where you're going with that first sentence, but I think it's pretty obvious that zchunk and deltarpms solve different problems and, following this thread, I don't think anyone has suggested that we should kill deltarpms *because* we have zchunk metadata.
When we first brought deltarpms into Fedora, a savings of 60-90% when doing updates was normal. Now that we're losing the deltarpms after each push (as we have been for the last three years), the savings is significantly lower (I normally see less than 10%) and that makes it hard to be motivated to fix the bugs that inevitably arise.
It sounds like there might be a plan to keep deltarpms beyond a single push, and, if that happens, I will be more than happy to keep on dealing with deltarpm bugs. :)
Thanks, Jonathan
Hi,
we aren't making very many, which makes them even less useful. Plus, we're only making them between updates and for packages where those updates are frequent, which means you need to keep on top of things; that may be best practice, but it is most difficult for the low-bandwidth users who might benefit the most in the first place.
I'm a low bandwidth user. And my setup basically is:
(1) route all updates through a squid caching proxy.
(2) configure all fedora machines to use the same fixed mirror.
(3) disable drpms.
(4) disable zchunk.
There are always cases where you need the full rpm anyway (for example fresh installs with the update repo enabled), so just loading (+caching!) the full rpms and not bothering with drpms works better overall.
The problem with zchunk is that it isn't cache-friendly. squid can't cache range requests. And even in the case of a full download (fresh install) I've seen zchunk metadata being re-downloaded when requested again ...
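For anyone who wants to see the cache-unfriendliness directly: a zchunk-style fetch is a plain HTTP range request, and squid in its default configuration won't cache the resulting 206 response. A small sketch (the mirror URL is hypothetical):

    import requests

    url = 'https://mirror.example.org/fedora/repodata/primary.xml.zck'
    r = requests.get(url, headers={'Range': 'bytes=0-32767'})
    print(r.status_code)                   # 206 on a range-capable mirror
    print(r.headers.get('Content-Range'))  # e.g. 'bytes 0-32767/52428800'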
take care, Gerd
On Mon, Jan 04, 2021 at 06:29:13PM -0500, Matthew Miller wrote:
I also remember when this was a killer feature for Fedora, and without any real way of judging use and demand, I'm hesitant to kill it off. But that's definitely plan B. We can point people who are in low-bandwidth situations at Silverblue, CoreOS, and Kinoite as the preferred approach.
It is also very difficult to measure - for my part I'm happy for every MB saved when downloading updates at home, because it directly translates into waiting time. At work I don't have a bandwidth problem, so it's way less of an issue there.
But I don't see anywhere how much the potential is - iirc it sometimes saves hundreds of MBs, which translates into something like 10-20 minutes of saved time.
Even though it's after-the-fact information, I'm still happy it's doing its job. (Interestingly, on the last update almost half of the md5 sums mismatched for the drpms, which increased the download size from 12.3MB to 14.0MB - but this seems to be a rare problem.)
Personally I'd like to stay on regular Fedora Workstation / Fedora Server - I'd probably decide to take increased download times instead of switching distros/editions if drpm gets removed.
All the best, Astra
On Mon, 2020-12-21 at 09:53 -0800, Kevin Fenzi wrote:
Cool. A few questions inline...
On Mon, Dec 21, 2020 at 11:28:51AM -0500, Ben Cotton wrote:
https://fedoraproject.org/wiki/Changes/RPMCoW
== Summary ==
RPM Copy on Write provides a better experience for Fedora Users as it reduces the amount of I/O and offsets CPU cost of package decompression. RPM Copy on Write uses reflinking capabilities in btrfs, which is the default filesystem in Fedora 33.
What happens if you enable this on non-btrfs installs? Does it just not work gracefully? Does it fail somehow?
It would be slower but still works; see note #5
Of note, even on systems with Btrfs/XFS that support reflinks, falling back to copying is still needed for e.g. files in /boot or /boot/EFI
On Monday, December 21, 2020 11:28:51 AM EST Ben Cotton wrote:
[snip]
# The file format for RPMs is different with Copy on Write. The headers are identical, but the payload is different. There is also a footer. ## Files are converted (“transcoded”) locally during download using <code>/usr/bin/rpm2extents</code> (part of rpm codebase). The format is not intended to be “portable” - i.e. copying the files from the cache is not supported.
I currently download once and upgrade three different systems by rsync-ing the cache.
Do I understand that this will no longer be supported or work?
I currently download once and upgrade three different systems by rsync-ing the cache.
Do I understand that this will no longer be supported or work?
That's an interesting question. Is the cache directory from a single host intended to be shared like this? I am guessing no, but it may still be common.
It should still work, with three caveats:
1. The files in the cache will be bigger, so a simple rsync will involve more I/O, and the destination filesystem will also need more space and I/O time.
2. The systems must be the same endianness (the transcoded format doesn't bother with network order, because it's not intended to be shared).
3. The page size must be the same for reflinking to work. This is actually worked out when the filesystem is created, and defaults to the system page size; if it's not the same as the current page size, the filesystem isn't even guaranteed to mount (see the --sectorsize option in the mkfs.btrfs man page).
In reality you're quite unlikely to share packages unless the architectures are the same, which would steer both endianness and page size to the same value. That said, I'm aware that aarch64 can be flexible in both ways. I'm covering my bases with my statement: I have thought about it, and I don't think I'm in any position to make promises.
For this proposal: we're talking about shipping the code that would allow this to be turned on. We're not talking about enabling it by default. We can't until we have good answers to questions like this.
Thanks, Matthew.
On Tuesday, December 22, 2020 4:54:34 PM EST Matthew Almond via devel wrote:
I currently download once and upgrade three different systems by rsync-ing the cache.
Do I understand that this will no longer be supported or work?
That's an interesting question. Is the cache directory from a single host intended to be shared like this? I am guessing no, but it may still be common.
It should still work, with three caveats:
1. The files in the cache will be bigger, so a simple rsync will involve more I/O, and the destination filesystem will also need more space and I/O time.
2. The systems must be the same endianness (the transcoded format doesn't bother with network order, because it's not intended to be shared).
3. The page size must be the same for reflinking to work. This is actually worked out when the filesystem is created, and defaults to the system page size; if it's not the same as the current page size, the filesystem isn't even guaranteed to mount (see the --sectorsize option in the mkfs.btrfs man page).
In reality you're quite unlikely to share packages unless the architectures are the same, which would steer both endianness and page size to the same value. That said, I'm aware that aarch64 can be flexible in both ways. I'm covering my bases with my statement: I have thought about it, and I don't think I'm in any position to make promises.
For this proposal: we're talking about shipping the code that would allow this to be turned on. We're not talking about enabling it by default. We can't until we have good answers to questions like this.
Understood.
To be clear, all three systems are x86_64, with identical endianness, architecture, and page size (as far as I know).
Also, this isn't a big deal, really. I just wanted to reduce network bandwidth without operating a local mirror.
_______________________

sudo rsync -a --password-file=/etc/rsync.password --delete rsync://rsync@vfr/dnf /var/cache/dnf; sudo dnf --enablerepo=updates-testing upgrade
On Mon, Dec 21, 2020, at 11:28 AM, Ben Cotton wrote:
https://fedoraproject.org/wiki/Changes/RPMCoW
== Summary ==
RPM Copy on Write provides a better experience for Fedora Users as it reduces the amount of I/O and offsets CPU cost of package decompression. RPM Copy on Write uses reflinking capabilities in btrfs, which is the default filesystem in Fedora 33.
== Owners ==
- Name: [[User:malmond|Matthew Almond]], [[User:dcavalca|Davide Cavalca]]
- Email: malmond@fb.com, dcavalca@fb.com
== Detailed description ==
Installing and upgrading software packages is a standard part of managing the lifecycle of any operating system. For the entire lifecycle of Fedora, all software is packaged and distributed using the RPM file format. This proposal changes how software is downloaded and installed, leaving the distribution process unmodified.
=== Current process ===
# Resolve packaging request into a list of packages and operations # Download and verify new packages # Install and/or upgrade packages sequentially using RPM files, decompressing, and writing a copy of the new files to storage.
=== New process ===
# Resolve packaging request into a list of packages and operations # Download and '''decompress''' packages into a '''locally optimized''' rpm file
Please verify the signature on the downloaded RPM before decompressing it. (Do we do this already?)
# Install and/or upgrade packages sequentially using RPM files, using '''reference linking''' (reflinking) to reuse data already on disk.
Sounds like a great improvement! Any real-world data on how much time it saves, how much it changes disk usage, or how much SSD writes it saves?
The outcome is intended to be the same, but the order of operations is different.
# Decompression happens inline with download. This has a positive effect on resource usage: downloads are typically limited by bandwidth. Decompression and writing the full data into a single file per rpm is essentially free. Additionally: if there is more than one download at a time, a multi-CPU system can be better utilized. All compression types supported in RPM work because this uses the rpm I/O functions.
As I referenced above, I think each chunk should also be verified before decompressing.
# RPMs are cached on local storage between downloading and installation time as normal. This allows DNF to defer actual RPM installation to when all the RPM are available. This is unchanged. # The file format for RPMs is different with Copy on Write. The headers are identical, but the payload is different. There is also a footer. ## Files are converted (“transcoded”) locally during download using <code>/usr/bin/rpm2extents</code> (part of rpm codebase). The format is not intended to be “portable” - i.e. copying the files from the cache is not supported.
I think these should be made to be portable. How many variants of these are there? Would it be difficult to make the transcoder also understand RPMs transcoded for a different platform/setup? Eventually, I'd like to see additional signatures added to the RPM for each of the variants so RPM itself can do the verification at install time, avoiding a transcode to the "canonical" format. (I suppose this might require a build-time or sign-time transcode to each of the other variants.) Until then, I'd like to ensure that the package signatures are being verified in a secure manner, which would be necessary for the plugin to be able to install packages not built with multiple signatures/digests.
Would it be practical to just have a single format aligned to the largest page size known, leaving fs holes as necessary on systems with smaller page sizes?
## Regular RPMs use a compressed .cpio based payload. In contrast, extent based RPMs contain uncompressed data aligned to the fundamental page size of the architecture, e.g. 4KiB on x86_64. This alignment is required for <code>FICLONERANGE</code> to work. Only files are represented in the payload, other directory entries like symlinks, device nodes etc are constructed entirely from rpm header information. Files are referenced by their digest, so identical files are de-duplicated.
How are hardlinks in an RPM handled? Do they stay as hardlinks or become reflinks only, losing the hardlink status? They should stay hardlinks, in my opinion.
## The footer currently has three sections ### Table of original (rpm) file digests, used to validate the integrity of the download in dnf. ### Table of digest → offset used when actually installing files. ### Signature 8 bytes at the end of the file, used to differentiate between traditional RPMs and extent based.
I think this magic number "signature" should vary based on the items that cause the format to change.
What happens if you try to use a transcoded RPM on a non-compatible system?
=== Notes ===
# The headers are preserved bit for bit during transcoding. This preserves signatures. The signatures cover the main header blob, and the main header blob ensures the integrity of data in two ways: ## Each file with content has a digest. Originally this was md5, but today it’s usually sha256. In normal RPM this is only used to verify the integrity of files, e.g. <code>rpm -V</code>. With CoW we use this as a content key. ## There is/are one or two digests (<code>PAYLOADDIGEST</code> and <code>PAYLOADDIGESTALT</code>) covering the payload archive (compressed cpio). The header value is preserved, but transcoded RPMs do not preserve the original structure so RPM’s pre-installation verification (controlled by <code>%_pkgverify_level</code>) will fail. <code>dnf-plugin-cow</code> disables this check in dnf because it verifies the whole file digest which is captured during download/transcoding. The second one is likely used for delta rpm. # This is untested, and possibly incompatible with delta RPM (drpm). The process for reconstructing an rpm to install from a delta is expensive from both a CPU and I/O perspective, while only providing marginal benefits on download size. It is expected that having delta rpm enabled (which is the default) will be handled gracefully.
https://github.com/rpm-software-management/rpm/pull/880 added DIGESTALT, apparently to help reduce this CPU usage problem. I don't know if it's actually used by anything, but it is much newer than I'd have guessed (October 2019).
# Disk space requirements are expected to be marginally higher than before: all new packages or updates will consume their installed size before installation instead of about half their size (regular rpms with payloads still cost space). # <code>rpm-plugin-reflink</code> will fall back to simple file copying when the destination path is not on the same filesystem/subvolume. A common example is <code>/boot</code> and/or <code>/boot/efi</code>. # The system will still work on other filesystem types, but will ''always'' fall back to simple copying. This is expected to be slightly slower than not enabling CoW because the source for copying will be the decompressed data.
Any testing to see the speed impact?
# For systems that enable transparent filesystem compression: every file will continue to be decompressed from the original rpm, and then transparently re-compressed by the filesystem. There is no effective change here. There is a future project to investigate alternate distribution mechanics to provide parallel versions of file content pre-compressed in a filesystem specific format, reducing both CPU costs and I/O. It is expected that this will result in slightly higher network utilization because filesystem compression is purposely restricted to allow random I/O. # Current implementation of <code>dnf-plugin-cow</code> is in Python, but it looks possible to implement this in <code>libdnf</code> instead which would make it work in <code>packagekit</code>.
=== Performance Metrics ===
Ballpark performance difference is about half the duration for file download+install time. A lot of rpms are very small, so it’s difficult to see/measure. Larger RPMs give much clearer signal.
(Actual numbers/charts will be supplied in Jan 2021)
Seems like a very nice optimization! Thanks for working on it!
V/r, James Cassell
On Wed, 2020-12-23 at 19:23 -0500, James Cassell wrote:
# Resolve packaging request into a list of packages and operations
# Download and '''decompress''' packages into a '''locally optimized''' rpm file
Please verify the signature on the downloaded RPM before decompressing it. (Do we do this already?)
We have an opportunity to do the verification during download, but I'm not keen on it for two major reasons:
1. The transcoder would need to open the rpmdb to perform the verification, adding a fair amount of complexity.
2. (More crucially) I observe that dnf downloads packages and signatures before asking whether to trust them. The order of events means we can't be confident that all signatures are in the rpmdb yet.
My proposal is for enabling CoW with dnf. The code change to do transcoding is in librepo as part of the generic file download mechanism. librepo does verify downloads relative to the repo's recorded digest.
The transcoder produces a different series of bits to be written to disk, so how could that verification work? Turns out the answer is easy: we see the original bits on the input to the transcoder, so we calculate the digest of the bits received from the yum server and record this in the footer of the transcoded rpm. I've modified lr_checksum_fd() in librepo to look for this before using the xattr cache or reading the whole file again. You can only locate that whole file digest if the footer itself is complete.
The digest is actually a list of digests. The default value in createrepo_c is SHA256 (irrespective of what digest algorithm is used to identify/verify files in each rpm) and for now the dnf plugin passes "SHA256" to the transcoder statically. This is ultimately repo specific. I hope to eliminate the hard coding later if there's signal within librepo to choose the right digest algo for the specific repo.
The job of actually verifying the signature falls through to rpm as it did before. As stated in the proposal: the headers (lead, signature, main header) are completely untouched, so the gpg based signature is still verified as before, and at the same point in time.
# Install and/or upgrade packages sequentially using RPM files, using '''reference linking''' (reflinking) to reuse data already on disk.
Sounds like a great improvement! Any real-world data on how much time it saves, how much it changes disk usage, or how much SSD writes it saves?
Forthcoming! I've got some numbers I've used internally at Facebook to talk about this. To do this I had to write another rpm plugin to measure how much time was spent on decompressing and writing data. I'm planning on improving this and open sourcing that too. The goal here is to produce some publicly reproducible numbers.
The outcome is intended to be the same, but the order of operations is different.
# Decompression happens inline with download. This has a positive effect on resource usage: downloads are typically limited by bandwidth. Decompression and writing the full data into a single file per rpm is essentially free. Additionally: if there is more than one download at a time, a multi-CPU system can be better utilized. All compression types supported in RPM work because this uses the rpm I/O functions.
As I referenced above, I think each chunk should also be verified before decompressing.
This is certainly possible, but not implemented. My thinking here is that the full rpm file digest enforced for files downloaded with dnf/librepo also covers this. The only optimization possible here is for a damaged rpm to fail faster during transcode. I consider this a pretty minor optimization.
# RPMs are cached on local storage between downloading and installation time as normal. This allows DNF to defer actual RPM installation to when all the RPM are available. This is unchanged. # The file format for RPMs is different with Copy on Write. The headers are identical, but the payload is different. There is also a footer. ## Files are converted (“transcoded”) locally during download using <code>/usr/bin/rpm2extents</code> (part of rpm codebase). The format is not intended to be “portable” - i.e. copying the files from the cache is not supported.
I think these should be made to be portable. How many variants of these are there? Would it be difficult to make the transcoder also understand RPMs transcoded for a different platform/setup? Eventually, I'd like to see additional signatures added to the RPM for each of the variants so RPM itself can do the verification at install time, avoiding a transcode to the "canonical" format. (I suppose this might require a build-time or sign-time transcode to each of the other variants.) Until then, I'd like to ensure that the package signatures are being verified in a secure manner, which would be necessary for the plugin to be able to install packages not built with multiple signatures/digests.
Would it be practical to just have a single format aligned to the largest page size known, leaving fs holes as necessary on systems with smaller page sizes?
I'm not keen on making the transcoded rpms portable because they're usually twice the size of the original/archive. If you want to share these between systems, I would think running a caching web proxy or an explicit internal mirror would be the more common way to do this.
Defaulting the alignment to 64k or some higher value would yield a lot of wasted space per file unless holes were used. I've not experimented with this, but it's an interesting idea.
I've covered the transcoded file digest(s) above. This is repo driven not rpm driven. RPM signatures get enforced as normal.
## Regular RPMs use a compressed .cpio based payload. In contrast, extent based RPMs contain uncompressed data aligned to the fundamental page size of the architecture, e.g. 4KiB on x86_64. This alignment is required for <code>FICLONERANGE</code> to work. Only files are represented in the payload, other directory entries like symlinks, device nodes etc are constructed entirely from rpm header information. Files are referenced by their digest, so identical files are de-duplicated.
How are hardlinks in an RPM handled? Do they stay as hardlinks or become reflinks only, losing the hardlink status? They should stay hardlinks, in my opinion.
This is a great question: Everything ends up being bit for bit identical to systems without this system enabled. Making this work was an interesting challenge, and I'm pretty happy with how it turned out.
## The footer currently has three sections ### Table of original (rpm) file digests, used to validate the integrity of the download in dnf. ### Table of digest → offset used when actually installing files. ### Signature 8 bytes at the end of the file, used to differentiate between traditional RPMs and extent based.
I think this magic number "signature" should vary based on the items that cause the format to change.
The footer contains a list of digests for the source file verification, a list of digest -> offset mappings, and the signature itself. Some kind of versioning is possible, but I've not encountered a need to cross that bridge yet. (I'm trying to avoid premature optimization when I don't have a good use case.)
What happens if you try to use a transcoded RPM on a non-compatible system?
Depends on how it got there, and what you asked for. Here's some examples:
1. cp foo.rpm /var/cache/dnf/<repo>/Packages/ && dnf install foo ...will fail the librepo full file check, and it'll be re-downloaded.
2. dnf install /root/foo.rpm || rpm -i /root/foo.rpm (not actually tested) will likely fail with a CPIO/payload error.
Note that tools like rpm2cpio and rpm2archive will also fail on transcoded rpms. I have an open task to make the dnf plugin not transcode with 'yumdownloader' or 'dnf download' (plugin) as those are reasonable commands to run. I will look at making error messages better and/or making some of these use cases work.
=== Notes ===
# The headers are preserved bit for bit during transcoding. This preserves signatures. The signatures cover the main header blob, and the main header blob ensures the integrity of data in two ways: ## Each file with content has a digest. Originally this was md5, but today it’s usually sha256. In normal RPM this is only used to verify the integrity of files, e.g. <code>rpm -V</code>. With CoW we use this as a content key. ## There is/are one or two digests (<code>PAYLOADDIGEST</code> and <code>PAYLOADDIGESTALT</code>) covering the payload archive (compressed cpio). The header value is preserved, but transcoded RPMs do not preserve the original structure so RPM’s pre-installation verification (controlled by <code>%_pkgverify_level</code>) will fail. <code>dnf-plugin-cow</code> disables this check in dnf because it verifies the whole file digest which is captured during download/transcoding. The second one is likely used for delta rpm. # This is untested, and possibly incompatible with delta RPM (drpm). The process for reconstructing an rpm to install from a delta is expensive from both a CPU and I/O perspective, while only providing marginal benefits on download size. It is expected that having delta rpm enabled (which is the default) will be handled gracefully.
https://github.com/rpm-software-management/rpm/pull/880 added DIGESTALT, apparently to help reduce this CPU usage problem. I don't know if it's actually used by anything, but it is much newer than I'd have guessed (October 2019).
I don't see a straightforward way to use DIGESTALT. I think the transcoded file-level digest is a decent way to falsify the file, and when the rpm is installed using dnf, you get a verification pass that checks the files. DIGESTALT helps provide a way to falsify a local rpm before trying to install it.
# Disk space requirements are expected to be marginally higher than before: all new packages or updates will consume their installed size before installation instead of about half their size (regular rpms with payloads still cost space). # <code>rpm-plugin-reflink</code> will fall back to simple file copying when the destination path is not on the same filesystem/subvolume. A common example is <code>/boot</code> and/or <code>/boot/efi</code>. # The system will still work on other filesystem types, but will ''always'' fall back to simple copying. This is expected to be slightly slower than not enabling CoW because the source for copying will be the decompressed data.
Any testing to see the speed impact?
Only accidentally ;) You're simply <moving> the decompression time to an earlier step, and then copying a lot more data bit by bit, so the full effect depends strongly on CPU speed relative to I/O speed. We found that overall, it was *slightly* faster.
# For systems that enable transparent filesystem compression: every file will continue to be decompressed from the original rpm, and then transparently re-compressed by the filesystem. There is no effective change here. There is a future project to investigate alternate distribution mechanics to provide parallel versions of file content pre-compressed in a filesystem specific format, reducing both CPU costs and I/O. It is expected that this will result in slightly higher network utilization because filesystem compression is purposely restricted to allow random I/O. # Current implementation of <code>dnf-plugin-cow</code> is in Python, but it looks possible to implement this in <code>libdnf</code> instead which would make it work in <code>packagekit</code>.
=== Performance Metrics ===
Ballpark performance difference is about half the duration for file download+install time. A lot of rpms are very small, so it’s difficult to see/measure. Larger RPMs give much clearer signal.
(Actual numbers/charts will be supplied in Jan 2021)
Seems like a very nice optimization! Thanks for working on it!
Thanks for the feedback! I'll try to incorporate these points into the wiki in the new year - Matthew.
On 24. 12. 20 at 22:54, Matthew Almond via devel wrote:
Depends on how it got there, and what you asked for. Here's some examples:
- cp foo.rpm /var/cache/dnf/<repo>/Packages/ && dnf install foo ...will fail the librepo full file check, and it'll be re-downloaded.
- dnf install /root/foo.rpm || rpm -i /root/foo.rpm (not actually tested) will likely fail with a CPIO/payload error
Note that tools like rpm2cpio and rpm2archive will also fail on transcoded rpms. I have an open task to make the dnf plugin not transcode with 'yumdownloader' or 'dnf download' (plugin) as those are reasonable commands to run. I will look at making error messages better and/or making some of these use cases work.
This concerns us (speaking for RPM and DNF people) a little bit. If the transcoded RPM cannot be used as a regular RPM, it probably should have a different identity, for example a different suffix than .rpm. RPM and DNF are designed for generic use cases. I see these transcoded packages as a "cache" tailored for btrfs based systems only. It would probably be good to draw a border between them.
If I understand it correctly, the transcoding happens on each host. Have you considered transcoding all RPMs in a repo on the server instead? Or would that be inefficient and increase network traffic too much?
On Tue, 2021-01-05 at 18:18 +0100, Daniel Mach wrote:
On 24. 12. 20 at 22:54, Matthew Almond via devel wrote:
Depends on how it got there, and what you asked for. Here's some examples:
- cp foo.rpm /var/cache/dnf/<repo>/Packages/ && dnf install foo ...will fail the librepo full file check, and it'll be re-downloaded.
- dnf install /root/foo.rpm || rpm -i /root/foo.rpm (not actually tested) will likely fail with a CPIO/payload error
Note that tools like rpm2cpio and rpm2archive will also fail on transcoded rpms. I have an open task to make the dnf plugin not transcode with 'yumdownloader' or 'dnf download' (plugin) as those are reasonable commands to run. I will look at making error messages better and/or making some of these use cases work.
This concerns us (speaking for RPM and DNF people) a little bit. If the transcoded RPM cannot be used as a regular RPM, it probably should have a different identity, for example a different suffix than .rpm. RPM and DNF are designed for generic use cases. I see these transcoded packages as a "cache" tailored for btrfs based systems only. It would probably be good to draw a border between them.
If I understand it correctly, the transcoding happens on each host. Have you considered transcoding all RPMs in a repo on the server instead? Or would that be inefficient and increase network traffic too much?
The transcoded RPM is a valid RPM for the rpm program and most use cases. It's got all the original headers, so querying it (-qp) works perfectly, and it's still signed; it's only produced in concert with dnf/librepo, which validates that the downloaded file is the one described in the repo.
Notably it doesn't work with rpm2cpio and rpm2archive (and yes, I'd like to have a better story there). You typically get these through 'dnf download'. When I implemented the plugin for yum, I was able to avoid transcoding, but on dnf it's not implemented yet.
Signature *verification* partially works. Everything to do with signatures on just the header works (and the header describes the payload digest). There is one specific area which needs to be fixed: regular RPMs are read, digested, and signature-verified before decompression. We need to guard against malicious compressed payloads that either perform a DoS on space/time, or worse (but more difficult) could exploit a bug in a decompression library. I am actively working on this.
The bottom line is that in every place you'd expect to see an rpm, and use it later, you have exactly the same number of things in the same place. If you want to reflink clone the dnf cache into a container, it'll work (really well). If you want to clean up the cache in some selective way, that works the same as before. The interface for the cache remains the same.
On the server side: no, this doesn't make sense: you'd save on decompression, but you'd use 2x the bandwidth and space on the server side. You'll also need to keep the original rpms for clients that don't or can't use reflinking, so it's more like 2.5x the space, which I think is unreasonable.
I do think there's some room in the future to think and talk about how repos could be changed to take better advantage of this. I got some feedback on RPM PR ( https://github.com/rpm-software-management/rpm/pull/1470#issuecomment-754025... ) which is somewhere along the lines of what I was already thinking. That said, I'm aware that I'm being ambitious with this change request, and I'm focused on trying to integrate things that have been written/exist and can be demonstrated first.
Hope these explanations help! Thanks for the feedback :)
Matthew.
On Tue, Jan 05, 2021 at 07:01:56PM +0000, Matthew Almond via devel wrote:
Signature *verification* partially works. Everything to do with signatures on just the header works (and the header describes the payload digest). There is one specific area which needs to be fixed: regular RPMs are read, digested, and signature-verified before decompression. We need to guard against malicious compressed payloads that either perform a DoS on space/time, or worse (but more difficult) could exploit a bug in a decompression library. I am actively working on this.
I just want to say, this is IMHO critical to even consider such a proposal. Signature verification should come before parsing whatever is under that signature; otherwise you risk exposing to attacks various processing code that previously assumed it is fed only trusted data. This applies to the decompression library, the actual transcoding code, and possibly much more. Even if _currently_ there are no known vulnerabilities in a particular part, it doesn't mean they won't be discovered later. Defence in depth is especially important for an update system: you don't want to end up in a situation like "oh, we've found a bug in the update system, so you need to execute the very part that is vulnerable to get it fixed".
On Mon, 2020-12-21 at 11:28 -0500, Ben Cotton wrote:
https://fedoraproject.org/wiki/Changes/RPMCoW
== Summary ==
RPM Copy on Write provides a better experience for Fedora Users as it reduces the amount of I/O and offsets CPU cost of package decompression. RPM Copy on Write uses reflinking capabilities in btrfs, which is the default filesystem in Fedora 33.
I've been communicating with the maintainer of RPM on the pull request[1] and it's become clear that this likely depends on the creation of a public, supportable API for RPM. This is not achievable within the window for Fedora 34, so I'm withdrawing the change for Fedora 34 at this time. I will continue to work on this, and expect to re-submit for Fedora 35.
Just a reminder for those interested: I'm giving a talk at CentOS Dojo on this topic on Friday at 17:00 CET[2]
Regards, Matthew.
[1] https://github.com/rpm-software-management/rpm/pull/1470#issuecomment-772410...
[2] https://hopin.com/events/centos-dojo-fosdem