Hi all!
I'm new on this list. I work on Qubes OS, where Fedora is used as a base distribution.
While trying to build the installation image in a reproducible manner[1], I found that the current installation image has an unusual layout. Quoting the dracut.cmdline manual page:
squashfs.img          |  Squashfs from LiveCD .iso downloaded via network
   !(mount)
   /LiveOS
       |- rootfs.img  |  Filesystem image to mount read-only
            !(mount)
            /bin      |  Live filesystem
            /boot     |
            /dev      |
            ...       |
This rootfs.img layer makes the image build very much unreproducible. Why is it even there? A bare squashfs.img layer should be enough; overlayfs can then be mounted over it (I see there is even some partial support for that in dmsquash-live). Most other live systems I've seen use just squashfs + overlayfs (or aufs if the kernel is older), so it's a commonly tested configuration. I *guess* it's there for historical reasons, from before aufs/overlayfs were available. Is there any other reason for it?
If there is no other reason, I propose to drop this and have the installer/live filesystem directly in squashfs.img. This has multiple benefits:
- it's much easier to make the image build process reproducible (see below)
- less complexity, both in the build and in the boot (the whole dmsquash-live dracut module can be replaced with a <20 line function[2])
- smaller initramfs (which is extremely important if it needs to be included in efiboot.img, which can't be larger than 32MB)
- slightly faster boot time (device-mapper is slow)
What do you think?
As for reproducibility, I've made changes to lorax (including dropping the rootfs.img layer), anaconda, pungi and createrepo, and all this allows building a bit-for-bit identical image given the same input (rpm packages, pungi configuration, $SOURCE_DATE_EPOCH variable[3]). Well, almost: there is an issue with efiboot.img, but I already have a solution, I just haven't pushed it yet.
You can find all the pull requests collected here: https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
I'll work further to get the changes merged upstream.
[1] https://reproducible-builds.org/
[2] https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be8e1...
[3] https://reproducible-builds.org/specs/source-date-epoch/
On Thu, Oct 11, 2018 at 6:37 PM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
Hi all!
I'm new on this list. I work on Qubes OS, where Fedora is used as a base distribution.
While trying to build the installation image in reproducible manner[1], I found the current installation image have unusual layout. Quoting dracut.cmdline manual page:
squashfs.img | Squashfs from LiveCD .iso downloaded via network !(mount) /LiveOS |- rootfs.img | Filesystem image to mount read-only !(mount) /bin | Live filesystem /boot | /dev | ... |
This rootfs.img layer makes the image build very much unreproducible. Why is it even there? Bare squashfs.img layer should be enough. Then, mount overlayfs over it (I see there is even some partial support for it in dmsquash-live). Most other Live systems I've seen use just squashfs + overlayfs (or aufs if kernel is older), so it's commonly tested configuration. I *guess* it's there for historical reason, from before aufs/overlayfs being available. Is there any other reason for that?
I'm pretty sure the original reason was that the default live install used dd to block copy the root file system into the fedora-root LV, and then resized the LV and ext4 file system. There have also been a number of squashfs improvements since that decision, so there might have been limitations in squashfs that ext4 didn't have (I'm thinking xattrs were long supported in ext4 before squashfs, and maybe capabilities?)
If there is no other reason, I propose to drop this and have installer/live filesystem directly in squashfs.img. This have multiple benefits:
- it's much easier to make the image build process reproducible (see below)
- less complexity, both in the build and in the boot (the whole dmsquash-live dracut module can be replaced with <20 line function[2]
- smaller initramfs (which is extremely important if needed to be included in efiboot.img, which can't be larger than 32MB)
- slightly faster boot time (device-mapper is slow)
What do you think?
Whatever we do should take into account the persistent root and persistent home use cases, specifically: https://github.com/livecd-tools/livecd-tools/blob/master/tools/livecd-iso-to...
--overlay-size-mb --home-size-mb
A particular criticism of the device-mapper solution currently being used is in that script: it blows up. It's literally WORM: deleting files simply dereferences them and doesn't free up pool space, so it is inevitable that the pool will fill up. When it does, it blows up the file system, which can't be repaired. All you can do is reset the overlay, which means deleting all changes and starting over.
At least one of our spins, SOAS, depends on livecd-iso-to-disk for creating their final installation because it's predicated on running Fedora SOAS from a stick.
Why does efiboot.img have a 32MiB limit?
As for the reproducibility, I've made changes to lorax (including dropping rootfs.img layer), anaconda, pungi and createrepo and this all allows to build bit-by-bit identical image, given the same input (rpm packages, pungi configuration, $SOURCE_DATE_EPOCH variable[3]). Well, almost - there is an issue with efiboot.img, but I already have a solution, just not pushed it yet.
You can find all the pull requests collected here: https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
I'll work further to make the changes merged upstream.
[1] https://reproducible-builds.org/ [2] https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be8e1... [3] https://reproducible-builds.org/specs/source-date-epoch/
Cool! Well you've already done most of the work and if this has support elsewhere already then I'm in favor of continuing in that direction.
I did give all of these things some thought a long time ago when I ran into a lorax hack by Will Woods who used Btrfs as the root.img file system, I'm not sure why it was used. But it gave me the idea of using a few features built into Btrfs specifically for this use case:
- seed/sprout feature can be used with zram block device for volatile overlay; and used with a blank partition on the stick for persistent overlay. Discovery is part of the btrfs kernel code.
- Since metadata and data is always checksummed on every read, we wouldn't have to depend on the slow and transient ISO checksum (rd.live.check which uses checkisomd5) which likewise breaks when creating a stick with livecd-iso-to-disk.
- Btrfs supports zstd compression. I did some testing and squashfs is still a bit more efficient because it compresses fs metadata, whereas Btrfs only compresses data extents.
The gotcha here is the resulting image isn't going to be bit for bit reproducible: UUIDs and time stamps are strewn throughout the file system (similar to ext4 and XFS), but any sufficiently complex file system is going to have this problem. Off hand I'm not sure how squashfs would get around it since it's going to draw from an ext4 source (not sure if the ephemeral root could be tmpfs and use it as the source for mksquashfs?)
On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
On Thu, Oct 11, 2018 at 6:37 PM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
Hi all!
I'm new on this list. I work on Qubes OS, where Fedora is used as a base distribution.
While trying to build the installation image in reproducible manner[1], I found the current installation image have unusual layout. Quoting dracut.cmdline manual page:
squashfs.img | Squashfs from LiveCD .iso downloaded via network !(mount) /LiveOS |- rootfs.img | Filesystem image to mount read-only !(mount) /bin | Live filesystem /boot | /dev | ... |
This rootfs.img layer makes the image build very much unreproducible. Why is it even there? Bare squashfs.img layer should be enough. Then, mount overlayfs over it (I see there is even some partial support for it in dmsquash-live). Most other Live systems I've seen use just squashfs + overlayfs (or aufs if kernel is older), so it's commonly tested configuration. I *guess* it's there for historical reason, from before aufs/overlayfs being available. Is there any other reason for that?
I'm pretty sure the original reason was the default live install use dd to block copy the root file system into the fedora-root LV, and then resized the LV and ext4 file system.
How is it done now?
There have also been a number of squashfs improvements since that decision so there might have been limitations with squashfs that ext4 didn't have (I'm thinking xattr were long supported in ext4 before squashfs, and maybe capabilities?)
If there is no other reason, I propose to drop this and have installer/live filesystem directly in squashfs.img. This have multiple benefits:
- it's much easier to make the image build process reproducible (see below)
- less complexity, both in the build and in the boot (the whole dmsquash-live dracut module can be replaced with <20 line function[2]
- smaller initramfs (which is extremely important if needed to be included in efiboot.img, which can't be larger than 32MB)
- slightly faster boot time (device-mapper is slow)
What do you think?
Whatever we do should take into account the persistent root and persistent home use cases, specifically: https://github.com/livecd-tools/livecd-tools/blob/master/tools/livecd-iso-to...
--overlay-size-mb --home-size-mb
A particular criticism of the device-mapper solution currently being used is in that script: it blows up. Literally it's WORM, and deleting files simply dereferences them, it doesn't free up pool space, so it is inevitable that the pool will fill up, and when it does it blows up the file system, and it can't be repaired. All you can do is reset the overlay which means deleting all changes and starting over.
At least one of our spins, SOAS, depends on livecd-iso-to-disk for creating their final installation because it's predicated on running Fedora SOAS from a stick.
Why does efiboot.img have a 32MiB limit?
Because "32MB should be enough for everybody"... Long story short, the "El Torito" boot catalog structure has a 16-bit field for the image size (expressed in 512-byte sectors). For details see here: https://wiki.osdev.org/El-Torito https://web.archive.org/web/20180112220141/https://download.intel.com/suppor... (page 10)
Full story: https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806
I've spent a lot of time debugging this, because mkisofs doesn't complain about it; it just silently overflows the higher bits into the adjacent field, which gives weird results depending on where you boot it. Adding isohybrid to the picture doesn't make it easier (there, the higher bits are truncated, or actually not copied to the MBR partition table, as they weren't part of the original field).
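To make the arithmetic behind that limit concrete, here is a small Python sketch. It assumes only what is stated above (an unsigned 16-bit sector count in 512-byte sectors); the function names are made up for illustration, and the truncation helper models only the "higher bits dropped" case, not how mkisofs spills bits into the adjacent field.

```python
SECTOR_SIZE = 512   # El Torito counts 512-byte "virtual" sectors
FIELD_BITS = 16     # width of the sector-count field in the boot catalog


def max_image_bytes() -> int:
    """Largest image size the 16-bit sector count can describe."""
    return ((1 << FIELD_BITS) - 1) * SECTOR_SIZE


def sectors_needed(image_bytes: int) -> int:
    """Number of 512-byte sectors needed for the image, rounded up."""
    return -(-image_bytes // SECTOR_SIZE)


def fits(image_bytes: int) -> bool:
    """True if the sector count is representable in the field."""
    return sectors_needed(image_bytes) < (1 << FIELD_BITS)


def truncated_sector_count(image_bytes: int) -> int:
    """Value left in the field if the higher bits are simply dropped."""
    return sectors_needed(image_bytes) & ((1 << FIELD_BITS) - 1)
```

So the real ceiling is one sector short of 32 MiB, and an image that is too big can end up being advertised as a much smaller one instead of being rejected.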
As for the reproducibility, I've made changes to lorax (including dropping rootfs.img layer), anaconda, pungi and createrepo and this all allows to build bit-by-bit identical image, given the same input (rpm packages, pungi configuration, $SOURCE_DATE_EPOCH variable[3]). Well, almost - there is an issue with efiboot.img, but I already have a solution, just not pushed it yet.
You can find all the pull requests collected here: https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
I'll work further to make the changes merged upstream.
[1] https://reproducible-builds.org/ [2] https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be8e1... [3] https://reproducible-builds.org/specs/source-date-epoch/
Cool! Well you've already done most of the work and if this has support elsewhere already then I'm in favor of continuing in that direction.
I did give all of these things some thought a long time ago when I ran into a lorax hack by Will Woods who used Btrfs as the root.img file system, I'm not sure why it was used. But it gave me the idea of using a few features built into Btrfs specifically for this use case:
- seed/sprout feature can be used with zram block device for volatile overlay; and used with a blank partition on the stick for persistent overlay. Discovery is part of the btrfs kernel code.
- Since metadata and data is always checksummed on every read, we wouldn't have to depend on the slow and transient ISO checksum (rd.live.check which uses checkisomd5) which likewise breaks when creating a stick with livecd-iso-to-disk.
- Btrfs supports zstd compression. I did some testing and squashfs is still a bit more efficient because it compresses fs metadata, whereas Btrfs only compresses data extents.
The gotcha here is the resulting image isn't going to be bit for bit reproducible: UUIDs and time stamps are strewn throughout the file system (similar to ext4 and XFS), but any sufficiently complex file system is going to have this problem.
I wouldn't worry about _file_ timestamps that much; in most cases this is a solvable problem with an elaborate enough find+touch[4]. But that's obviously not all: there are various timestamps in the superblock and other metadata. The most problematic part in "normal" filesystems that use a kernel driver is inode allocation, block allocation etc. This depends greatly on timing, ordering, the specific kernel version etc. See [5] for details.
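For the file timestamp part, the find+touch idea can be sketched in Python roughly as follows. This is only an illustration, not what any of the tools mentioned here actually do: clamp_mtimes is a hypothetical helper, and the epoch would normally come from $SOURCE_DATE_EPOCH.

```python
import os


def clamp_mtimes(root, epoch):
    """Set the mtime of every entry under root that is newer than epoch
    back to epoch. Returns the number of entries changed."""
    # Symlinks can only be touched without following them on platforms
    # that support it (Linux does).
    can_touch_links = os.utime in os.supports_follow_symlinks
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)

    changed = 0
    for path in paths:
        st = os.lstat(path)
        if st.st_mtime <= epoch:
            continue  # already old enough, leave it alone
        if os.path.islink(path):
            if not can_touch_links:
                continue  # cannot touch the link itself here
            os.utime(path, (epoch, epoch), follow_symlinks=False)
        else:
            os.utime(path, (epoch, epoch))
        changed += 1
    return changed


# Typical call, assuming the variable is set by the build environment:
#   clamp_mtimes("/path/to/tree", int(os.environ["SOURCE_DATE_EPOCH"]))
```

Clamping (rather than setting everything to the same value) keeps older, already stable timestamps intact. As said above, this covers only file timestamps, not superblock fields or allocation order.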
Off hand I'm not sure how squashfs would get around it since it's going to draw from an ext4 source (not sure if the ephemeral root could be tmpfs and use it as the source for mksquashfs?)
mksquashfs 5.0-rc1 has support for clamping mtime to the $SOURCE_DATE_EPOCH variable[3]. And the other metadata is already reproducible in mksquashfs 4.3 (I think files are sorted, or a similar approach is taken).
TBH, there is also a tool to build an ext4 filesystem reproducibly, without using the kernel driver. It's make_ext4 from the OpenWrt project. But I still think it would be better to drop that layer anyway.
[4] https://reproducible-builds.org/docs/archives/
[5] https://reproducible-builds.org/docs/system-images/
On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
I'm pretty sure the original reason was the default live install use dd to block copy the root file system into the fedora-root LV, and then resized the LV and ext4 file system.
How is it done now?
On Live media installs, anaconda does:
rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/ --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id /mnt/install/source/ /mnt/sysimage
On DVD and netinstalls, I'm guessing based on packaging.log that it's a dnf+rpm installation even though I never see a dnf or rpm process in either top or ps. In any case, the rpm packages are directly on the iso9660 file system, not baked into the
Why does efiboot.img have a 32MiB limit?
Because "32MB should be enough for everybody"... Long story short, "El Torito" boot catalog structure have 16-bit field for image size (expressed in 512-bytes sectors). For details see here: https://wiki.osdev.org/El-Torito https://web.archive.org/web/20180112220141/https://download.intel.com/suppor... (page 10)
OK. On Fedora 28 media, efiboot.img is ~9.2 MiB and does not contain either the kernel or initramfs. The kernel and initramfs are found on the iso9660 file system at images/pxeboot/ and also at isolinux/, where GRUB UEFI uses the former and isolinux BIOS uses the latter. Both initrds are 65M, so they're already too big to go into efiboot.img, and they kinda need to be, because this particular initramfs is built by dracut with the --no-hostonly flag so that hopefully we can boot anything. (Curiously, the initramfs is 65M on DVD/netinstall and 50M on LiveOS; I don't have an explanation for that. I'm looking at Fedora 28 release images.)
From my understanding, efiboot.img would only need to contain shim, grubia32, grubx64 and supporting bootloader-only files.
BTW, trivia: Fedora's installer creates EFI System partitions that are always FAT16. So far as I know, no computer has complained, only humans. FAT12/16 is OK for removable media but the spec pretty clearly expects FAT32 for ESPs on permanent installs. The installer team doesn't want to use mkfs flags, they expect the defaults to work unless they don't work, and they do work, so FAT16 it is.
Full story: https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806
I've spent a lot of time debugging this, because mkisofs doesn't complain about it, just silently overflow higher bits to adjacent field, which results in weird results, depending on where you boot it. Adding isohybrid to the picture doesn't make it easier (there, higher bits are truncated, or actually not copied to the MBR partition table, as wasn't part of the original field).
I think we're stuck with isohybrid for a while. Having UEFI and BIOS bootloaders, along with isohybrid supporting both as well as Macs, all on one media image, that can be burned to optical media and written to a USB stick - is hugely beneficial.
The compose process takes about 12 hours. That's every ISO for all the editions, and the spins, and the VM images, for all archs. Having separate UEFI and BIOS images, or splitting out Macs with their own image, would increase compose times and complexity across the board. I'm not sure which happens first: the end of optical media booting support, or dropping support for BIOS and/or old Apple EFI Macs (only this year did they start using UEFI, rather than their own variant of Intel EFI pre-UEFI, so it'll take some time to see how that shakes out, which also involves whether and how Secure Boot can ever be supported on Macs).
This talks a bit about isohybrid and all the very clever hacks involved to make Fedora boot practically anything with a single ISO 9660 image. (I'm being x86_64 arch specific when I say that.)
https://mjg59.dreamwidth.org/11285.html
I did give all of these things some thought a long time ago when I ran into a lorax hack by Will Woods who used Btrfs as the root.img file system, I'm not sure why it was used. But it gave me the idea of using a few features built into Btrfs specifically for this use case:
- seed/sprout feature can be used with zram block device for volatile overlay; and used with a blank partition on the stick for persistent overlay. Discovery is part of the btrfs kernel code.
- Since metadata and data is always checksummed on every read, we wouldn't have to depend on the slow and transient ISO checksum (rd.live.check which uses checkisomd5) which likewise breaks when creating a stick with livecd-iso-to-disk.
- Btrfs supports zstd compression. I did some testing and squashfs is still a bit more efficient because it compresses fs metadata, whereas Btrfs only compresses data extents.
The gotcha here is the resulting image isn't going to be bit for bit reproducible: UUIDs and time stamps are strewn throughout the file system (similar to ext4 and XFS), but any sufficiently complex file system is going to have this problem.
I wouldn't worry about _files_ timestamps that much - in most cases this is solvable problem by elaborate enough find+touch[4]. But that's not all obviously, there are various timestamps in superblock, and other metadata. The most problematic part in "normal" filesystems, using kernel driver is inode allocation, block allocation etc. This greatly depends on timing, ordering, specific kernel version etc. See [5] for details.
mkfs.btrfs has --rootdir and --shrink features to pre-allocate a volume with files at mkfs time; I have no idea to what degree it depends on kernel code. The main benefit with this is it's really easy to implement full checksum matching for metadata and data on every read, and user space ends up with EIO instead of corrupt data, and super clear kernel complaints. And such corruption whether on optical or USB sticks, is common. Even the more rare case of a stick that passes md5 checksum, can later have transient and silent corruption that ends up showing up in weird ways.
It's plausible squashfs could implement this, I think by default it already checksums every file to look for duplicates, but it doesn't retain the per file hash for integrity checking later on. It's also possible with dm-verity or dm-integrity but then that adds back the dm complexity.
On Fri, 2018-10-12 at 15:44 -0600, Chris Murphy wrote:
On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
I'm pretty sure the original reason was the default live install use dd to block copy the root file system into the fedora-root LV, and then resized the LV and ext4 file system.
How is it done now?
On Live media installs, anaconda does:
rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/ --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id /mnt/install/source/ /mnt/sysimage
On DVD and netinstalls, I'm guessing based on packaging.log that it's a dnf+rpm installation even though I never see a dnf or rpm process in either top or ps. In any case, the rpm packages are directly on the iso9660 file system, not baked into the
anaconda uses dnf's python interface, it does not *run* 'dnf'.
https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/payload/dnfpa...
On Fri, 2018-10-12 at 15:53 -0700, Adam Williamson wrote:
On Fri, 2018-10-12 at 15:44 -0600, Chris Murphy wrote:
On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
I'm pretty sure the original reason was the default live install use dd to block copy the root file system into the fedora-root LV, and then resized the LV and ext4 file system.
How is it done now?
On Live media installs, anaconda does:
rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/ --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id /mnt/install/source/ /mnt/sysimage
On DVD and netinstalls, I'm guessing based on packaging.log that it's a dnf+rpm installation even though I never see a dnf or rpm process in either top or ps. In any case, the rpm packages are directly on the iso9660 file system, not baked into the
anaconda uses dnf's python interface, it does not *run* 'dnf'.
https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/payload/dnfpa...
Yep, but the DNF Python code still actually runs in a Python subprocess. This is needed because apparently something during the package installation transaction, most likely RPM, does a chroot. If the DNF code ran directly in the Anaconda process, Anaconda would get chrooted as well and BAD THINGS (TM) would happen. Bad things ranging from missing icons to GTK crashing due to files it uses suddenly vanishing.
--
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net
_______________________________________________
devel mailing list -- devel@lists.fedoraproject.org
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
Why does efiboot.img have a 32MiB limit?
Because "32MB should be enough for everybody"... Long story short, "El Torito" boot catalog structure have 16-bit field for image size (expressed in 512-bytes sectors). For details see here: https://wiki.osdev.org/El-Torito https://web.archive.org/web/20180112220141/https://download.intel.com/suppor... (page 10)
OK. On Fedora 28 media, efiboot.img is ~9.2 MiB and does not contain either the kernel or initramfs.
I know, this particular problem was specific to Qubes OS, where kernel+initramfs needed to be on the ESP because of a Xen+EFI limitation (basically the kernel needs to be loaded through UEFI instead of by grub, so it needs to live on something that UEFI understands). And actually recent Xen versions don't have this limitation anymore (at least in theory...). This is just a bit of context on how it all got here, and much less relevant today.
(...)
Full story: https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806
I've spent a lot of time debugging this, because mkisofs doesn't complain about it, just silently overflow higher bits to adjacent field, which results in weird results, depending on where you boot it. Adding isohybrid to the picture doesn't make it easier (there, higher bits are truncated, or actually not copied to the MBR partition table, as wasn't part of the original field).
I think we're stuck with isohybrid for a while. Having UEFI and BIOS bootloaders, along with isohybrid supporting both as well as Macs, all on one media image, that can be burned to optical media and written to a USB stick - is hugely beneficial.
I have no problem with isohybrid alone. It's a major hack, but definitely worth it.
The compose process takes about 12 hours. That every ISO for all the editions, and the spins, and the VM images, for all archs. Even having separate UEFI and BIOS images, or splitting out Macs with their own image, it'll increase compose times and complexity across the board.
And also complexity for the users - which image to download. I totally understand why it is beneficial.
(...)
I did give all of these things some thought a long time ago when I ran into a lorax hack by Will Woods who used Btrfs as the root.img file system, I'm not sure why it was used. But it gave me the idea of using a few features built into Btrfs specifically for this use case:
- seed/sprout feature can be used with zram block device for volatile overlay; and used with a blank partition on the stick for persistent overlay. Discovery is part of the btrfs kernel code.
- Since metadata and data is always checksummed on every read, we wouldn't have to depend on the slow and transient ISO checksum (rd.live.check which uses checkisomd5) which likewise breaks when creating a stick with livecd-iso-to-disk.
- Btrfs supports zstd compression. I did some testing and squashfs is still a bit more efficient because it compresses fs metadata, whereas Btrfs only compresses data extents.
The gotcha here is the resulting image isn't going to be bit for bit reproducible: UUIDs and time stamps are strewn throughout the file system (similar to ext4 and XFS), but any sufficiently complex file system is going to have this problem.
I wouldn't worry about _files_ timestamps that much - in most cases this is solvable problem by elaborate enough find+touch[4]. But that's not all obviously, there are various timestamps in superblock, and other metadata. The most problematic part in "normal" filesystems, using kernel driver is inode allocation, block allocation etc. This greatly depends on timing, ordering, specific kernel version etc. See [5] for details.
mkfs.btrfs has --rootdir and --shrink features to pre-allocate a volume with files at mkfs time; I have no idea to what degree it depends on kernel code.
Probably not at all, given it works as non-root user too. I've tried to run it twice on the same directory (and with the same --uuid) on 32MB of data and got different images (~2000 lines of hexdump diff). Could be some timestamps, could be something else.
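A hexdump diff works, but for narrowing down where two builds diverge a byte-range summary is often easier to read. Below is a minimal Python sketch of such a comparison; differing_ranges is a hypothetical helper and not the tool used for the test above.

```python
def differing_ranges(path_a, path_b, block_size=4096):
    """Compare two image files block by block and return a list of
    (offset, length) ranges that differ, merging adjacent blocks."""
    ranges = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            length = max(len(a), len(b))
            if a != b:
                if ranges and ranges[-1][0] + ranges[-1][1] == offset:
                    # Extend the previous range instead of starting a new one
                    ranges[-1] = (ranges[-1][0], ranges[-1][1] + length)
                else:
                    ranges.append((offset, length))
            offset += length
    return ranges
```

A few small ranges near the start would point at superblock-style fields (UUIDs, timestamps), while differences scattered across the whole image would suggest allocation order.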
The main benefit with this is it's really easy to implement full checksum matching for metadata and data on every read, and user space ends up with EIO instead of corrupt data, and super clear kernel complaints. And such corruption whether on optical or USB sticks, is common. Even the more rare case of a stick that passes md5 checksum, can later have transient and silent corruption that ends up showing up in weird ways.
It's plausible squashfs could implement this, I think by default it already checksums every file to look for duplicates, but it doesn't retain the per file hash for integrity checking later on.
Indeed it looks that way. I'm able to make a one-byte modification to the image file that results in different files (diff -r), but no read error. I wonder if integrity checking is something on the squashfs roadmap...
It's also possible with dm-verity or dm-integrity but then that adds back the dm complexity.
Oh, please, no...
There are two almost separate aspects here:
- image layout (squashfs+ext4, squashfs alone, squashfs+btrfs)
- how copy-on-write is achieved (dm-snapshot, overlay fs)
For reproducibility, squashfs alone is the best option, but it does not improve integrity checking (though it doesn't make it worse either). For integrity checking, squashfs+btrfs may be better, but it doesn't help much with reproducibility. It may even make it worse, because mkfs.btrfs also produces non-reproducible results, while make_ext4 (not to be confused with mkfs.ext4!) is reproducible. Its not being packaged for Fedora is only a small issue here.
As for copy-on-write, dm-snapshot is quite complex to set up and requires the underlying FS to support writes. It also doesn't allow writing more data than the original image size (which may be an issue for the persistent partition case). Overlay fs, on the other hand, works with any underlying fs, and you can write as much data as you want. And in the case of a persistent partition, you can access that data even if the base image (the lower layer) is unavailable/broken. I think the only downside of overlay fs is that when you modify a large file, it gets copied in full to the upper layer. But I don't think that's an issue in this use case.
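For comparison, the whole overlay fs setup comes down to a single mount call with three directories. The Python sketch below only builds that command and does not run it (mounting needs root); the helper name and all paths in the example are hypothetical, and it is not the function from [2].

```python
def overlay_mount_cmd(lower, upper, work, target):
    """Build the mount(8) command line for an overlayfs root.

    lower  - read-only base tree (e.g. the mounted squashfs)
    upper  - writable directory that receives all changes
    work   - scratch directory; must be on the same filesystem as upper
    target - where the merged tree appears
    """
    for path in (lower, upper, work):
        # Commas separate mount options, so they cannot appear unescaped
        if "," in path:
            raise ValueError(f"unsupported character in path: {path!r}")
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, target]


# Volatile overlay: upper/work on tmpfs. Persistent overlay: upper/work on
# a partition of the stick. The squashfs side is identical in both cases.
#   overlay_mount_cmd("/run/rootfsbase", "/run/overlay/upper",
#                     "/run/overlay/work", "/sysroot")
```

Since the upper directory is an ordinary directory tree, it stays readable on its own even when the lower layer is missing.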
For me, overlay fs is a clear winner here. But as for image layout, it isn't that simple. For reproducibility, squashfs alone is better. But if the goal of this change is also to improve detection of read errors, then it isn't that clear anymore. It may be that a simple mkfs.btrfs patch would make it reproducible, but that isn't obvious to me at this stage. Also, keeping two layers looks like unnecessary complexity.
What do you think about sidestepping this discussion a little and replacing dm-snapshot with overlay fs regardless of other changes here? That should be doable without any change to image format and will give more flexibility there. Then, it could be even made to support both 1-layer and 2-layer formats at the same time (depending on rootfs.img presence). Something that isn't possible with dm-snapshot right now.
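Supporting both formats could be as simple as checking for rootfs.img once the squashfs is mounted. A rough Python sketch of that decision follows; detect_layout is a hypothetical name, and the LiveOS/rootfs.img path is taken from the dracut.cmdline layout quoted at the start of the thread.

```python
import os


def detect_layout(squashfs_mountpoint):
    """Return "two-layer" if the mounted squashfs contains
    LiveOS/rootfs.img, otherwise "single-layer"."""
    rootfs_img = os.path.join(squashfs_mountpoint, "LiveOS", "rootfs.img")
    if os.path.isfile(rootfs_img):
        # Old format: rootfs.img must be loop-mounted and used as lowerdir
        return "two-layer"
    # New format: the squashfs itself is the root tree and the lowerdir
    return "single-layer"
```

Either way the result is just a different lower directory for the same overlay mount.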
On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
mkfs.btrfs has --rootdir and --shrink features to pre-allocate a volume with files at mkfs time; I have no idea to what degree it depends on kernel code.
Probably not at all, given it works as non-root user too. I've tried to run it twice on the same directory (and with the same --uuid) on 32MB of data and got different images (~2000 lines of hexdump diff). Could be some timestamps, could be something else.
There is the volume UUID, which is what --uuid affects. But there are other UUIDs, including the chunk UUID, which gets repeated in every leaf and node along with the volume UUID and device UUID; each file tree (subvolume) gets its own UUID, etc. Time stamps include atime, otime, mtime, and ctime. Some objects have all 0's for the UUID, and some items have only 0.0 for times. I'll float the reproducibility question on the Btrfs list: whether it's desirable and useful, and how difficult it is. I think subsetting Btrfs features to reduce complexity generally, and therefore increase reproducibility as a consequence, has merit.
It's also possible with dm-verity or dm-integrity but then that adds back the dm complexity.
Oh, please, no...
Haha...
There are two almost separate aspects here:
- image layout (squashfs+ext4, squashfs alone, squashfs+btrfs)
- how copy-on-write is achieved (dm-snapshot, overlay fs)
ext4 alone, and btrfs alone are also viable. But since ext4 has no compression, image size grows by maybe a factor of 2. Btrfs supports lzo and zlib compression since forever, and zstd since kernel 4.14, same as squashfs. What's been missing is mksquashfs with zstd support, which I imagine will be in 5.0. The compression ratio compares well with xz currently being used by mksquashfs in Fedora composes, but with much less CPU to compress and decompress. So I'd say go with zstd in any case.
For reproducibility, squashfs alone is the best option, but does not improve integrity checking (but also doesn't make it worse).
I'm not able to estimate how much work it is to add a files hash manifest to squashfs, and to always use it on reads, and then add some error handling to EIO upon any mismatch. But yeah it'd need user space code in mksquashfs and also kernel code to support it.
As for copy-on-write, dm-snapshot is quite complex to set up and requires the underlying FS to support writes. Also, it doesn't allow writing more data than the original image size (may be an issue for the persistent partition case). Overlay fs, on the other hand, works with any underlying fs, and you can write as much data as you want. And in the case of a persistent partition, you can access that data even if the base image (the lower layer) is unavailable/broken. I think the only downside of overlay fs is that when you modify a large file, it gets copied in full to the upper layer. But I don't think that's an issue in this use case.
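For comparison, the overlay fs setup described here might look roughly like this. The helper name overlay_opts and all paths are assumptions for illustration, not what dracut or anaconda use.

```sh
# Hypothetical helper: build the option string for an overlayfs mount.
# $1 = read-only lower layer, $2 = writable upper dir, $3 = work dir
# (upper and work dirs must live on the same writable filesystem).
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$1" "$2" "$3"
}

# Example use as root (not run here):
#   mount -t squashfs -o ro,loop squashfs.img /run/lower
#   mount -t tmpfs tmpfs /run/rw
#   mkdir -p /run/rw/upper /run/rw/work
#   mount -t overlay overlay \
#       -o "$(overlay_opts /run/lower /run/rw/upper /run/rw/work)" /sysroot
```

Swapping the tmpfs for a partition on the USB stick would give the persistent case, with the written data readable on its own even when the lower layer is missing.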
For me, overlay fs is a clear winner here. But as for image layout, it isn't that simple. For reproducibility, squashfs alone is better. But if the goal of this change were also to improve read error detection, then it isn't that clear anymore. It may be that it takes a simple mkfs.btrfs patch to make it reproducible, but that isn't obvious to me at this stage. Also, keeping two layers looks like unnecessary complexity.
I agree. Overlayfs works fine with any of the discussed filesystems. I'd give a slight edge to Btrfs seed+sprout as the overlay mechanism in the case of persistence on a USB stick: a) checksumming b) compression helps improve performance of USB flash drives and reduces wear c) kernel discovers both seed and sprout in early boot by sprout uuid alone, no special mount options needed for setup. But it's a really minor point because a) and b) are still possible with overlayfs with a new independent btrfs as the upperdir.
What do you think about sidestepping this discussion a little and replacing dm-snapshot with overlay fs regardless of other changes here? That should be doable without any change to image format and will give more flexibility there.
Agreed. What I can't tell you off hand is if livecd-iso-to-disk would be affected by this in some way; or whether the change policy applies. But I think it's better to file the change so there's awareness and coordination: installer team would have to sign off on the pull request for lorax, and then releng team probably should know about it because they define their own compose settings (I guess they often use upstreams defaults but they don't have to), and then QA might want a heads up so if things blow up they know who to ask what's up, and then it's also a good idea to let SOAS folks know about it. And a central point of filing changes is coordination.
https://fedoraproject.org/wiki/Changes/Policy
On Sat, Oct 13, 2018 at 8:17 PM Chris Murphy lists@colorremedies.com wrote:
On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
mkfs.btrfs has --rootdir and --shrink features to pre-allocate a volume with files at mkfs time; I have no idea to what degree it depends on kernel code.
Probably not at all, given it works as non-root user too. I've tried to run it twice on the same directory (and with the same --uuid) on 32MB of data and got different images (~2000 lines of hexdump diff). Could be some timestamps, could be something else.
There is the volume UUID, which is what --uuid affects. But there are other uuids, including the chunk uuid which gets repeated in every leaf and node along with the volume uuid, the device uuid, each file tree (subvolume) gets its own uuid, etc. Time stamps include atime, otime, mtime, and ctime. Some objects have all 0's for uuid, and some items have only 0.0 for times. I'll float the reproducibility question on the Btrfs list: whether it's desirable, useful, and how difficult it is. I think subsetting Btrfs features to reduce complexity generally, and therefore increase reproducibility as a consequence of that, has merit.
This is a really interesting idea...
It's also possible with dm-verity or dm-integrity but then that adds back the dm complexity.
Oh, please, no...
Haha...
This made me giggle a bit. :)
There are two almost separate aspects here:
- image layout (squashfs+ext4, squashfs alone, squashfs+btrfs)
- how copy-on-write is achieved (dm-snapshot, overlay fs)
ext4 alone, and btrfs alone are also viable. But since ext4 has no compression, image size grows by maybe a factor of 2. Btrfs supports lzo and zlib compression since forever, and zstd since kernel 4.14, same as squashfs. What's been missing is mksquashfs with zstd support, which I imagine will be in 5.0. The compression ratio compares well with xz currently being used by mksquashfs in Fedora composes, but with much less CPU to compress and decompress. So I'd say go with zstd in any case.
squashfs has supported zstd along with btrfs since kernel 4.14. zstd support was mainlined into squashfs-tools a year ago: https://github.com/plougher/squashfs-tools/commit/6113361316d5ce5bfdc118d188...
However, there's been no releases since the migration from CVS on SF to Git on GitHub.
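Since zstd support depends on which squashfs-tools build is installed, a compose script could probe for it and fall back to xz. This is a sketch only: pick_compressor is a made-up name, and it assumes the mksquashfs help text lists the available compressors by name.

```sh
# Hypothetical helper: read mksquashfs help text on stdin and print "zstd"
# if that compressor is listed, otherwise fall back to "xz".
pick_compressor() {
    if grep -qw zstd; then
        echo zstd
    else
        echo xz
    fi
}

# Example use (not run here):
#   comp=$(mksquashfs -help 2>&1 | pick_compressor)
#   mksquashfs rootdir squashfs.img -comp "$comp"
```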
For reproducibility, squashfs alone is the best option, but does not improve integrity checking (but also doesn't make it worse).
I'm not able to estimate how much work it is to add a files hash manifest to squashfs, and to always use it on reads, and then add some error handling to EIO upon any mismatch. But yeah it'd need user space code in mksquashfs and also kernel code to support it.
As for copy-on-write, dm-snapshot is quite complex to set up and requires the underlying FS to support writes. Also, it doesn't allow writing more data than the original image size (may be an issue for the persistent partition case). Overlay fs, on the other hand, works with any underlying fs, and you can write as much data as you want. And in the case of a persistent partition, you can access that data even if the base image (the lower layer) is unavailable/broken. I think the only downside of overlay fs is that when you modify a large file, it gets copied in full to the upper layer. But I don't think that's an issue in this use case.
For me, overlay fs is a clear winner here. But as for image layout, it isn't that simple. For reproducibility, squashfs alone is better. But if the goal of this change were also to improve read error detection, then it isn't that clear anymore. It may be that it takes a simple mkfs.btrfs patch to make it reproducible, but that isn't obvious to me at this stage. Also, keeping two layers looks like unnecessary complexity.
I agree. Overlayfs works fine with any of the discussed filesystems. I'd give a slight edge to Btrfs seed+sprout as the overlay mechanism in the case of persistence on a USB stick: a) checksumming b) compression helps improve performance of USB flash drives and reduces wear c) kernel discovers both seed and sprout in early boot by sprout uuid alone, no special mount options needed for setup. But it's a really minor point because a) and b) are still possible with overlayfs with a new independent btrfs as the upperdir.
What do you think about sidestepping this discussion a little and replacing dm-snapshot with overlay fs regardless of other changes here? That should be doable without any change to image format and will give more flexibility there.
Agreed. What I can't tell you off hand is if livecd-iso-to-disk would be affected by this in some way; or whether the change policy applies. But I think it's better to file the change so there's awareness and coordination: installer team would have to sign off on the pull request for lorax, and then releng team probably should know about it because they define their own compose settings (I guess they often use upstreams defaults but they don't have to), and then QA might want a heads up so if things blow up they know who to ask what's up, and then it's also a good idea to let SOAS folks know about it. And a central point of filing changes is coordination.
As the upstream for livecd-tools[1] (and thus livecd-iso-to-disk), I'd be very interested in changes to support both Btrfs seed+sprout and Btrfs+OverlayFS combinations.
[1]: https://github.com/livecd-tools/livecd-tools
On Sat, Oct 13, 2018 at 6:24 PM, Neal Gompa ngompa13@gmail.com wrote:
On Sat, Oct 13, 2018 at 8:17 PM Chris Murphy lists@colorremedies.com wrote:
On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
mkfs.btrfs has --rootdir and --shrink features to pre-allocate a volume with files at mkfs time; I have no idea to what degree it depends on kernel code.
Probably not at all, given it works as non-root user too. I've tried to run it twice on the same directory (and with the same --uuid) on 32MB of data and got different images (~2000 lines of hexdump diff). Could be some timestamps, could be something else.
There is the volume UUID, which is what --uuid affects. But there are other uuids, including the chunk uuid which gets repeated in every leaf and node along with the volume uuid, the device uuid, each file tree (subvolume) gets its own uuid, etc. Time stamps include atime, otime, mtime, and ctime. Some objects have all 0's for uuid, and some items have only 0.0 for times. I'll float the reproducibility question on the Btrfs list: whether it's desirable, useful, and how difficult it is. I think subsetting Btrfs features to reduce complexity generally, and therefore increase reproducibility as a consequence of that, has merit.
This is a really interesting idea...
https://lore.kernel.org/linux-btrfs/CAJCQCtTPwQnzwkpk=4ZsZXfWTC7HymYETxp-9xU...
squashfs has supported zstd along with btrfs since kernel 4.14. zstd support was mainlined into squashfs-tools a year ago: https://github.com/plougher/squashfs-tools/commit/6113361316d5ce5bfdc118d188...
However, there's been no releases since the migration from CVS on SF to Git on GitHub.
Ahh I missed that. And looking at koji, it seems like squashfs-tools are currently FTBFS on Fedora 29. I have F29 but squashfs-tools-4.3-16.fc28.x86_64.
OK, so it sounds to me like the current proposals for this thread as it relates to installer images for Fedora 30:
- Drop devicemapper in favor of overlayfs
- Drop squashfs+ext4 images in favor of squashfs only image
- Maybe move to zstd in the squashfs image
I think part of the feature/change proposal should be building an example LiveOS image in copr so we can get an idea of how to blow it up, and ask QA to run it through OpenQA tests and see what sorts of things break there.
Neal, any ideas who could be a co-owner of the feature with Marek and help navigate the Fedora process? Maybe someone on the Anaconda or releng teams?
On Sun, Oct 14, 2018 at 1:58 PM Chris Murphy lists@colorremedies.com wrote:
On Sat, Oct 13, 2018 at 6:24 PM, Neal Gompa ngompa13@gmail.com wrote:
On Sat, Oct 13, 2018 at 8:17 PM Chris Murphy lists@colorremedies.com wrote:
On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
mkfs.btrfs has --rootdir and --shrink features to pre-allocate a volume with files at mkfs time; I have no idea to what degree it depends on kernel code.
Probably not at all, given it works as non-root user too. I've tried to run it twice on the same directory (and with the same --uuid) on 32MB of data and got different images (~2000 lines of hexdump diff). Could be some timestamps, could be something else.
There is the volume UUID, which is what --uuid affects. But there are other uuids, including the chunk uuid which gets repeated in every leaf and node along with the volume uuid, the device uuid, each file tree (subvolume) gets its own uuid, etc. Time stamps include atime, otime, mtime, and ctime. Some objects have all 0's for uuid, and some items have only 0.0 for times. I'll float the reproducibility question on the Btrfs list: whether it's desirable, useful, and how difficult it is. I think subsetting Btrfs features to reduce complexity generally, and therefore increase reproducibility as a consequence of that, has merit.
This is a really interesting idea...
https://lore.kernel.org/linux-btrfs/CAJCQCtTPwQnzwkpk=4ZsZXfWTC7HymYETxp-9xU...
I'm interested to see how that thread turns out... It's a tempting idea, because it gives you so much more flexibility. Installation onto a disk could be a "btrfs send" and overlay changes could be easily flattened on top of the target system. It'd also be much cheaper and lighter for supporting the live environment.
squashfs has supported zstd along with btrfs since kernel 4.14. zstd support was mainlined into squashfs-tools a year ago: https://github.com/plougher/squashfs-tools/commit/6113361316d5ce5bfdc118d188...
However, there's been no releases since the migration from CVS on SF to Git on GitHub.
Ahh I missed that. And looking at koji, it seems like squashfs-tools are currently FTBFS on Fedora 29. I have F29 but squashfs-tools-4.3-16.fc28.x86_64.
OK, so it sounds to me like the current proposals for this thread as it relates to installer images for Fedora 30:
- Drop devicemapper in favor of overlayfs
- Drop squashfs+ext4 images in favor of squashfs only image
- Maybe move to zstd in the squashfs image
I think part of the feature/change proposal should be building an example LiveOS image in copr so we can get an idea of how to blow it up, and ask QA to run it through OpenQA tests and see what sorts of things break there.
Neal, any ideas who could be a co-owner of the feature with Marek and help navigate the Fedora process? Maybe someone on the Anaconda or releng teams?
Brian C. Lane from the Weldr team is probably the guy to work with on this. He is the chief developer of Lorax, which is where livemedia-creator comes from. I've CC'd him to this email.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Sun, Oct 14, 2018 at 12:21 PM, Neal Gompa ngompa13@gmail.com wrote:
On Sun, Oct 14, 2018 at 1:58 PM Chris Murphy lists@colorremedies.com wrote:
On Sat, Oct 13, 2018 at 6:24 PM, Neal Gompa ngompa13@gmail.com wrote:
This is a really interesting idea...
https://lore.kernel.org/linux-btrfs/CAJCQCtTPwQnzwkpk=4ZsZXfWTC7HymYETxp-9xU...
I'm interested to see how that thread turns out... It's a tempting idea, because it gives you so much more flexibility. Installation onto a disk could be a "btrfs send" and overlay changes could be easily flattened on top of the target system. It'd also be much cheaper and lighter for supporting the live environment.
Ha! I just realized after all this time that the Btrfs wiki does not make clear how to make a sprout, even though it mentions the more esoteric recursive seed.[1] Of course you can mkfs.btrfs, mount it, and send/receive. But send requires read only snapshots. Making a sprout is easier, you just remove the seed device. This is supported since 2009.
# losetup -r /dev/loop0 root.img
# mount /dev/loop0 /mnt/
# btrfs device add /dev/sda3 /mnt
# mount -o remount,rw /mnt
# btrfs device remove /dev/loop0 /mnt
And now it replicates extents from seed to sprout. The copy is faster than pvmove, rsync, dd, or rpm-ostree deploy.
OK so let's say you have a USB stick 'sdb' and internal drive 'sda'. And the stick already has a Fedora LiveOS imaged on it, only change is the root.img is a Btrfs seed. The simplistic systemd pre-mount and mount look like:
# losetup -r /dev/loop0 root.img
# mount -t btrfs /dev/loop0 /
# btrfs device add /dev/zram1 /
# mount -t btrfs -o remount,rw /
- now you have a live overlay in RAM; user can start using this LiveOS environment including making changes like installing software; setting up non-volatile persistence on the stick looks like:
# btrfs device add /dev/sdb3 /
# btrfs device remove /dev/zram1 /
# echo 1 > /sys/class/zram-control/hot_remove
- now the extents on zram1 are moved from zram1 to sdb3 (the stick); setting up an installation to the internal drive 'sda' by "flattening" as you say, merely means adding the internal drive to the mounted Btrfs volume and removing all others:
# btrfs device add /dev/sda3 /
# btrfs device remove /dev/sdb3 /
# btrfs device remove /dev/loop0 /
- now extents on sdb3 (stick) and loop0 (seed) are copied to sda3 (internal), including any changes the user is making while all of this is happening. In fact, the user does not even have to reboot because once the operation finishes, and the loop is torn down, the stick is not in use by the kernel. The user can just unplug the stick and keep working. A spin or downstream could very sanely, and straightforwardly build a no-UI OS installation.
It's not obvious that 'btrfs device add' incorporates a mkfs and that you can now just delete the ro seed. Also not obvious is the 'dev add' on an ro mounted seed causes a new volume UUID to be generated. This is immediately discovered by libblkid. The kernel knows that this new volume is a two device (or three device, whatever the case is) btrfs and which devices they are. And this is such basic btrfs handling code that GRUB and extlinux Btrfs code understand it.
[1] https://btrfs.wiki.kernel.org/index.php/Seed-device
On Sun, Oct 14, 2018 at 04:35:53PM -0600, Chris Murphy wrote: [..]
Ha! I just realized after all this time that the Btrfs wiki does not make clear how to make a sprout, even though it mentions the more esoteric recursive seed.[1] Of course you can mkfs.btrfs, mount it, and send/receive. But send requires read only snapshots. Making a sprout is easier, you just remove the seed device. This is supported since 2009.
# losetup -r /dev/loop0 root.img
# mount /dev/loop0 /mnt/
# btrfs device add /dev/sda3 /mnt
# mount -o remount,rw /mnt
# btrfs device remove /dev/loop0 /mnt
And now it replicates extents from seed to sprout. The copy is faster than pvmove, rsync, dd, or rpm-ostree deploy.
This sounds great!
I just tried it (on Fedora 29), but those steps don't work for me:
# cryptsetup --readonly luksOpen /dev/nbd0p4 tmp
# mount -o noatime /dev/mapper/tmp /mnt/tmp
# mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
# btrfs device add /dev/nbd1 /mnt/tmp
Performing full device TRIM /dev/nbd1 (4.00GiB) ...
ERROR: error adding device '/dev/nbd1': Read-only file system
Am I missing something?
Best regards Georg
On Fri, Jan 04, 2019 at 09:27:33PM +0100, Georg Sauthoff wrote:
On Sun, Oct 14, 2018 at 04:35:53PM -0600, Chris Murphy wrote: [..]
And now it replicates extents from seed to sprout. The copy is faster than pvmove, rsync, dd, or rpm-ostree deploy.
This sounds great!
I just tried it (on Fedora 29), but those steps don't work for me:
# cryptsetup --readonly luksOpen /dev/nbd0p4 tmp
# mount -o noatime /dev/mapper/tmp /mnt/tmp
# mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
# btrfs device add /dev/nbd1 /mnt/tmp
Performing full device TRIM /dev/nbd1 (4.00GiB) ...
ERROR: error adding device '/dev/nbd1': Read-only file system
Am I missing something?
Ok, a necessary condition for creating a sprout is setting the seed parameter on the source filesystem (via btrfstune). [1]
(with the seed parameter a mount of that FS is automatically read-only)
Thus, this works for me:
# cryptsetup luksOpen /dev/nbd0p4 tmp
# btrfstune -S 1 /dev/mapper/tmp
# mount -o noatime /dev/mapper/tmp /mnt/tmp
# mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
# btrfs device add /dev/nbd1 /mnt/tmp
Performing full device TRIM /dev/nbd1 (2.80GiB) ...
# mount -o remount,rw /mnt/tmp
# time btrfs device remove /dev/mapper/tmp /mnt/tmp
# umount /mnt/tmp
Best regards Georg Sauthoff
[1]: btrfstune is also mentioned in the previously referenced https://btrfs.wiki.kernel.org/index.php/Seed-device article
On Wed, Jan 16, 2019 at 10:42 AM Georg Sauthoff mail@georg.so wrote:
On Fri, Jan 04, 2019 at 09:27:33PM +0100, Georg Sauthoff wrote:
On Sun, Oct 14, 2018 at 04:35:53PM -0600, Chris Murphy wrote: [..]
And now it replicates extents from seed to sprout. The copy is faster than pvmove, rsync, dd, or rpm-ostree deploy.
This sounds great!
I just tried it (on Fedora 29), but those steps don't work for me:
# cryptsetup --readonly luksOpen /dev/nbd0p4 tmp
# mount -o noatime /dev/mapper/tmp /mnt/tmp
# mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
# btrfs device add /dev/nbd1 /mnt/tmp
Performing full device TRIM /dev/nbd1 (4.00GiB) ...
ERROR: error adding device '/dev/nbd1': Read-only file system
Am I missing something?
Ok, a necessary condition for creating a sprout is setting the seed parameter on the source filesystem (via btrfstune). [1]
(with the seed parameter a mount of that FS is automatically read-only)
Thus, this works for me:
# cryptsetup luksOpen /dev/nbd0p4 tmp
# btrfstune -S 1 /dev/mapper/tmp
# mount -o noatime /dev/mapper/tmp /mnt/tmp
# mount: /mnt/tmp: WARNING: device write-protected, mounted read-only.
# btrfs device add /dev/nbd1 /mnt/tmp
Performing full device TRIM /dev/nbd1 (2.80GiB) ...
# mount -o remount,rw /mnt/tmp
# time btrfs device remove /dev/mapper/tmp /mnt/tmp
# umount /mnt/tmp
Yeah, sorry, I made the assumption that "the seed" is already flagged with btrfstune. If it isn't flagged as seed and is mounted rw, replication still happens; however, the first device has its signature wiped, and the second device inherits the same fs UUID. The use case here is live migration from one device to another.
In the seed/sprout use case the seed is not wiped (so it can be an on-going source), and the sprout gets a new fs UUID assigned.
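Putting the corrected steps together, the seed/sprout sequence can be written down as a dry run that only prints the commands. The function name sprout_plan and the device names in the usage line are placeholders, not anything from btrfs-progs.

```sh
# Hypothetical dry-run helper: print (do not execute) the commands that flag
# a Btrfs filesystem as a seed and replicate it onto a new device.
# $1 = seed device, $2 = target device, $3 = mount point
sprout_plan() {
    seed=$1 target=$2 mnt=$3
    echo "btrfstune -S 1 $seed"            # flag the source as a seed
    echo "mount $seed $mnt"                # a seed always mounts read-only
    echo "btrfs device add $target $mnt"   # the sprout gets a new fs UUID
    echo "mount -o remount,rw $mnt"
    echo "btrfs device remove $seed $mnt"  # copies extents; seed stays intact
}
```

For example, `sprout_plan /dev/loop0 /dev/sda3 /mnt` prints the five commands in order, which can be reviewed before running them as root.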
On Sun, Oct 14, 2018 at 02:21:47PM -0400, Neal Gompa wrote:
Neal, any ideas who could be a co-owner of the feature with Marek and help navigate the Fedora process? Maybe someone on the Anaconda or releng teams?
Brian C. Lane from the Weldr team is probably the guy to work with on this. He is the chief developer of Lorax, which is where livemedia-creator comes from. I've CC'd him to this email.
Thanks, Marek and I are already in touch :) As long as overlayfs can do what we need the bulk of the extra work needs to be done in anaconda-dracut.
We may also want to make this switch an option for a bit, while we work out the details.
On Mon, Oct 15, 2018 at 08:30:39AM -0700, Brian C. Lane wrote:
On Sun, Oct 14, 2018 at 02:21:47PM -0400, Neal Gompa wrote:
Neal, any ideas who could be a co-owner of the feature with Marek and help navigate the Fedora process? Maybe someone on the Anaconda or releng teams?
Brian C. Lane from the Weldr team is probably the guy to work with on this. He is the chief developer of Lorax, which is where livemedia-creator comes from. I've CC'd him to this email.
Thanks, Marek and I are already in touch :) As long as overlayfs can do what we need the bulk of the extra work needs to be done in anaconda-dracut.
FWIW the change to anaconda-dracut to support _only_ overlayfs is quite small: https://github.com/marmarek/qubes-installer-qubes-os/commit/332be8e1e3e10060...
And it drops dependency on dmsquash-live module.
We may also want to make this switch an option for a bit, while we work out the details.
Support for both layouts will be more tricky, because of the split between anaconda-dracut and dmsquash-live. Integrating (parts of?) the latter in the former would make it much easier. But IMO it's worth making it support both layouts, at least for now.
Anyway, can somebody help me with the change proposal? For example, I'm not sure if this is a "Self Contained" or "System Wide" Change, or what specifically should be listed in "Scope". If IRC would be more appropriate for such a discussion, that's fine with me too.
On Mon, Oct 15, 2018 at 05:52:29PM +0200, Marek Marczykowski-Górecki wrote:
Anyway, can somebody help me with the change proposal? For example, I'm not sure if this is a "Self Contained" or "System Wide" Change, or what specifically should be listed in "Scope". If IRC would be more appropriate for such a discussion, that's fine with me too.
I would suggest system-wide, since every edition and spin relies on the installer.
"Scope" should cover everything you're changing, and everyone who is impacted in some way (whether they need to be directly involved, or are impacted and need to change something, or whether they just might like to be aware).
On Fri, Oct 12, 2018 at 3:44 PM, Chris Murphy lists@colorremedies.com wrote:
On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki marmarek@invisiblethingslab.com wrote:
On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
I'm pretty sure the original reason was that the default live install used dd to block copy the root file system into the fedora-root LV, and then resized the LV and ext4 file system.
How is it done now?
On Live media installs, anaconda does:
rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/ --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id /mnt/install/source/ /mnt/sysimage
On DVD and netinstalls, I'm guessing based on packaging.log that it's a dnf+rpm installation even though I never see a dnf or rpm process in either top or ps. In any case, the rpm packages are directly on the iso9660 file system, not baked into the
One other thing that really hogs system resources for some reason, is one of the loopback mount devices, I think loop1 which is root.img, hogs nearly 100% CPU for the duration of the installation for LiveOS media. I don't know why, but it might be worth benchmarking nbd based mounts for comparison. The installation turns my computers into hair dryers. The installation process bottleneck should be reading the compressed root image, not CPU.
On Thu, Oct 11, 2018, at 8:37 PM, Marek Marczykowski-Górecki wrote:
Hi all!
I'm new on this list. I work on Qubes OS, where Fedora is used as a base distribution.
Tangentially: Qubes is very cool and I'm glad you find Fedora useful as a base system. I work on Fedora CoreOS and have patches in a lot of OS components; lorax, systemd, etc. If there's something blocking you feel free to reach out and I may be able to spend some time to help. Also, if you decide to investigate using rpm-ostree for the Qubes dom0 - I'd be very interested in helping.
I *guess* it's there for historical reason, from before aufs/overlayfs being available. Is there any other reason for that?
There is one thing to consider; overlayfs-on-overlayfs is not supported, so this would break podman/docker out of the box. But we could probably have them fall back to the vfs backend; it's not like we really care a lot about performance in the live media case.
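One way the vfs fallback could be wired up is to look at the type of the root filesystem and choose the storage driver from that. This is only a sketch: pick_storage_driver is a made-up helper, and it assumes the live root shows up with type "overlay" in /proc/mounts.

```sh
# Hypothetical helper: pick a container storage driver from the type of the
# root filesystem. $1 = a file in /proc/mounts format (default: /proc/mounts).
pick_storage_driver() {
    mounts=${1:-/proc/mounts}
    # The last entry mounted on "/" is the one currently visible.
    root_type=$(awk '$2 == "/" { t = $3 } END { print t }' "$mounts")
    if [ "$root_type" = "overlay" ]; then
        echo vfs       # avoid stacking overlayfs on an overlayfs root
    else
        echo overlay
    fi
}

# Example use (not run here):
#   podman --storage-driver "$(pick_storage_driver)" info
```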
One side note: for OSTree-based systems we have some built-in support for the read-only media case: https://github.com/ostreedev/ostree/commit/ff6883ca0655ac8844cd783caf6a7d881... which then changed to use systemd: https://github.com/ostreedev/ostree/commit/05d0ee5cbecd1287b87d38e969862a5d8... That still has /etc and /sysroot be read-only though. Extending to overlayfs for / would allow e.g. `rpm-ostree install` to work as well if the upperdir is on a writable layer.