Hello there,
I'm working with F20 and CentOS 7 to create some live booted images. I'm not looking to do live USB/CD media, but rather boot a server over the network with a kernel, initramfs, and squashfs. It's working well so far, but I have a filesystem issue that I can't seem to fix.
My build scripts create a 10GB sparse file and I fill that with an ext4 filesystem. I package that up into a squashfs as specified in the docs[1]. That boots just fine.
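For reference, the build steps described above can be sketched roughly like this (filenames and the label are my own; the LiveOS/rootfs.img layout is what the wiki page describes):

```shell
# Hypothetical sketch of the build steps, not the actual build scripts.
# Create a 10GB sparse file and format it as ext4:
truncate -s 10G rootfs.img
mkfs.ext4 -F -L LiveOS_rootfs rootfs.img
# The LiveOS layout expects the image at LiveOS/rootfs.img inside
# the squashfs:
mkdir -p squashfs-root/LiveOS
mv rootfs.img squashfs-root/LiveOS/rootfs.img
mksquashfs squashfs-root squashfs.img -comp xz
```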
However, if I attempt to fill the filesystem, it fills and becomes corrupt much earlier than I'd expect. I've put some log info into a gist[2].
My expectation is that if I create a 10GB live filesystem and the system has > 10GB of RAM available, I could store somewhere around 10GB of data in the live filesystem before running into a full filesystem. Is that expectation incorrect? Am I configuring something incorrectly?
Thanks for taking the time to read this far. :)
[1] https://fedoraproject.org/wiki/LiveOS_image [2] https://gist.github.com/major/d4e9f447ab942edd7952
-- Major Hayden
Hey,
I am not completely sure if it is this issue you are seeing, but a squashfs image takes more memory than the image itself, because you first need to unsquash it (in RAM) and then load the filesystem pages into RAM. Also take a look here: http://dummdida.tumblr.com/post/89051342705/taking-a-look-at-the-rootfs-foot...
- fabian
On Jul 28, 2014, at 10:55, Fabian Deutsch fabian.deutsch@gmx.de wrote:
I am not completely sure if it is this issue you are seeing, but a squashfs image takes more memory than the image itself, because you first need to unsquash it (in RAM) and then load the filesystem pages into RAM. Also take a look here: http://dummdida.tumblr.com/post/89051342705/taking-a-look-at-the-rootfs-foot...
- fabian
Hello Fabian,
That certainly helps. I also discovered why the filesystem was filling up so quickly: I didn't read the docs or the dmsquash-live-root.sh script closely enough. There's a line where a 512MB snapshot overlay is created if you don't specify your own overlay filesystem[1].
There's another thread from way back in 2009 where the overlay file is discussed along with the lack of a stackable filesystem (AUFS, unionfs, or similar)[2].
Are there any other good options for persistent data for servers booted via live boot? I'm writing some scripts now that make symlinks from certain directories back to a persistent storage volume on the host.
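A minimal sketch of that symlinking approach (the directory names and the persistent mount point are hypothetical, not my actual scripts):

```shell
# Hypothetical: point writable directories at a persistent volume
# mounted on the host at /mnt/persist.
PERSIST=/mnt/persist
for dir in /var/log /var/lib/myapp; do
    mkdir -p "$PERSIST$dir"    # ensure the backing directory exists
    rm -rf "$dir"              # drop the live-image copy
    ln -s "$PERSIST$dir" "$dir"
done
```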
[1] https://github.com/zfsonlinux/dracut/blob/master/modules.d/90dmsquash-live/d... [2] http://fedora.12.x6.nabble.com/Anyone-using-the-overlay-file-td2652964.html
-- Major Hayden
On Mon, 2014-07-28 at 09:37 -0500, Major Hayden wrote:
My expectation is that if I create a 10GB live filesystem and the system has > 10GB of RAM available, I could store somewhere around 10GB of data in the live filesystem before running into a full filesystem. Is that expectation incorrect? Am I configuring something incorrectly?
You're not quite right about how the overlay works.
The default in-memory overlay is just 512MB. And the device-mapper docs note that "if it fills up the snapshot will become useless and be disabled, returning errors."[1].
You should also note that the overlay is a block-level snapshot - so any changes to existing files or filesystem metadata will cause data to be written to the overlay. Furthermore, the default chunk size is 4kb - so any change less than 4kb will take 4kb of space.
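To put numbers on that (my own back-of-envelope arithmetic, not anything from dracut or device-mapper):

```shell
# With 4kb chunks, every distinct chunk touched costs a full chunk
# of overlay space, no matter how few bytes actually changed.
CHUNK=4096
OVERLAY=$((512 * 1024 * 1024))             # default 512MB overlay
# 1000 one-byte writes scattered across 1000 different chunks:
used=$((1000 * CHUNK))
echo "$used bytes of overlay used"          # 4096000 bytes (~4MB)
# Total chunks before the snapshot is exhausted and disabled:
echo "$((OVERLAY / CHUNK)) chunks total"    # 131072
```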
Basically, the dmsquash-live + overlay thing is a gross hack. It was designed to solve the problem of fitting an entire desktop OS on a 650MB CD-ROM, to be used as a demonstration, or for other short-lived systems (e.g. the installer). It was not designed for long-term use.
I assume you're using filesystem images 'cuz you don't have a reliable network connection (otherwise you'd probably be using NFS or iSCSI or something)?
Since your systems have lots of RAM, why not just use a regular ext4 filesystem image as your root filesystem? Then you don't need to worry about blowing up the overlay at all.
If you need compression to save bandwidth/download time: you could just xz-compress the filesystem image and uncompress it after download?
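That path could look something like this (filenames hypothetical):

```shell
# On the build host: compress the raw ext4 image for download.
xz -9 --keep rootfs.img       # produces rootfs.img.xz alongside the original
# On the booted node, after fetching rootfs.img.xz:
xz -d rootfs.img.xz           # restores rootfs.img, ready to mount
```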
If you need compression to save RAM: why not use a squashfs image directly, and mount/bind a tmpfs to the places you'll be writing data?
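For example, the writable locations could be covered with tmpfs mounts along these lines (which directories actually need to be writable, and the sizes, depend entirely on the workload; these are just common ones):

```
# Hypothetical /etc/fstab fragment: squashfs root stays read-only,
# writes land in RAM-backed tmpfs.
tmpfs  /tmp      tmpfs  defaults,size=2G  0 0
tmpfs  /var/log  tmpfs  defaults,size=1G  0 0
tmpfs  /var/tmp  tmpfs  defaults,size=1G  0 0
```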
You might even consider using btrfs, mounted with the "compress" or "compress-force" option[2], which would give you the benefit of compression *and* a normal read-write filesystem.
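In fstab terms that might look like the following (device name hypothetical):

```
# Hypothetical fstab entry: btrfs root with transparent compression.
# compress=zlib compresses where it helps; compress-force always does.
/dev/sda1  /  btrfs  defaults,compress=zlib  0 0
```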
Is there a particular reason you need to use dmsquash-live, or is this just a case of the hammer making all your problems look like nails?
-w
[1] https://www.kernel.org/doc/Documentation/device-mapper/snapshot.txt [2] https://btrfs.wiki.kernel.org/index.php/Compression
On Jul 28, 2014, at 17:11, Will Woods wwoods@redhat.com wrote:
You're not quite right about how the overlay works.
The default in-memory overlay is just 512MB. And the device-mapper docs note that "if it fills up the snapshot will become useless and be disabled, returning errors."[1].
You should also note that the overlay is a block-level snapshot - so any changes to existing files or filesystem metadata will cause data to be written to the overlay. Furthermore, the default chunk size is 4kb - so any change less than 4kb will take 4kb of space.
After hitting a wall several times today, I began to see what you're talking about here. ;)
I assume you're using filesystem images 'cuz you don't have a reliable network connection (otherwise you'd probably be using NFS or iSCSI or something)?
The network connection is reliable, but I'm working with thousands of nodes that need a largely stateless system with only a few persistent items.
Since your systems have lots of RAM, why not just use a regular ext4 filesystem image as your root filesystem? Then you don't need to worry about blowing up the overlay at all.
Are you suggesting an ext4 r/w filesystem stored in RAM? I haven't seen how to do that in dracut with the existing scripts.
If you need compression to save RAM: why not use a squashfs image directly, and mount/bind a tmpfs to the places you'll be writing data?
I'd be interested in that for sure but the dmsquash module in dracut seems to require a real ext/btrfs/xfs filesystem for device mapper. I couldn't find a way to boot a plain squashfs with a filesystem in it.
Is there a particular reason you need to use dmsquash-live, or is this just a case of the hammer making all your problems look like nails?
My goal is to live boot our servers since the majority of our systems would be stateless. Being able to reboot into a known good, tested state would be advantageous. I've worked with Debian's Live Systems project[1], and their strategy is to mount a squashfs read-only and then use AUFS to provide a writeable overlay filesystem. It's handy since filling the overlay just produces a full filesystem rather than an overflowed, disabled snapshot. However, AUFS isn't in the upstream kernel, and that makes things a bit challenging.
If there's a better strategy than using dmsquash-live in dracut, or if I need to do some work to build a new dracut module, I'm certainly up for that. Finding documentation on the internals of dracut has been a bit challenging for me so far.
-- Major Hayden
It sounds like oVirt Node does a lot of what you need and might be a good starting point; it's basically a minimal KVM hypervisor plus the associated userspace. It can be booted as a live image, PXE booted, or installed.
http://www.ovirt.org/Category:Node http://www.ovirt.org/Node_Building http://www.ovirt.org/Node_PXE
Peter
On Mon, 2014-07-28 at 21:26 -0500, Major Hayden wrote:
On Jul 28, 2014, at 17:11, Will Woods wwoods@redhat.com wrote:
Since your systems have lots of RAM, why not just use a regular ext4 filesystem image as your root filesystem? Then you don't need to worry about blowing up the overlay at all.
Are you suggesting an ext4 r/w filesystem stored in RAM? I haven't seen how to do that in dracut with the existing scripts.
Any filesystem image *should* work as a live image. dmsquash-live images get handled specially, but plain ext4 (or squashfs, or cramfs, or btrfs, or whatever) should work just as well.
See dmsquash-live-root.sh, where (if given a filesystem image) it checks the filesystem type and mounts it accordingly:
https://github.com/haraldh/dracut/blob/master/modules.d/90dmsquash-live/dmsq...
Or just try it:
root=live:http://host.your.sys/path-to/ext4.img
If this doesn't work, that's a bug in dracut, and should be filed accordingly.
-w
Major Hayden wrote:
My goal is to live boot our servers since the majority of our systems would be stateless. Being able to reboot into a known good, tested state would be advantageous.
I'm doing just that with my own custom Fedora Live spins that now run on nearly a thousand "nodes". I can kill any of them by exhausting the overlay space, but I avoid that issue by careful configuration of running services to ensure that anything[1] that gets written hits either tmpfs or my backing storage -- which in my case happens to be various forms of Flash memory cards (CF, CFast, SD, or even SSD in a few cases). Other than that data, the image is only semi-mutable. Yes, you alter things because of the overlay, but reboot and you're back to square one. This works great for us because I can apply an updated rpm or the like between spin releases yet maintain the robustness of a live image.
If you wish to discuss details and/or share lessons learned, tricks, etc. you may mail me off list if you wish.
[1] By "anything", I really only mean stuff that's being written on an on-going basis (e.g., logging) or stuff that's beneficially cached (e.g., yum data). Not only do I avoid exhausting the overlay, I also gain reliability against network outages for some use cases. This matters a lot to me since my spin is relatively use-case agnostic; it's a generic Appliance OS, and Puppet makes each node into something specific at run-time, though always starting from a common image.
-- John Florian
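A sketch of the kind of redirection John describes (paths and the flash mount point are my own guesses, not his actual config): logging goes to RAM, while the yum cache is bind-mounted from the persistent flash media.

```
# Hypothetical fstab fragment for a live-booted node:
tmpfs                 /var/log        tmpfs  size=512m  0 0
/mnt/flash/yum-cache  /var/cache/yum  none   bind       0 0
```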