On Fri, Mar 8, 2013 at 6:55 PM, Frederick Grose <fgrose@gmail.com> wrote:
On Thu, Mar 7, 2013 at 10:41 AM, <John.Florian@dart.biz> wrote:
> From: Frederick Grose <fgrose@gmail.com>
> On Wed, Mar 6, 2013 at 3:59 PM, <John.Florian@dart.biz> wrote:
<snip>
> root@aos-61:46 # # Lets now make it all go wonky:
> root@aos-61:46 # time dd if=/dev/zero of=/foo
> Bus error
>
> real    1m15.775s
> user    0m2.818s
> sys     0m24.129s
> root@aos-61:46 #
> root@aos-61:46 # ls /root
> -bash: /bin/ls: Input/output error
> root@aos-61:46 # df -h
> -bash: /usr/bin/df: Input/output error
> root@aos-61:46 # mount
> -bash: /usr/bin/mount: Input/output error
> root@aos-61:46 # cat /proc/meminfo
> -bash: /usr/bin/cat: Input/output error
>
> Is this expected?  Is there anything I can do, e.g., configuration-
> wise, that can prevent this?  Ideally this would fail much like any
> other full disk situation.  I understand that the overlay consumes
> space, i.e., memory, for this file growth, including file removals,
> but I'd at least like to be able to remotely reboot a system when in
> this state, however I can't even do that because the reboot command
> will either return the same I/O error or it may succeed but get the
> I/O error when systemd tries to read /usr/lib/systemd/system/reboot.target.
>
> I dug around in bugzilla, but found nothing there.  I can file a
> bug, but which package is likely at fault here?
> --
> John Florian
>
> See https://fedoraproject.org/wiki/LiveOS_image for some background
> and potential workarounds.
>
>         --Fred


There's really not much on that page that helps me here.  I'm trying to use Live images for a mostly-stateless embedded appliance OS deployed to hundreds or thousands of devices.  I realize that the COW design is always going to be limited, but a more graceful failure mode is really needed, somehow.  For our use, the biggest gain in stability here actually comes from systemd's journal with its trim-before-write approach instead of the legacy write-now, trim-asynchronously approach we used to have.  However, that only covers one specific use case: logging.  Writing to proper persistent storage allows me to avoid the root file system overlay, but most of these embedded devices use CF or SD cards for storage, which have limited write cycles that must be respected.

Is there a way to implement an artificial capacity limit that would prevent processes from exhausting the overlay so that the reserve might be used for recording the event and rebooting back to a safer state?

At the very least, I think this page could benefit from a little stronger, more explicit wording of this failure case.  While it talks a little about some work-arounds, it actually says very little about why they are needed.  Only in the "Overlay Recovery" section does it hint at the crash potential.

--
John Florian


Thank you for the review!  I've updated the wiki page based on your comments,
https://fedoraproject.org/wiki/LiveOS_image

Documenting that a temporary overlay is a 0.5 GiB sparse file in a RAM filesystem gave me the idea to try an overlay larger than available memory, in the hope that kernel out-of-memory warnings would intervene before the device-mapper invalidated the filesystem.

I modified /usr/sbin/dmsquash-live-root in the initramfs to create a temporary 512 GiB sparse overlay:

dd if=/dev/null of=/overlay bs=1024 count=1 seek=$((512*1024*1024)) 2> /dev/null
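
As a sanity check on the size that dd line requests, here is a quick sketch in plain shell arithmetic. The variable names are mine for illustration and do not come from dmsquash-live-root; it assumes a shell with 64-bit arithmetic.

```shell
# dd with if=/dev/null copies nothing, so the file length is bs * seek.
bs=1024
seek=$((512*1024*1024))

size_bytes=$((bs * seek))
size_gib=$((size_bytes / 1024 / 1024 / 1024))
# device-mapper counts in 512-byte sectors
sectors=$((size_bytes / 512))

echo "$size_gib GiB, $sectors sectors"
```

The sector count should agree with the total reported for the overlay by dmsetup status.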

Then, after booting an updated Fedora 18 Live desktop from a read-only LiveUSB and running your failure demo,

time dd if=/dev/zero of=/foo

I got out-of-memory warnings after a file of about 450 MiB was written and the command returned--no crash!

Some post test output:

[root@localhost ~]# dmsetup status
live-osimg-min: 0 8388608 snapshot 2584/2584 24
live-rw: 0 8388608 snapshot 921720/1073741824 3600
 

top - 18:11:53 up 17 min,  3 users,  load average: 0.68, 0.75, 0.57
Tasks: 182 total,   2 running, 180 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.6 us,  1.6 sy,  0.0 ni, 96.5 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
KiB Mem:   3339812 total,  3260284 used,    79528 free,   316384 buffers
KiB Swap:  3341308 total,        0 used,  3341308 free,  1948108 cached
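
For reading the snapshot lines in the dmsetup status output above: as I understand the kernel's device-mapper snapshot documentation, the pair is allocated/total in 512-byte sectors, followed by the metadata sectors. A small sketch that converts the allocated figure to MiB (snap_used_mib is a made-up helper name, not an existing tool):

```shell
# Print the allocated space, in MiB, from one "dmsetup status" snapshot line.
# Assumes the allocated/total pair is counted in 512-byte sectors.
snap_used_mib() {
    printf '%s\n' "$1" | awk '{
        # find the "snapshot" keyword; the next field is allocated/total
        for (i = 1; i < NF; i++)
            if ($i == "snapshot" && $(i + 1) ~ /\//) {
                split($(i + 1), a, "/")
                printf "%.1f\n", a[1] * 512 / 1048576
            }
    }'
}

snap_used_mib "live-rw: 0 8388608 snapshot 921720/1073741824 3600"
```

For the live-rw line this prints 450.1, which lines up with the roughly 450 MiB file that was written.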


You might test this method in your systems and let us know how it works.

           --Fred

Pardon my bad observations; my conclusion above IS WRONG and is not supported by the above test.

I deceived myself with an unfamiliar error message, and actually seem to have tested James Heather's method in my last test.

My root filesystem size was 4 GiB with about 450 MiB free.  An out-of-disc-space warning is what actually popped up and caused the test command to exit before another failure or crash.

To retest the oversized overlay hypothesis, I resized the LiveUSB root filesystem to 12 GiB and repeated the test on it as an attached LiveOS filesystem, /dev/mapper/dm-PCBV6p (mounted at /mnt/a).

[root@localhost a]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 24/1073741824 16

[root@localhost ~]# dmsetup table
dm-PCBV6p: 0 25165824 snapshot 7:9 7:10 P 8

[root@localhost ~]# losetup -a
/dev/loop8: [2081]:1833 (/run/media/fgrose/LIVE/LiveOS/squashfs.img)
/dev/loop9: [1800]:3 (/run/media/livemnt-squash-mRJNIA/LiveOS/rootfs.img)
/dev/loop10: [0017]:58601 (/run/media/tmpvJjuX7)
/dev/loop11: [2081]:1832 (/run/media/fgrose/LIVE/LiveOS/home.img)

[root@localhost ~]# df -Th
Filesystem            Type      Size  Used Avail Use% Mounted on
devtmpfs              devtmpfs  1.6G     0  1.6G   0% /dev
tmpfs                 tmpfs     1.6G  152K  1.6G   1% /dev/shm
tmpfs                 tmpfs     1.6G  3.3M  1.6G   1% /run
tmpfs                 tmpfs     1.6G     0  1.6G   0% /sys/fs/cgroup
/dev/sda1             ext4       18G  8.8G  7.7G  54% /
tmpfs                 tmpfs     1.6G   28K  1.6G   1% /tmp
/dev/sdc1             vfat       15G  8.8G  6.2G  59% /run/media/fgrose/LIVE
/dev/loop8            squashfs  929M  929M     0 100% /run/media/livemnt-squash-mRJNIA
/dev/mapper/dm-PCBV6p ext4       12G  3.4G  8.2G  30% /mnt/a
/dev/loop11           ext4      380M   35M  325M  10% /mnt/a/home

[root@localhost ~]# mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=1652304k,nr_inodes=413076,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
/dev/sda1 on / type ext4 (rw,relatime,seclabel,data=ordered)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=37,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
configfs on /sys/kernel/config type configfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime,seclabel)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
tmpfs on /tmp type tmpfs (rw,seclabel)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)
/dev/sdc1 on /run/media/fgrose/LIVE type vfat (rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0077,codepage=437,iocharset=ascii,shortname=mixed,showexec,utf8,flush,errors=remount-ro,uhelper=udisks2)
/dev/sdb2 on /var/cache/yum type ext4 (rw,relatime,seclabel,data=ordered)
/dev/loop8 on /run/media/livemnt-squash-mRJNIA type squashfs (ro,relatime,seclabel)
/dev/mapper/dm-PCBV6p on /mnt/a type ext4 (rw,relatime,seclabel,data=ordered)
/dev/loop11 on /mnt/a/home type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sdc1 on /mnt/a/run/initramfs/live type vfat (rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0077,codepage=437,iocharset=ascii,shortname=mixed,showexec,utf8,flush,errors=remount-ro)


The target filesystem, /dev/mapper/dm-PCBV6p, did become invalid and was switched to read-only (ro), which led to this test command output:

[root@localhost a]# time dd if=/dev/zero of=foo
dd: writing to ‘foo’: Read-only file system
4029694+0 records in
4029693+0 records out
2063202816 bytes (2.1 GB) copied, 40.0696 s, 51.5 MB/s

real 0m40.079s
user 0m3.799s
sys 0m32.422s

In a separate terminal I manually monitored the dmsetup status:

[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 184/1073741824 16
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 192/1073741824 16
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 192/1073741824 16
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 263440/1073741824 1040
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 366360/1073741824 1440
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 526608/1073741824 2064
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 703280/1073741824 2752
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 904600/1073741824 3528
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1131568/1073741824 4416
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1383288/1073741824 5392
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1579432/1073741824 6160
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1579432/1073741824 6160
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 1731568/1073741824 6752
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2191040/1073741824 8536
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2420840/1073741824 9432
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2632232/1073741824 10256
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2632288/1073741824 10256
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 2964616/1073741824 11544
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot 3208632/1073741824 12496
[root@localhost ~]# dmsetup status
dm-PCBV6p: 0 25165824 snapshot Invalid
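
The manual polling above could be scripted along these lines. This is only a sketch: snap_allocated and watch_snapshot are hypothetical helper names, the one-second interval is arbitrary, and the loop needs root and a real device-mapper snapshot to do anything.

```shell
# Print the allocated sector count from a snapshot status line,
# or "Invalid" once the snapshot has been invalidated.
snap_allocated() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i < NF; i++)
            if ($i == "snapshot") {
                if ($(i + 1) == "Invalid") { print "Invalid"; exit }
                split($(i + 1), a, "/")
                print a[1]
                exit
            }
    }'
}

# Poll the named device once per second until it reports Invalid.
watch_snapshot() {
    while :; do
        line=$(dmsetup status "$1") || return 1
        state=$(snap_allocated "$line")
        echo "$(date +%T) $state"
        [ "$state" = "Invalid" ] && break
        sleep 1
    done
}
```

Running watch_snapshot dm-PCBV6p as root in the second terminal would log one timestamped reading per second and stop at the invalidation.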

[root@localhost ~]# df -Th
Filesystem            Type      Size  Used Avail Use% Mounted on
devtmpfs              devtmpfs  1.6G     0  1.6G   0% /dev
tmpfs                 tmpfs     1.6G  504K  1.6G   1% /dev/shm
tmpfs                 tmpfs     1.6G 1000M  632M  62% /run
tmpfs                 tmpfs     1.6G     0  1.6G   0% /sys/fs/cgroup
/dev/sda1             ext4       18G  8.7G  7.7G  54% /
tmpfs                 tmpfs     1.6G   68K  1.6G   1% /tmp
/dev/sdc1             vfat       15G  8.8G  6.2G  59% /run/media/fgrose/LIVE
/dev/loop8            squashfs  929M  929M     0 100% /run/media/livemnt-squash-mRJNIA
/dev/mapper/dm-PCBV6p ext4       12G  4.6G  7.1G  40% /mnt/a
/dev/loop11           ext4      380M   35M  325M  10% /mnt/a/home

The invalidation occurred at the 1.6 GB size limit applied to the /run tmpfs, which holds the backing file for the overlay device, /dev/loop10:
[root@localhost ~]# losetup /dev/loop10
/dev/loop10: [0017]:58601 (/run/media/tmpvJjuX7)


[root@localhost ~]# ls /mnt/a
ls: cannot access /mnt/a/.readahead: Input/output error

top - 00:26:06 up 13 min,  4 users,  load average: 0.55, 0.68, 0.44
Tasks: 204 total,   2 running, 202 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us,  1.6 sy,  0.0 ni, 92.4 id,  1.3 wa,  0.3 hi,  0.0 si,  0.0 st
KiB Mem:   3339812 total,  3256176 used,    83636 free,    68956 buffers
KiB Swap:  3341308 total,        0 used,  3341308 free,  2312664 cached

Notice that Swap was not activated, but free memory got down to ~82 MiB (83636 KiB).
When I tested the above on the booted LiveUSB, 2-3 GiB of swap was activated before the fatal crash.

So an oversized overlay DOES NOT prevent device-mapper invalidation by the above test method.

       --Fred