On Mon, May 2, 2011 at 4:46 PM, Howard Powell <hbp4c(a)virginia.edu> wrote:
Hi -
I've been using the livecd set of tools to build a pxeboot image for a set
of compute nodes in our local HPC environment. The livecd project has
allowed me to make all of the compute nodes diskless, and any software
errors are trivial to fix (just reboot).
I've run into one problem with my image: if any process on a node writes a
large amount of data to /tmp (somewhere around 0.5GiB or more in one
operation), the root filesystem fails and the node must be rebooted.
Creating the image is as simple as:
# LANG=C livecd-creator --config=/local/nodes/hyades-nodes.cfg \
    --fslabel=hyades -t /local/nodes/
# livecd-iso-to-pxeboot /local/nodes/hyades.iso
The exact error caused during the I/O operation on a compute node is logged
as:
May 2 16:11:32 eth-c31.cluster kernel: device-mapper: snapshots:
Invalidating snapshot: Unable to allocate exception.
May 2 16:11:32 eth-c31.cluster syslogd: /var/log/messages: Read-only file
system
May 2 16:11:32 eth-c31.cluster kernel: Buffer I/O error on device dm-0,
logical block 997925
May 2 16:11:32 eth-c31.cluster kernel: lost page write due to I/O error on
dm-0
May 2 16:11:32 eth-c31.cluster kernel: Aborting journal on device dm-0.
May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head:
freeing b_committed_data
May 2 16:11:32 eth-c31.cluster last message repeated 5 times
May 2 16:11:32 eth-c31.cluster kernel: journal commit I/O error
May 2 16:11:32 eth-c31.cluster kernel: ext3_abort called.
May 2 16:11:32 eth-c31.cluster kernel: EXT3-fs error (device dm-0):
ext3_journal_start_sb: Detected aborted journal
May 2 16:11:32 eth-c31.cluster kernel: Remounting filesystem read-only
May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head:
freeing b_committed_data
May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head:
freeing b_committed_data
May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head:
freeing b_frozen_data
May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head:
freeing b_frozen_data
May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head:
freeing b_committed_data
May 2 16:11:32 eth-c31.cluster kernel: __journal_remove_journal_head:
freeing b_frozen_data
May 2 16:11:43 eth-c31.cluster kernel: printk: 259144 messages suppressed.
May 2 16:11:43 eth-c31.cluster kernel: Buffer I/O error on device dm-0,
logical block 737
May 2 16:11:43 eth-c31.cluster kernel: lost page write due to I/O error on
dm-0
May 2 16:11:43 eth-c31.cluster kernel: Buffer I/O error on device dm-0,
logical block 115035
May 2 16:11:43 eth-c31.cluster kernel: lost page write due to I/O error on
dm-0
Googling for information suggests that the device underlying the filesystem
is running out of space, which explains why the filesystem crashes. df
reports that the / filesystem should have space:
[root@c31 ~]# df -h
/dev/mapper/live-rw 6.0G 1.2G 4.8G 19% /
I've adjusted the "part / --size 6144" parameter in my kickstart file, but I
see no effect other than that the size df reports changes to match what I
specify. Writing a file larger than about 512MB to /tmp still crashes the
filesystem, even though the space is reported as available.
Each compute node has 32GB of system memory, and is running an x86_64
kernel.
I'm open to any suggestions on how to fix this issue.
Thanks!
Howard
I'm not familiar with livecd-iso-to-pxeboot, but a standard LiveOS image
places /tmp in a tmpfs. See
http://git.fedorahosted.org/git/?p=spin-kickstarts.git;a=blob;f=fedora-li...
You may try adjusting that line in /etc/rc.d/init.d/livesys, if that fits
your situation.
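
The "Unable to allocate exception" message in your log usually means the
device-mapper snapshot behind live-rw ran out of overlay space. That would
not show up in df, which only reports the size of the filesystem on top of
the snapshot. As a rough sketch for checking this, assuming the root device
really is a snapshot target named live-rw (as your df output suggests) and
that "dmsetup status" reports allocated/total sectors in its fourth field,
something like the following might help. The function name snapshot_pct is
only mine for illustration.

```shell
# Print the percentage of a snapshot's exception store that is in use,
# given one line of `dmsetup status` output for a snapshot target.
# Assumed line format: <start> <length> snapshot <allocated>/<total> ...
snapshot_pct() {
    # Pull out field 4 when the target type (field 3) is "snapshot".
    usage=$(printf '%s\n' "$1" | awk '$3 == "snapshot" { print $4 }')
    case "$usage" in
        */*)     echo $(( ${usage%/*} * 100 / ${usage#*/} )) ;;  # allocated * 100 / total
        Invalid) echo "invalid" ;;   # snapshot already overflowed
        *)       echo "unknown" ;;   # not a snapshot line
    esac
}

# On a node (needs root):
#   snapshot_pct "$(dmsetup status live-rw)"
```

If the number climbs toward 100 while you write to /tmp, the overlay rather
than the 6GB filesystem is what is filling up.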
--Fred