Hello,
I'm a little bit confused about the io_timeout option for sanlock. I increased the io_timeout to 30 seconds, but it seems like the overall initialization becomes slower now. libvirtd is the client through the sanlock plugin. sanlock runs as "sanlock daemon -R 1 -o 30"
Restarting sanlock + libvirtd takes about 60 seconds before libvirtd acquires the lease (or at least, before libvirtd starts responding).
After a reboot, for some reason, this delay increases up to 360 seconds... I have no idea why this would take longer...
The libvirtd guys don't seem to know why this happens... so I hope to find an answer on this list... I didn't find any other timeouts that are configurable. From the source code, it looks like most of the timeouts are based on the io_timeout.
Best regards, Frido
---------- Forwarded message ----------
From: Daniel P. Berrange <berrange@redhat.com>
Date: Tue, Mar 13, 2012 at 3:54 PM
Subject: Re: [libvirt-users] libvirt with sanlock
To: Frido Roose <fr.roose@gmail.com>
Cc: libvirt-users@redhat.com
On Tue, Mar 13, 2012 at 03:42:36PM +0100, Frido Roose wrote:
Hello,
I configured libvirtd with the sanlock lock manager plugin:
# rpm -qa | egrep "libvirt-0|sanlock-[01]"
libvirt-lock-sanlock-0.9.4-23.el6_2.4.x86_64
sanlock-1.8-2.el6.x86_64
libvirt-0.9.4-23.el6_2.4.x86_64

# egrep -v "^#|^$" /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1
disk_lease_dir = "/var/lib/libvirt/sanlock"
host_id = 4

# mount | grep sanlock
/dev/mapper/kvm--shared-sanlock on /var/lib/libvirt/sanlock type gfs2 (rw,noatime,hostdata=jid=0)

# cat /etc/sysconfig/sanlock
SANLOCKOPTS="-R 1 -o 30"
I increased the sanlock io_timeout to 30 seconds (default = 10), because the sanlock dir is on a GFS2 volume and can be blocked for some time while fencing and journal recovery take place. With the default sanlock io timeout, I get lease timeouts because IO is blocked:

Mar 5 15:37:14 raiti sanlock[5858]: 3318 s1 check_our_lease warning 79 last_success 3239
Mar 5 15:37:15 raiti sanlock[5858]: 3319 s1 check_our_lease failed 80
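As an aside, the numbers in those log lines line up with a renewal-failure window of 8 * io_timeout (a relation David Teigland confirms later in this thread); a quick sanity check, assuming the default io_timeout of 10:

```python
# The log above counts seconds since the last successful renewal:
# "warning 79 last_success 3239" at 3318, then "failed 80" at 3319.
io_timeout = 10
id_renewal_fail_seconds = 8 * io_timeout   # per David's note below

last_success, failed_at = 3239, 3319       # timestamps from the log
elapsed = failed_at - last_success
print(elapsed)                              # -> 80
print(elapsed == id_renewal_fail_seconds)   # -> True
```

So a GFS2 recovery that blocks IO for more than 80 seconds is enough to lose the lease at the default timeout.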
So far, all fine. But when I restart sanlock and libvirtd, it takes about 2 × 30 seconds = 1 minute before libvirtd is usable; "virsh list" hangs during this time. I can still live with that... But it gets worse after a reboot, when a "virsh list" even takes a couple of minutes (about 5 minutes) before it responds. After this initial delay, virsh responds normally, so it looks like an initialization issue to me.
Is this a configuration issue, a bug, or expected behavior?
Each libvirtd instance has a lease that it owns. When restarting, libvirtd tries to acquire this lease. I don't really understand why, but sanlock sometimes has to wait a very long time between starting and completing its lease acquisition. You'll probably have to ask the sanlock developers for an explanation of why this happens.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
I increased the sanlock io_timeout to 30 seconds (default = 10), because the sanlock dir is on a GFS2 volume and can be blocked for some time while fencing and journal recovery takes place.
This is one of the reasons why you should not put sanlock leases on gfs2. They should be put directly on a shared block device.
Yes, all the timeouts are derived from the io_timeout and are dictated by the recovery requirements and the algorithm the host_id leases are based on: "Light-Weight Leases for Storage-Centric Coordination" by Gregory Chockler and Dahlia Malkhi.
Here are the actual equations, copied from sanlock_internal.h. "delta" refers to the host_id leases, which can take a long time to acquire at startup; "free" corresponds to starting up after a clean shutdown; "held" corresponds to starting up after an unclean shutdown.
You should find that with 30 sec io timeout these come out to 1 min / 4 min which you see when starting after a clean / unclean shutdown.
 * io_timeout_seconds: defined by us
 *
 * id_renewal_seconds: defined by us
 *
 * id_renewal_fail_seconds: defined by us
 *
 * watchdog_fire_timeout: /dev/watchdog will fire without being petted this long
 * = 60 constant
 *
 * host_dead_seconds: the length of time from the last successful host_id
 * renewal until that host is killed by its watchdog.
 * = id_renewal_fail_seconds + watchdog_fire_timeout
 *
 * delta_large_delay: from the algorithm
 * = id_renewal_seconds + (6 * io_timeout_seconds)
 *
 * delta_short_delay: from the algorithm
 * = 2 * io_timeout_seconds
 *
 * delta_acquire_held_max: max time it can take to successfully
 * acquire a non-free delta lease
 * = io_timeout_seconds (read) +
 *   max(delta_large_delay, host_dead_seconds) +
 *   io_timeout_seconds (read) +
 *   io_timeout_seconds (write) +
 *   delta_short_delay +
 *   io_timeout_seconds (read)
 *
 * delta_acquire_held_min: min time it can take to successfully
 * acquire a non-free delta lease
 * = max(delta_large_delay, host_dead_seconds)
 *
 * delta_acquire_free_max: max time it can take to successfully
 * acquire a free delta lease.
 * = io_timeout_seconds (read) +
 *   io_timeout_seconds (write) +
 *   delta_short_delay +
 *   io_timeout_seconds (read)
 *
 * delta_acquire_free_min: min time it can take to successfully
 * acquire a free delta lease.
 * = delta_short_delay
 *
 * delta_renew_max: max time it can take to successfully
 * renew a delta lease.
 * = io_timeout_seconds (read) +
 *   io_timeout_seconds (write)
 *
 * delta_renew_min: min time it can take to successfully
 * renew a delta lease.
 * = 0
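To make the numbers concrete, here is a small sketch that evaluates the acquisition bounds above for io_timeout = 30. The multipliers for id_renewal_seconds (2 * io_timeout) and id_renewal_fail_seconds (8 * io_timeout) are assumptions: the latter matches what David states later in the thread about NFS tuning, the former is not confirmed here.

```python
# Evaluate the delta-lease acquisition bounds from sanlock_internal.h.
# ASSUMED (not in the quoted comment): id_renewal_seconds = 2 * io_timeout
# and id_renewal_fail_seconds = 8 * io_timeout.

WATCHDOG_FIRE_TIMEOUT = 60  # constant

def delta_lease_bounds(io_timeout):
    id_renewal = 2 * io_timeout           # assumed multiplier
    id_renewal_fail = 8 * io_timeout      # assumed multiplier
    host_dead = id_renewal_fail + WATCHDOG_FIRE_TIMEOUT
    large_delay = id_renewal + 6 * io_timeout
    short_delay = 2 * io_timeout
    return {
        # clean shutdown ("free" lease): read + write + short_delay + read
        "free_min": short_delay,
        "free_max": 3 * io_timeout + short_delay,
        # unclean shutdown ("held" lease): 3 reads + 1 write around the delay
        "held_min": max(large_delay, host_dead),
        "held_max": 4 * io_timeout + max(large_delay, host_dead) + short_delay,
    }

for name, secs in sorted(delta_lease_bounds(30).items()):
    print(f"{name}: {secs} s")
```

With io_timeout = 30 this gives free_min = 60 s, matching the ~1 minute clean restart; the held case comes out at 5-8 minutes under these assumed multipliers, the same ballpark as the post-fence delay (David quotes 4 minutes, so the exact id_renewal multipliers may differ in this sanlock version).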
On Tue, Mar 13, 2012 at 5:08 PM, David Teigland teigland@redhat.com wrote:
Thanks! This information explains the differences in delay I encounter between a clean and unclean situation. The reboot delay was effectively after a fence operation, so an unclean restart.
I guess having a delta_acquire_held_min of 300 seconds is to be sure that no host with this host_id would acquire the lock in the meantime.
I'm not sure if this still makes sense on top of GFS2, but it reminds me of the fact that you said sanlock was meant to be used with a block device. Maybe this is something the libvirtd devs need to be aware of. I'll start a discussion about this on the libvirt list, as the person who tried to help me didn't understand the delays either.
sanlock-devel mailing list
sanlock-devel@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/sanlock-devel
On Tue, Mar 13, 2012 at 5:46 PM, Frido Roose frido_roose@trimble.comwrote:
On Tue, Mar 13, 2012 at 5:08 PM, David Teigland teigland@redhat.comwrote:
This is one of the reasons why you should not put sanlock leases on gfs2. They should be put directly on a shared block device.
Ok, I looked over this when answering, but I was thinking the same...
On Tue, Mar 13, 2012 at 05:46:00PM +0100, Frido Roose wrote:
The suggested libvirt sanlock setup uses NFS. When using NFS, you probably want to tune your io timeout so that id_renewal_fail_seconds is greater than the time it takes your NFS server to restart. id_renewal_fail_seconds is 8 * io_timeout, so with a 10 sec io_timeout, your NFS server would need to restart within 80 seconds, otherwise the VMs would be killed.
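A quick way to turn that advice around and pick an io_timeout for a known worst-case storage outage (a sketch using only the 8 * io_timeout relation above; the helper name is made up):

```python
def min_io_timeout(outage_seconds):
    """Smallest whole-second io_timeout whose renewal-failure window
    (8 * io_timeout, per the paragraph above) outlasts an I/O outage
    of outage_seconds."""
    # We need 8 * io_timeout > outage_seconds.
    return outage_seconds // 8 + 1

# The default io_timeout of 10 tolerates outages under 80 s. For an
# NFS restart (or GFS2 fencing + journal recovery) that can block I/O
# for up to 3 minutes:
print(min_io_timeout(180))   # -> 23
```

Remember the trade-off from earlier in the thread: a larger io_timeout also inflates every startup and acquisition delay derived from it.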
On Tue, Mar 13, 2012 at 6:12 PM, David Teigland teigland@redhat.com wrote:
Yes, and for GFS2 it is similar, except it's not about the time to restart but the time to handle the locking (fencing, journal recovery, ...), which blocks access to the volume for some time. I'm not that fond of depending on an NFS cluster resource, with its own difficulties, when you already have the whole clustering infrastructure for your virtual environment; adding a GFS2 volume is then very easy and clean.
But now I have a better understanding of how it all fits together and why it behaves like that.
Thanks for your useful help!