Hello,
I'm not sure if this is the appropriate place to ask, but we're running libvirtd with the sanlock plugin. After a while, I noticed the following in the sanlock logfile after I couldn't create a new VM:
838555 sanlock daemon started aio 1 10 renew 20 80 host 144ce1f8-ddf7-4139-8e97-291aeddd7b81.arqua time 1330086549
838556 s1 lockspace __LIBVIRT__DISKS__:13:/var/lib/libvirt/sanlock/__LIBVIRT__DISKS__:0
838616 s1:r1 resource __LIBVIRT__DISKS__:7cf11c7a5da5d530521646d738636e84:/var/lib/libvirt/sanlock/7cf11c7a5da5d530521646d738636e84:0 for 1,9,31377
838632 s1:r2 resource __LIBVIRT__DISKS__:b2ad1e0874ba7c316cc848d3e0a98439:/var/lib/libvirt/sanlock/b2ad1e0874ba7c316cc848d3e0a98439:0 for 2,12,31499
839477 s1 check_our_lease warning 60 last_success 839417
839478 s1 check_our_lease warning 61 last_success 839417
839479 s1 check_our_lease warning 62 last_success 839417
839480 s1 check_our_lease warning 63 last_success 839417
839481 s1 check_our_lease warning 64 last_success 839417
839482 s1 check_our_lease warning 65 last_success 839417
839483 s1 check_our_lease warning 66 last_success 839417
839484 s1 check_our_lease warning 67 last_success 839417
839485 s1 check_our_lease warning 68 last_success 839417
839486 s1 check_our_lease warning 69 last_success 839417
839487 s1 check_our_lease warning 70 last_success 839417
839488 s1 check_our_lease warning 71 last_success 839417
839489 s1 check_our_lease warning 72 last_success 839417
839490 s1 check_our_lease warning 73 last_success 839417
839491 s1 check_our_lease warning 74 last_success 839417
839492 s1 check_our_lease warning 75 last_success 839417
839493 s1 check_our_lease warning 76 last_success 839417
839494 s1 check_our_lease warning 77 last_success 839417
839495 s1 check_our_lease warning 78 last_success 839417
839496 s1 check_our_lease warning 79 last_success 839417
839497 s1 check_our_lease failed 80
841667 s1 renewed 841667 delta_length 2229 too long
841671 r3 cmd_acquire 1,9,7234 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
841672 r4 cmd_acquire 2,10,7268 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
841674 r5 cmd_acquire 2,10,7765 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
841675 r6 cmd_acquire 1,9,7735 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842012 r7 cmd_acquire 1,9,8495 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842013 r8 cmd_acquire 1,9,8622 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842015 r9 cmd_acquire 1,9,9130 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842016 r10 cmd_acquire 1,9,9241 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842019 r11 cmd_acquire 1,9,9669 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842020 r12 cmd_acquire 2,10,9697 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842021 r13 cmd_acquire 1,9,10146 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842022 r14 cmd_acquire 1,9,10240 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842115 r15 cmd_acquire 1,9,10755 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842117 r16 cmd_acquire 1,9,11008 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842410 r17 cmd_acquire 1,9,11511 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
842410 r18 cmd_acquire 1,9,11616 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
843609 r19 cmd_acquire 1,9,12835 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
843616 r20 cmd_acquire 1,9,13093 invalid lockspace found -1 failed 0 name __LIBVIRT__DISKS__
So for some reason, the lockspace became invalid (and it looks like the host lease was not renewed). Running sanlock-1.8-2.el6.x86_64.
Anything else I can check?
Best regards, Frido
On Tue, Feb 28, 2012 at 04:40:04PM +0100, Frido Roose wrote:
839496 s1 check_our_lease warning 79 last_success 839417
839497 s1 check_our_lease failed 80
841667 s1 renewed 841667 delta_length 2229 too long
It looks like i/o to your storage blocked for 2229 seconds, which is much longer than the 80 seconds it's given. If this is on nfs, you might check if the nfs server was down.
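The numbers are consistent with that: per the "renew 20 80" values on the daemon-start line, a renewal attempt would have been issued around 839438 (last_success 839417 plus the 20-second renewal interval), and it only completed at 841667, 2229 seconds later. While the daemon is running you can inspect this state with the sanlock client; the output details vary by version, but roughly:

    sanlock client status      # show lockspaces and held resource leases
    sanlock client log_dump    # dump the daemon's internal debug log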
On Tue, Feb 28, 2012 at 5:00 PM, David Teigland teigland@redhat.com wrote:
On Tue, Feb 28, 2012 at 04:40:04PM +0100, Frido Roose wrote:
839496 s1 check_our_lease warning 79 last_success 839417
839497 s1 check_our_lease failed 80
841667 s1 renewed 841667 delta_length 2229 too long
It looks like i/o to your storage blocked for 2229 seconds, which is much longer than the 80 seconds it's given. If this is on nfs, you might check if the nfs server was down.
Thanks for your quick reply. The lockspace is on a GFS2 volume, but you are right that we had an issue with the GFS2 volume that was hanging for some reason:
Feb 24 13:49:23 maraikoh kernel: INFO: task gfs2_quotad:6113 blocked for more than 120 seconds.
Feb 24 13:49:23 maraikoh kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 24 13:49:23 maraikoh kernel: gfs2_quotad D 0000000000000005 0 6113 2 0x00000080
Feb 24 13:49:23 maraikoh kernel: ffff880279901c20 0000000000000046 ffff880279901b90 ffffffffa043cd3d
Feb 24 13:49:23 maraikoh kernel: 0000000000000000 ffff880279fd9800 ffff880279901c50 ffffffffa043b4e6
Feb 24 13:49:23 maraikoh kernel: ffff8801f6ca30b8 ffff880279901fd8 000000000000f4e8 ffff8801f6ca30b8
Feb 24 13:49:23 maraikoh kernel: Call Trace:
Feb 24 13:49:23 maraikoh kernel: [<ffffffffa043cd3d>] ? dlm_put_lockspace+0x1d/0x40 [dlm]
That will have caused the timeout for sanlock.
Thanks for explaining this!
Best regards, Frido
On Fri, Mar 02, 2012 at 10:51:31AM +0100, Frido Roose wrote:
On Tue, Feb 28, 2012 at 5:00 PM, David Teigland teigland@redhat.com wrote:
On Tue, Feb 28, 2012 at 04:40:04PM +0100, Frido Roose wrote:
839496 s1 check_our_lease warning 79 last_success 839417
839497 s1 check_our_lease failed 80
841667 s1 renewed 841667 delta_length 2229 too long
It looks like i/o to your storage blocked for 2229 seconds, which is much longer than the 80 seconds it's given. If this is on nfs, you might check if the nfs server was down.
Thanks for your quick reply. The lockspace is on a GFS2 volume, but you are right that we had an issue with the GFS2 volume that was hanging for some reason:
sanlock is designed to be used on a shared block device directly. Using it on top of gfs2 doesn't make much sense.
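If you did have a shared LV or partition available, pointing sanlock at it directly would look roughly like this; the device path /dev/shared_vg/leases is only an example, following the lockspace_name:host_id:path:offset form from sanlock(8):

    # one-time initialization of a lockspace on a shared block device
    sanlock direct init -s __LIBVIRT__DISKS__:0:/dev/shared_vg/leases:0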
On Fri, Mar 2, 2012 at 3:52 PM, David Teigland teigland@redhat.com wrote:
On Fri, Mar 02, 2012 at 10:51:31AM +0100, Frido Roose wrote:
On Tue, Feb 28, 2012 at 5:00 PM, David Teigland teigland@redhat.com wrote:
On Tue, Feb 28, 2012 at 04:40:04PM +0100, Frido Roose wrote:
839496 s1 check_our_lease warning 79 last_success 839417
839497 s1 check_our_lease failed 80
841667 s1 renewed 841667 delta_length 2229 too long
It looks like i/o to your storage blocked for 2229 seconds, which is much longer than the 80 seconds it's given. If this is on nfs, you might check if the nfs server was down.
Thanks for your quick reply. The lockspace is on a GFS2 volume, but you are right that we had an issue with the GFS2 volume that was hanging for some reason:
sanlock is designed to be used on a shared block device directly. Using it on top of gfs2 doesn't make much sense.
That's how libvirtd implements it (http://libvirt.org/locking.html). It expects a directory configured by the disk_lease_dir parameter, not a block device. I don't see why NFS would make more sense than GFS2, since NFS also exports a directory?
The result is a shared disk over all cluster nodes:
# ls -l /var/lib/libvirt/sanlock/
total 13364
-rw------- 1 root root 1048576 Feb 24 14:53 144144defb21c092062750f6c91d91b4
-rw------- 1 root root 1048576 Feb 17 22:53 15b60674efbfa8ea119962de17968bbf
-rw------- 1 root root 1048576 Feb 28 16:48 5a229925efd128f4737bbb4e0772543a
-rw------- 1 root root 1048576 Mar  1 13:47 5a7bab89aefafd8460975ebdd18eab4d
-rw------- 1 root root 1048576 Feb 24 14:53 7cf11c7a5da5d530521646d738636e84
-rw------- 1 root root 1048576 Feb 28 16:22 a2f9ae7d144b7a2a01315db2fb278a09
-rw------- 1 root root 1048576 Feb 24 14:53 b2ad1e0874ba7c316cc848d3e0a98439
-rw------- 1 root root 1048576 Feb 28 16:45 c8c1d70159e8cc4b591e1b03eb525390
-rw------- 1 root root 1048576 Feb 28 16:50 ce9ddbfff85d9fa320ad12d752cc5a71
-rw------- 1 root root 1048576 Feb 27 16:08 d4c9f7bf3a6b41be8a501ac8362cf70f
-rw------- 1 root root 1048576 Feb 24 16:58 fc3d3741342f06ecbc642eddc165b4c6
-rw------- 1 root root 1048576 Feb 28 16:49 ff4710dfd7f79caeac8a87603ec312f2
-rw------- 1 root root 1048576 Mar  2 16:13 __LIBVIRT__DISKS__
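For context, the configuration that produces this layout is roughly the following (per http://libvirt.org/locking.html); the host_id value 13 matches the lockspace line in the log above, but it is per-host:

    # /etc/libvirt/qemu.conf
    lock_manager = "sanlock"

    # /etc/libvirt/qemu-sanlock.conf
    auto_disk_leases = 1
    disk_lease_dir = "/var/lib/libvirt/sanlock"
    host_id = 13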
sanlock is designed to be used on a shared block device directly. Using it on top of gfs2 doesn't make much sense.
That's how libvirtd implements it (http://libvirt.org/locking.html). It expects a directory configured by the disk_lease_dir parameter, not a block device.
It should probably gain the ability to use block devices.
I don't see why NFS would make more sense than GFS2, since NFS also exports a directory?
The main reason to use sanlock is in cases where you don't have clustering capabilities to do locking/coordination more directly.
libvirt could also gain the ability to use the cluster/locking capabilities of gfs2 directly, e.g. it could just take an flock() on the file.
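A minimal sketch of that idea in C (not libvirt code, just the plain flock(2) pattern; on gfs2, without the localflocks mount option, flock is cluster-wide via the DLM):

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Take an exclusive, non-blocking lock on a disk image before
     * starting the guest.  Returns the open fd on success (keep it
     * open to hold the lock), -1 if the image is locked elsewhere. */
    int lock_disk(const char *path)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;
        if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }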