Hi David,
I'm working on minimizing the time to stop vdsm domain monitors [1]. When we
stop a monitor, the last thing it does is release the domain host id by
calling the Python rem_lockspace API in sync mode.
In my test, I have a system with 30 domain monitor threads. Signaling the
threads to stop takes about 10 milliseconds, but joining the threads takes
about 10 seconds.
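
For illustration, here is a minimal sketch of the signal-then-join pattern I am
measuring. It is not vdsm code: the Monitor class, its release_delay parameter
and the stop_all helper are hypothetical, and time.sleep stands in for the
blocking rem_lockspace call.

```python
import threading
import time


class Monitor(threading.Thread):
    """Hypothetical stand-in for a domain monitor thread (not vdsm code)."""

    def __init__(self, release_delay):
        super().__init__(daemon=True)
        self._stop_event = threading.Event()
        self._release_delay = release_delay

    def run(self):
        # Idle until asked to stop, then do the slow cleanup step.
        self._stop_event.wait()
        self._release_host_id()

    def _release_host_id(self):
        # Simulates a blocking call such as a synchronous rem_lockspace.
        time.sleep(self._release_delay)

    def stop(self):
        # Signaling only sets a flag, so it returns immediately.
        self._stop_event.set()


def stop_all(monitors):
    """Signal every monitor first, then join them all; return both timings."""
    start = time.monotonic()
    for m in monitors:
        m.stop()
    signaled = time.monotonic()
    for m in monitors:
        m.join()
    joined = time.monotonic()
    return signaled - start, joined - signaled


monitors = [Monitor(release_delay=0.2) for _ in range(30)]
for m in monitors:
    m.start()
signal_time, join_time = stop_all(monitors)
print("signal: %.3fs, join: %.3fs" % (signal_time, join_time))
```

Signaling is cheap because it only sets an event; the join time is bounded
below by the slowest cleanup call, which is why the duration of rem_lockspace
dominates the total stop time.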
Profiling vdsm reveals that the time is spent in the rem_lockspace call. In
this example, there were 30 calls (one per thread), averaging 5.560 seconds
per call (wall time).
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    30    0.000    0.000  166.829    5.561  sd.py:469(BlockStorageDomain.releaseHostId)
    30    0.001    0.000  166.829    5.561  clusterlock.py:203(SANLock.releaseHostId)
    30  166.813    5.560  166.813    5.560  sanlock:0(rem_lockspace)
Is this expected behavior?
Can we expect that removing a lockspace will be much faster?
We are concerned about this because the current stop timeout for vdsm is
10 seconds. If you have many storage domains (I have seen systems with
40 domains in production), vdsm may be killed before all monitor threads
are stopped, leading to orphaned lockspaces and, if the machine is the spm,
an orphaned spm resource. In our experience, this prevents stopping the
sanlock daemon, so you cannot upgrade vdsm without rebooting the machine.
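
To make the timeout problem concrete, here is a rough sketch of joining a
group of threads within one shared time budget and reporting which of them
did not finish in time. This is not how vdsm stops its monitors: the
join_with_deadline helper and the workload below are hypothetical.

```python
import threading
import time


def join_with_deadline(threads, timeout):
    """Join all threads within one shared time budget.

    Returns the threads that are still alive when the budget runs out.
    """
    deadline = time.monotonic() + timeout
    for t in threads:
        remaining = deadline - time.monotonic()
        # join(0) returns immediately, so an exhausted budget does not block.
        t.join(max(0.0, remaining))
    return [t for t in threads if t.is_alive()]


# Hypothetical workload: two fast cleanups and one that exceeds the budget.
release = threading.Event()
fast = [threading.Thread(target=time.sleep, args=(0.05,)) for _ in range(2)]
slow = threading.Thread(target=release.wait, daemon=True)
threads = fast + [slow]
for t in threads:
    t.start()

unfinished = join_with_deadline(threads, timeout=0.5)
print("unfinished threads:", len(unfinished))
release.set()  # let the slow thread finish so the example exits cleanly
```

Any thread left in the returned list corresponds to a monitor that would still
hold its lockspace if the process were killed at that point.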
[1]
http://gerrit.ovirt.org/27573 domainMonitor: Stop domain monitors concurrently
Thanks,
Nir