----- Original Message -----
From: "Nir Soffer" <nsoffer(a)redhat.com>
To: sanlock-devel(a)lists.fedorahosted.org
Cc: "Federico Simoncelli" <fsimonce(a)redhat.com>, "Allon
Mureinik" <amureini(a)redhat.com>, "David Teigland"
<teigland(a)redhat.com>, "Dan Kenigsberg" <danken(a)redhat.com>
Sent: Wednesday, May 14, 2014 4:29:16 PM
Subject: Releasing host id takes 4-6 seconds - is this expected behavior?
Hi David,
I'm working on minimizing the time to stop vdsm domain monitors [1]. When we
stop a monitor, the last thing it does is release the domain host id,
invoking the python rem_lockspace API in sync mode.
In my test, I have a system with 30 domain monitor threads. Signaling the
threads to stop takes about 10 milliseconds, but joining the threads takes
about 10 seconds.
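To illustrate the pattern, here is a toy sketch (not vdsm code): the Monitor
class and the 0.1-second sleep are stand-ins for the real monitor thread and
the ~5.5-second blocking rem_lockspace call, but the shape of the measurement
is the same - signaling is cheap, joining absorbs the blocking release:

```python
import threading
import time

class Monitor(threading.Thread):
    """Toy stand-in for a vdsm domain monitor thread (hypothetical)."""

    def __init__(self):
        super().__init__()
        self._stop_event = threading.Event()

    def run(self):
        # Wait until signaled to stop, then simulate the blocking
        # rem_lockspace() call (shortened to 0.1s instead of ~5.5s).
        self._stop_event.wait()
        time.sleep(0.1)

    def stop(self):
        self._stop_event.set()

monitors = [Monitor() for _ in range(30)]
for m in monitors:
    m.start()

t0 = time.monotonic()
for m in monitors:
    m.stop()                       # signaling: a few milliseconds total
signal_time = time.monotonic() - t0

t0 = time.monotonic()
for m in monitors:
    m.join()                       # joining waits for the blocking release
join_time = time.monotonic() - t0

print(f"signal: {signal_time:.3f}s  join: {join_time:.3f}s")
```

Because the blocking call releases the GIL (here a sleep, in vdsm a C call
into sanlock), the threads wait concurrently, so the join time is roughly one
release duration rather than 30 of them.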
Profiling vdsm reveals that the time is spent in the rem_lockspace call. In
this example, there were 30 calls (one per thread), and each call took 5.560
seconds (wall time).
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    30    0.000    0.000  166.829    5.561  sd.py:469(BlockStorageDomain.releaseHostId)
    30    0.001    0.000  166.829    5.561  clusterlock.py:203(SANLock.releaseHostId)
    30  166.813    5.560  166.813    5.560  sanlock:0(rem_lockspace)
Is this expected behavior?
Can we expect that removing a lockspace will be much faster?
The reason we are concerned about this is that the current stop timeout for
vdsm is 10 seconds. So if you have many storage domains (I have seen
production systems with 40 domains), vdsm may be killed before all monitor
threads are stopped, leading to an orphaned lockspace, and, if the machine is
the SPM, an orphaned SPM resource.
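Until rem_lockspace itself is faster, one way to keep the total stop time
bounded is to issue all the releases concurrently and wait with a deadline.
A hypothetical sketch (release_host_id is a placeholder for
SANLock.releaseHostId, with its blocking time shortened for the example):

```python
import concurrent.futures
import time

def release_host_id(domain):
    # Placeholder for SANLock.releaseHostId(); the real call blocks
    # for roughly 5.5 seconds, shortened here to 0.05s.
    time.sleep(0.05)
    return domain

domains = [f"domain-{i}" for i in range(40)]

with concurrent.futures.ThreadPoolExecutor(max_workers=len(domains)) as pool:
    futures = [pool.submit(release_host_id, d) for d in domains]
    # Wait at most `timeout` seconds for all releases; anything still
    # pending after the deadline shows up in `not_done`.
    done, not_done = concurrent.futures.wait(futures, timeout=1.0)

print(f"released: {len(done)}  still pending: {len(not_done)}")
```

With this shape the wall time for stopping is roughly one release duration
(plus scheduling overhead) regardless of the number of domains, and the
caller can decide what to do with any releases that miss the deadline.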
Small nit: resources are associated with a process, so the SPM resource
cannot be orphaned.
--
Federico