Releasing host id takes 4-6 seconds - is this expected behavior?

Wednesday, 14 May 2014

Hi David,

I'm working on minimizing the time to stop vdsm domain monitors [1]. When we
stop a monitor, the last thing it does is release the domain host id by
invoking the Python rem_lockspace API in sync mode.

In my test, I have a system with 30 domain monitor threads. Signaling the
threads to stop takes about 10 milliseconds, but joining the threads takes
about 10 seconds.
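
To make the measurement concrete, here is a minimal sketch of the "signal all,
then join all" pattern with separate timings for each phase. It is not vdsm
code: the Monitor class, stop_all() and run_demo() are hypothetical names, and
the blocking release is simulated with time.sleep() instead of a real sanlock
call.

```python
import threading
import time


class Monitor(threading.Thread):
    """Hypothetical stand-in for a domain monitor thread (not vdsm code)."""

    def __init__(self, release_delay):
        super().__init__()
        self._stop_requested = threading.Event()
        self._release_delay = release_delay
        self.released = False

    def run(self):
        # Idle until asked to stop, then perform the slow cleanup step.
        self._stop_requested.wait()
        time.sleep(self._release_delay)  # simulates a blocking release call
        self.released = True

    def stop(self):
        # Signal only; does not wait for the thread to finish.
        self._stop_requested.set()


def stop_all(monitors):
    """Signal every monitor first, then join them; return both durations."""
    start = time.monotonic()
    for m in monitors:
        m.stop()
    signaled = time.monotonic()
    for m in monitors:
        m.join()
    joined = time.monotonic()
    return signaled - start, joined - signaled


def run_demo(count=30, release_delay=0.05):
    """Start `count` monitors, stop them, and return the timings."""
    monitors = [Monitor(release_delay) for _ in range(count)]
    for m in monitors:
        m.start()
    signal_time, join_time = stop_all(monitors)
    return monitors, signal_time, join_time


if __name__ == "__main__":
    _, signal_time, join_time = run_demo()
    print("signal: %.3fs, join: %.3fs" % (signal_time, join_time))
```

In this sketch the simulated releases overlap, so the join phase costs roughly
one release delay. Whether real rem_lockspace calls overlap in the same way
depends on sanlock, which the sketch does not model.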

Profiling vdsm reveals that the time is spent in the rem_lockspace call. In
this example, there were 30 calls (one per thread), and each call took 5.560
seconds (wall time).

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

       30    0.000    0.000  166.829    5.561 sd.py:469(BlockStorageDomain.releaseHostId)
       30    0.001    0.000  166.829    5.561 clusterlock.py:203(SANLock.releaseHostId)
       30  166.813    5.560  166.813    5.560 sanlock:0(rem_lockspace)

Is this expected behavior?

Can we expect that removing a lockspace will be much faster?

The reason we are concerned about this is that the current stop timeout for
vdsm is 10 seconds. So if you have many storage domains (I have seen systems
with 40 domains in production), vdsm may be killed before all monitor threads
are stopped, leading to orphaned lockspaces and, if the machine is the SPM, an
orphaned SPM resource. In our experience, this prevents stopping the sanlock
daemon, so you cannot upgrade vdsm without rebooting the machine.
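
As a rough illustration of the timing concern, the sketch below compares two
simple bounds against the 10 second stop timeout, using the per-call time from
the profile above. The function name and the two bounds are my own
simplification: "sequential" assumes the releases run one after another, and
"concurrent floor" assumes they overlap perfectly, which is an optimistic
assumption and not a measured result.

```python
STOP_TIMEOUT = 10.0          # vdsm stop timeout in seconds (from this mail)
PER_CALL = 166.813 / 30      # average rem_lockspace wall time in the profile


def stop_time_bounds(domains, per_call_seconds):
    """Return (sequential, concurrent_floor) estimates in seconds.

    sequential: releases run one after another.
    concurrent_floor: releases overlap perfectly (optimistic assumption).
    """
    sequential = domains * per_call_seconds
    concurrent_floor = per_call_seconds if domains else 0.0
    return sequential, concurrent_floor


if __name__ == "__main__":
    for domains in (30, 40):
        sequential, floor = stop_time_bounds(domains, PER_CALL)
        print("%d domains: sequential %.1fs, concurrent floor %.1fs, "
              "timeout %.1fs" % (domains, sequential, floor, STOP_TIMEOUT))
```

Even the optimistic floor uses more than half of the timeout, and the measured
join time of about 10 seconds with 30 threads is already at the limit.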

[1] http://gerrit.ovirt.org/27573 domainMonitor: Stop domain monitors concurrently

Thanks,

Nir
