On Wed, May 19, 2021 at 4:59 PM Nir Soffer nsoffer@redhat.com wrote:
On Wed, May 19, 2021 at 12:42 PM Vojtech Juranek vjuranek@redhat.com wrote: ...
+static PyObject *
+disks_to_list(struct sanlk_disk *disks, uint32_t disks_count)
+{
+    PyObject *result = NULL;
+    PyObject *disk = NULL;
+
+    result = PyList_New(disks_count);
+    if (result == NULL)
+        return NULL;

A memory error was set in this case...
Thanks, I didn't realize this. This was the case that was unclear to me (or rather, that I misunderstood).
LGTM, +1 to the patch
Thanks.
Here are some tests using ovirt:
Using inquire on an idle SPM host:
pprint.pprint(sanlock.inquire(pid=1956))
[{'disks': [('/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases', 1048576)], 'flags': 1, 'lockspace': b'8ef0cc64-77c5-49e0-a0f5-03cee9bbee20', 'resource': b'SDM', 'version': 30}]
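For scripting against this output, a small helper can pick out the records belonging to one lockspace. This is a hypothetical helper, not part of the sanlock API; the record layout matches the inquire output above:

```python
# Hypothetical helper for filtering sanlock.inquire() results; not part
# of the sanlock API. Each record is a dict with 'lockspace', 'resource',
# 'flags', 'version' and 'disks' keys, as in the output above.

def resources_in_lockspace(records, lockspace):
    """Return only the records that belong to the given lockspace."""
    return [r for r in records if r["lockspace"] == lockspace]

# Sample record copied from the inquire output above.
records = [{
    "disks": [("/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases", 1048576)],
    "flags": 1,
    "lockspace": b"8ef0cc64-77c5-49e0-a0f5-03cee9bbee20",
    "resource": b"SDM",
    "version": 30,
}]

matches = resources_in_lockspace(records, b"8ef0cc64-77c5-49e0-a0f5-03cee9bbee20")
print([r["resource"] for r in matches])  # [b'SDM']
```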
After starting a cold merge task on a VM with 20 disks, each on a different storage domain. This starts 10 concurrent cold merge jobs on the SPM host, and every job acquires a volume lease.
pprint.pprint(sanlock.inquire(pid=1956))
[{'disks': [('/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases', 1048576)], 'flags': 1, 'lockspace': b'8ef0cc64-77c5-49e0-a0f5-03cee9bbee20', 'resource': b'SDM', 'version': 30}, {'disks': [('/dev/2f8c68ef-d5c1-4f34-9a6a-f7cb2f0aef0a/leases', 108003328)], 'flags': 1, 'lockspace': b'2f8c68ef-d5c1-4f34-9a6a-f7cb2f0aef0a', 'resource': b'4b7dd8cb-f61d-4840-9f84-87135d0c4b3b', 'version': 14}, {'disks': [('/dev/747ea907-c24c-43ff-a389-660c729703b0/leases', 108003328)], 'flags': 1, 'lockspace': b'747ea907-c24c-43ff-a389-660c729703b0', 'resource': b'fc34e8b4-fe7a-4d9f-a7e9-d4e4637d24e6', 'version': 15}, {'disks': [('/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', 108003328)], 'flags': 1, 'lockspace': b'71702e1e-82d2-41d2-93df-678f2415da2a', 'resource': b'e181720a-9edf-46cb-b5ff-83e986cb3db3', 'version': 14}, {'disks': [('/dev/31e3675e-302a-4a22-ad9a-402c0c1e9765/leases', 108003328)], 'flags': 1, 'lockspace': b'31e3675e-302a-4a22-ad9a-402c0c1e9765', 'resource': b'3003f575-c38b-4034-8c54-3f000239b5c9', 'version': 15}, {'disks': [('/dev/f5597bd4-5c4f-4c4c-bd98-e8a33dace830/leases', 108003328)], 'flags': 1, 'lockspace': b'f5597bd4-5c4f-4c4c-bd98-e8a33dace830', 'resource': b'ba28ed6c-a0d9-4f89-821a-73a992d3a074', 'version': 15}, {'disks': [('/dev/e673e145-d644-44ac-a416-a8fc14b029ca/leases', 108003328)], 'flags': 1, 'lockspace': b'e673e145-d644-44ac-a416-a8fc14b029ca', 'resource': b'bc8556b2-095a-429c-9a61-c46f6390c022', 'version': 16}, {'disks': [('/dev/06e43bfc-2ffe-419e-843b-59a5d23165e1/leases', 108003328)], 'flags': 1, 'lockspace': b'06e43bfc-2ffe-419e-843b-59a5d23165e1', 'resource': b'8fb95469-16cd-406b-863d-2b76c2393711', 'version': 16}, {'disks': [('/dev/829b9bed-1b74-4f74-9d18-1006c87ef620/leases', 108003328)], 'flags': 1, 'lockspace': b'829b9bed-1b74-4f74-9d18-1006c87ef620', 'resource': b'75b3f9c9-32e0-4b1e-a367-a3a3f1ab2950', 'version': 16}, {'disks': [('/dev/7cf51212-8247-41b0-ab38-da5954059d7a/leases', 108003328)], 'flags': 1, 'lockspace': 
b'7cf51212-8247-41b0-ab38-da5954059d7a', 'resource': b'350353df-0434-4668-abdf-d7d55d5f0f49', 'version': 14}]
When all jobs finished:
pprint.pprint(sanlock.inquire(pid=1956))
[{'disks': [('/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases', 1048576)], 'flags': 1, 'lockspace': b'8ef0cc64-77c5-49e0-a0f5-03cee9bbee20', 'resource': b'SDM', 'version': 30}]
Timing 1000 calls to sanlock.inquire() using pid:
timeit.timeit("sanlock.inquire(pid=1956)", setup="import sanlock", number=1000)
0.04519722999998521
Timing 1000 calls to sanlock inquire using slkfd:
fd = sanlock.register()
fd
4
pprint.pprint(sanlock.inquire(slkfd=4))
[]
timeit.timeit("sanlock.inquire(slkfd=4)", setup="import sanlock", number=1000)
0.04666791799991188
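For reference, both timeit totals work out to roughly the same per-call latency, around 45-47 microseconds:

```python
# Per-call latency derived from the timeit totals above (1000 calls each).
pid_total = 0.04519722999998521    # inquire(pid=1956)
slkfd_total = 0.04666791799991188  # inquire(slkfd=4)

pid_us = pid_total / 1000 * 1e6    # seconds per call -> microseconds
slkfd_us = slkfd_total / 1000 * 1e6
print(round(pid_us, 1), round(slkfd_us, 1))  # 45.2 46.7
```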
Testing 1000000 calls to see how sanlock is affected: the client uses about 50% CPU, and sanlock about 75% CPU. No change was seen in the memory usage of the Python process.
Testing overloading sanlock during acquire/release operations: I ran 10000 inquire calls while running 10 concurrent cold merges in ovirt.
One of the inquire calls failed with EBUSY:
timeit.timeit("sanlock.inquire(pid=1956)", setup="import sanlock", number=100000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/timeit.py", line 233, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/usr/lib64/python3.6/timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
sanlock.SanlockException: (16, 'Inquire error', 'Device or resource busy')
This is expected when using pid=. We will synchronize the calls to sanlock in ovirt so we don't expect to have this issue.
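For callers that cannot serialize access, a generic retry helper is one option. This is only a sketch; real code would catch sanlock.SanlockException (whose first argument is the errno) rather than OSError:

```python
import errno
import time

def call_with_retry(func, retries=3, delay=0.1):
    """Call func(), retrying while it fails with EBUSY.

    Sketch only: with sanlock, one would catch sanlock.SanlockException
    (whose first argument is the errno) instead of OSError.
    """
    for attempt in range(retries):
        try:
            return func()
        except OSError as e:
            if e.errno != errno.EBUSY or attempt == retries - 1:
                raise
            time.sleep(delay)
```

Usage would look like `call_with_retry(lambda: sanlock.inquire(pid=1956))`.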
The inquire calls cause one of the sanlock.release() calls in vdsm to fail:
2021-05-19 16:46:49,120+0300 INFO  (tasks/1) [storage.SANLock] Releasing Lease(name='c873c3e2-5da1-4614-b564-37324fefb9a1', path='/rhev/data-center/mnt/storage:_export_00/b2ff7cd6-64a6-453d-b71e-22fa4b11f16e/images/d4fa7500-306b-489b-88e7-48a3e1ff1a9b/c873c3e2-5da1-4614-b564-37324fefb9a1.lease', offset=0) (clusterlock:602)
2021-05-19 16:46:49,121+0300 ERROR (tasks/1) [storage.TaskManager.Task] (Task='352a44c2-09a2-46ae-9327-d5ae444ea312') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1856, in prepareMerge
    merge.prepare(subchainInfo)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/merge.py", line 183, in prepare
    subchain.top_vol)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/guarded.py", line 111, in __exit__
    self._release()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/guarded.py", line 128, in _release
    raise ReleaseError(errors)
vdsm.storage.guarded.ReleaseError: [ReleaseLockFailure('b2ff7cd6-64a6-453d-b71e-22fa4b11f16e', SanlockException(16, 'Sanlock resource not released', 'Device or resource busy'))]
We don't handle this in vdsm since we don't expect other processes to inquire our leases.
This left an unreleased lease in vdsm:
pprint.pprint(sanlock.inquire(pid=1956))
[{'disks': [('/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases', 1048576)], 'flags': 1, 'lockspace': b'8ef0cc64-77c5-49e0-a0f5-03cee9bbee20', 'resource': b'SDM', 'version': 30}, {'disks': [('/rhev/data-center/mnt/storage:_export_00/b2ff7cd6-64a6-453d-b71e-22fa4b11f16e/images/d4fa7500-306b-489b-88e7-48a3e1ff1a9b/c873c3e2-5da1-4614-b564-37324fefb9a1.lease', 0)], 'flags': 1, 'lockspace': b'b2ff7cd6-64a6-453d-b71e-22fa4b11f16e', 'resource': b'c873c3e2-5da1-4614-b564-37324fefb9a1', 'version': 17}]
I could release it from the python shell using the inquire result:
res = sanlock.inquire(pid=1956)
res
[{'lockspace': b'8ef0cc64-77c5-49e0-a0f5-03cee9bbee20', 'resource': b'SDM', 'flags': 1, 'version': 30, 'disks': [('/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases', 1048576)]}, {'lockspace': b'b2ff7cd6-64a6-453d-b71e-22fa4b11f16e', 'resource': b'c873c3e2-5da1-4614-b564-37324fefb9a1', 'flags': 1, 'version': 17, 'disks': [('/rhev/data-center/mnt/storage:_export_00/b2ff7cd6-64a6-453d-b71e-22fa4b11f16e/images/d4fa7500-306b-489b-88e7-48a3e1ff1a9b/c873c3e2-5da1-4614-b564-37324fefb9a1.lease', 0)]}]
sanlock.release(res[1]["lockspace"], res[1]["resource"], res[1]["disks"], pid=1956)
pprint.pprint(sanlock.inquire(pid=1956))
[{'disks': [('/dev/8ef0cc64-77c5-49e0-a0f5-03cee9bbee20/leases', 1048576)], 'flags': 1, 'lockspace': b'8ef0cc64-77c5-49e0-a0f5-03cee9bbee20', 'resource': b'SDM', 'version': 30}]
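The manual cleanup above could be generalized. Here is a hypothetical sketch that builds sanlock.release() arguments for every leftover lease except the ones we want to keep; the helper name and the keep-list are assumptions, not vdsm code:

```python
def release_args(records, keep=(b"SDM",)):
    """Yield (lockspace, resource, disks) tuples for sanlock.release(),
    skipping resources listed in keep (here the SDM lease).

    Hypothetical helper, not part of sanlock or vdsm."""
    for r in records:
        if r["resource"] not in keep:
            yield r["lockspace"], r["resource"], r["disks"]

# With the inquire result shown above, real cleanup would then do:
#   for lockspace, resource, disks in release_args(res):
#       sanlock.release(lockspace, resource, disks, pid=1956)
```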
More testing - trying to run concurrent cold merges in vdsm while a sanlock.inquire() stress test is running (a 68-second run, started 10 seconds earlier).
Vdsm failed to acquire a lock with EBUSY:
2021-05-19 17:07:14,237+0300 INFO  (tasks/0) [storage.SANLock] Acquiring Lease(name='e181720a-9edf-46cb-b5ff-83e986cb3db3', path='/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', offset=108003328) for host id 2, lvb=False (clusterlock:469)
2021-05-19 17:07:14,245+0300 ERROR (tasks/0) [storage.guarded] Error acquiring lock <VolumeLease ns=04_lease_71702e1e-82d2-41d2-93df-678f2415da2a, name=e181720a-9edf-46cb-b5ff-83e986cb3db3, mode=exclusive at 0x7f5dfc058b70> (guarded:96)
2021-05-19 17:07:14,254+0300 ERROR (tasks/0) [storage.TaskManager.Task] (Task='0e1c4eea-7b95-4d49-acc6-2f57cff88542') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 505, in acquire
    **extra_args)
sanlock.SanlockException: (16, 'Sanlock resource not acquired', 'Device or resource busy')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1856, in prepareMerge
    merge.prepare(subchainInfo)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/merge.py", line 178, in prepare
    with guarded.context(subchain.locks):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/guarded.py", line 102, in __enter__
    six.reraise(*exc)
  File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/vdsm/storage/guarded.py", line 93, in __enter__
    lock.acquire()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1607, in acquire
    dom.acquireVolumeLease(self._host_id, self._img_id, self._vol_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 548, in acquireVolumeLease
    self._domainLock.acquire(hostId, lease)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/clusterlock.py", line 510, in acquire
    "Cannot acquire %s" % (lease,), str(e))
vdsm.storage.exception.AcquireLockFailure: Cannot obtain lock: "id=71702e1e-82d2-41d2-93df-678f2415da2a, rc=16, out=Cannot acquire Lease(name='e181720a-9edf-46cb-b5ff-83e986cb3db3', path='/dev/71702e1e-82d2-41d2-93df-678f2415da2a/leases', offset=108003328), err=(16, 'Sanlock resource not acquired', 'Device or resource busy')"
So using sanlock.inquire(pid=) is risky. Only one process should manage or query resources at the same time, or all users should be prepared to retry on EBUSY.
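A minimal sketch of the serialization approach mentioned above: funnel every sanlock call through one lock, so inquire never races with acquire/release in the same process (the class name is hypothetical, and this of course does nothing about other processes):

```python
import threading

class SerializedSanlock:
    """Wrap a module-like object so all its calls run under one lock.

    Sketch of per-process synchronization: with only one in-flight
    sanlock operation at a time, our own inquire cannot race with our
    acquire/release and trigger EBUSY. Hypothetical, not vdsm code.
    """

    def __init__(self, backend):
        self._backend = backend
        self._lock = threading.Lock()

    def __getattr__(self, name):
        func = getattr(self._backend, name)
        def wrapper(*args, **kwargs):
            with self._lock:
                return func(*args, **kwargs)
        return wrapper
```

Usage would be `sl = SerializedSanlock(sanlock)` and then `sl.inquire(pid=1956)` from any thread.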
Nir