Hi,
we checked IOPS and data throughput, and it seems the hypervisor has
some problems:
fio --name TEST --eta-newline=5s --filename=temp.file --rw=randread --size=2g \
    --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 \
    --direct=1 --numjobs=32 --runtime=60 --group_reporting
TEST: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B,
ioengine=libaio, iodepth=1
...
fio-3.7
Starting 32 processes
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 32 (f=32): [r(32)][11.7%][r=16KiB/s,w=0KiB/s][r=4,w=0 IOPS][eta 00m:53s]
Jobs: 32 (f=32): [r(32)][21.7%][r=76KiB/s,w=0KiB/s][r=19,w=0 IOPS][eta 00m:47s]
Jobs: 32 (f=32): [r(32)][31.7%][r=504KiB/s,w=0KiB/s][r=126,w=0 IOPS][eta 00m:41s]
Jobs: 32 (f=32): [r(32)][41.7%][r=624KiB/s,w=0KiB/s][r=156,w=0 IOPS][eta 00m:35s]
Jobs: 32 (f=32): [r(32)][51.7%][r=676KiB/s,w=0KiB/s][r=169,w=0 IOPS][eta 00m:29s]
Jobs: 32 (f=32): [r(32)][61.7%][r=412KiB/s,w=0KiB/s][r=103,w=0 IOPS][eta 00m:23s]
Jobs: 32 (f=32): [r(32)][71.7%][r=480KiB/s,w=0KiB/s][r=120,w=0 IOPS][eta 00m:17s]
Jobs: 32 (f=32): [r(32)][81.7%][r=616KiB/s,w=0KiB/s][r=154,w=0 IOPS][eta 00m:11s]
Jobs: 32 (f=32): [r(32)][91.7%][r=620KiB/s,w=0KiB/s][r=155,w=0 IOPS][eta 00m:05s]
Jobs: 32 (f=32): [r(32)][100.0%][r=560KiB/s,w=0KiB/s][r=140,w=0 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=32): err= 0: pid=6733: Thu Nov 16 21:05:31 2023
read: IOPS=105, BW=424KiB/s (434kB/s)(24.9MiB/60199msec)
slat (usec): min=7, max=267470, avg=2490.67, stdev=21228.41
clat (msec): min=2, max=8378, avg=298.93, stdev=506.16
lat (msec): min=2, max=8378, avg=301.42, stdev=505.58
clat percentiles (msec):
| 1.00th=[ 17], 5.00th=[ 31], 10.00th=[ 48], 20.00th=[ 83],
| 30.00th=[ 114], 40.00th=[ 155], 50.00th=[ 203], 60.00th=[ 255],
| 70.00th=[ 321], 80.00th=[ 414], 90.00th=[ 550], 95.00th=[ 693],
| 99.00th=[ 2299], 99.50th=[ 3775], 99.90th=[ 8154], 99.95th=[ 8154],
| 99.99th=[ 8356]
bw ( KiB/s): min= 7, max= 64, per=3.97%, avg=16.78, stdev= 8.89, samples=3013
iops : min= 1, max= 16, avg= 4.13, stdev= 2.24, samples=3013
lat (msec) : 4=0.08%, 10=0.16%, 20=1.74%, 50=8.67%, 100=14.83%
lat (msec) : 250=33.47%, 500=28.07%, 750=8.87%, 1000=1.86%
cpu : usr=0.01%, sys=0.03%, ctx=6547, majf=0, minf=1167
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6381,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=424KiB/s (434kB/s), 424KiB/s-424KiB/s (434kB/s-434kB/s), io=24.9MiB (26.1MB), run=60199-60199msec
Disk stats (read/write):
dm-0: ios=6382/2864, merge=0/0, ticks=1918969/5453392, in_queue=7375505, util=100.00%, aggrios=6381/2792, aggrmerge=1/73, aggrticks=1907132/5325725, aggrin_queue=7233461, aggrutil=99.98%
sda: ios=6381/2792, merge=1/73, ticks=1907132/5325725, in_queue=7233461, util=99.98%
Only 105 IOPS and a few hundred KiB/s of throughput. So this does not seem to be a 389 problem.
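As a quick cross-check (a minimal sketch; the file path is a placeholder, and dd's synchronous 4 KiB writes are only a rough stand-in for fio's randread workload), something like this on the affected filesystem should also crawl if the hypervisor storage is the bottleneck:

```shell
# rough per-I/O latency check: 256 synchronous 4 KiB writes.
# On healthy storage this finishes in well under a second; with
# latencies like the fio run above it should take much longer.
# /tmp/dd-latency-test.bin is a placeholder path.
dd if=/dev/zero of=/tmp/dd-latency-test.bin bs=4k count=256 oflag=dsync
rm -f /tmp/dd-latency-test.bin
```

dd reports the effective throughput on stderr, which gives a second data point independent of fio.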
On 16.11.23 20:33, Harald Strack wrote:
Hi,
since we updated to the latest CentOS 7 version:
# rpm -qa | grep 389
389-ds-base-libs-1.3.11.1-3.el7_9.x86_64
389-ds-base-devel-1.3.11.1-3.el7_9.x86_64
389-ds-base-snmp-1.3.11.1-3.el7_9.x86_64
389-adminutil-devel-1.1.22-2.el7.x86_64
389-adminutil-1.1.22-2.el7.x86_64
389-ds-base-1.3.11.1-3.el7_9.x86_64
389-admin-1.1.46-4.el7.x86_64
# uname -r
3.10.0-1160.11.1.el7.x86_64
We are seeing strange locking (?) behaviour: we have a synchronisation
job that tried to delete about 1300 accounts, always 10 in parallel,
using some simple forking Perl / shell scripts. The pstree looks like this:
auto_sync.pl(21776)───bash(21777)───perl(21798)─┬─perl(21951)───sh(30340)───ldap_remove_use(30341)───ldapremove(30992)───ldapdelete(7489)
├─perl(22691)───sh(1015)───ldap_remove_use(1016)───ldapremove(1687)───ldapdelete(7474)
├─perl(23474)───sh(4344)───ldap_remove_use(4345)───ldapremove(5037)───ldapdelete(7453)
├─perl(24243)───sh(2113)───ldap_remove_use(2114)───ldapremove(2775)───ldapdelete(7528)
├─perl(24979)───sh(29293)───ldap_remove_use(29294)───ldapremove(29943)───ldapdelete(7514)
├─perl(25718)───sh(3190)───ldap_remove_use(3191)───ldapremove(3912)───ldapdelete(7539)
├─perl(26456)───sh(32437)───ldap_remove_use(32438)───ldapremove(624)───ldapdelete(7468)
├─perl(27193)───sh(5442)───ldap_remove_use(5443)───ldapremove(6154)───ldapdelete(7553)
├─perl(27943)───sh(7937)───ldap_remove_use(7938)───ldapremove(8598)───ldapmodify(8683)
└─perl(28681)───sh(6549)───ldap_remove_use(6550)───ldapremove(7546)───ldapmodify(7637)
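For reference, the parallel part of the sync job amounts to something like the following sketch (the uid list, the DNs, and the use of echo in place of the real ldapdelete call are all placeholders so the snippet runs anywhere):

```shell
# placeholder uid list standing in for the real sync data
printf 'user%03d\n' $(seq 1 13) > /tmp/uids.txt

# run up to 10 deletes in parallel; echo stands in for the actual
# ldapdelete invocation against the directory server
xargs -a /tmp/uids.txt -P 10 -I{} \
    echo ldapdelete "uid={},ou=People,dc=example,dc=com" > /tmp/deletes.out

wc -l < /tmp/deletes.out   # one line per simulated delete
```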
So we run 10 ldapmodify / ldapdelete calls nearly at the same time and
the server does not do anything. After 100s to 400s it returns an error:
[16/Nov/2023:17:41:52.285305117 +0100] conn=512226 op=86 MOD
dn="cn=group,ou=...."
[16/Nov/2023:17:43:32.599278565 +0100] conn=512226 op=86 RESULT err=16 tag=103 nentries=0
wtime=0.000075793 optime=100.313978009 etime=100.314051783 csn=655646bb000517e90000
[16/Nov/2023:17:11:43.331110511 +0100] conn=509941 op=2 DEL
dn="uid=testuser,ou=People,dc=..."
[16/Nov/2023:17:18:24.325913462 +0100] conn=509941 op=2 RESULT err=1 tag=107 nentries=0
wtime=0.000228257 optime=400.994827179 etime=400.995050827 csn=65564073000017e90000
[16/Nov/2023:17:18:24.326834055 +0100] conn=509941 op=3 UNBIND
causing the client to report:
ldap_delete: Operations error (1)
Some modifies did work, but they were very slow too. Only killing and
restarting ns-slapd helped. It's not strictly reproducible; it happens
after a while...
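To get an overview of how widespread the stalls are, one could filter the access log for slow operations by their etime, e.g. (the real log lives under /var/log/dirsrv/slapd-&lt;instance&gt;/access on CentOS 7; the sample here reuses the two RESULT lines above so the snippet is self-contained):

```shell
# sample input: the two RESULT lines from the log excerpt above
cat > /tmp/access.sample <<'EOF'
[16/Nov/2023:17:43:32.599278565 +0100] conn=512226 op=86 RESULT err=16 tag=103 nentries=0 wtime=0.000075793 optime=100.313978009 etime=100.314051783 csn=655646bb000517e90000
[16/Nov/2023:17:18:24.325913462 +0100] conn=509941 op=2 RESULT err=1 tag=107 nentries=0 wtime=0.000228257 optime=400.994827179 etime=400.995050827 csn=65564073000017e90000
EOF

# print RESULT lines whose etime exceeds 10 seconds
awk -F'etime=' '/RESULT/ && ($2 + 0) > 10 {print}' /tmp/access.sample
```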
Since we have other problems with this version as well (see "389 DS
memory growth"; the versions before worked perfectly for years!), I am
thinking about migrating the whole cluster to a Debian-based system
with a more recent 389 version. We also run some Debian 11 based 389
instances and some IPAs in podman on Rocky, with no problems at all.
Or are there any other hints we could try to work around this
strange behaviour on 1.3.11.1-3?
br
Harald
--
Harald Strack
Geschäftsführer
ssystems GmbH
Kastanienallee 74
10435 Berlin