Hi Harold,
Can you please check the DS errors log when you start seeing the err=1
on deletes? I suspect it is a database retry count error, and the
errors log should provide more information on the err=1. The RI
(referential integrity) plugin could also be involved here. Note that
err=16 is "no such attribute": the modify tried to remove a value that
was already absent from the entry, which is harmless. It's the
delete/err=1 that we need to investigate.
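As a sketch of what to look for, the errors log can be scanned for retry and
deadlock messages like this (the path assumes a default instance layout, so
adjust slapd-INSTANCE as needed):

```shell
# Sketch: look for database retry / deadlock messages in the DS errors log.
# The glob assumes the default /var/log/dirsrv/slapd-<instance>/ layout.
grep -iE 'retry|deadlock|DB_LOCK' /var/log/dirsrv/slapd-*/errors | tail -n 50
```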
Another useful piece of information would be to get several stack traces
from 389-ds-base when the problem starts to happen:
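A sketch of how such traces could be captured with gdb, run while the deletes
are hanging (assumes gdb and the matching 389-ds-base debuginfo packages are
installed; file names and the 10s interval are arbitrary choices):

```shell
# Sketch: capture three full-thread backtraces of ns-slapd, ~10s apart.
# Requires gdb and the 389-ds-base debuginfo packages.
PID=$(pidof ns-slapd)
for i in 1 2 3; do
    gdb -batch -ex 'thread apply all bt full' /usr/sbin/ns-slapd "$PID" \
        > "stacktrace.$i.txt" 2>&1
    sleep 10
done
```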
Hi,
we checked IOPS and data throughput and it seems the hypervisor
has some problems:
fio --name TEST --eta-newline=5s --filename=temp.file --rw=randread --size=2g \
    --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 \
    --direct=1 --numjobs=32 --runtime=60 --group_reporting
TEST: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B,
ioengine=libaio, iodepth=1
...
fio-3.7
Starting 32 processes
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 32 (f=32): [r(32)][11.7%][r=16KiB/s,w=0KiB/s][r=4,w=0 IOPS][eta 00m:53s]
Jobs: 32 (f=32): [r(32)][21.7%][r=76KiB/s,w=0KiB/s][r=19,w=0 IOPS][eta 00m:47s]
Jobs: 32 (f=32): [r(32)][31.7%][r=504KiB/s,w=0KiB/s][r=126,w=0 IOPS][eta 00m:41s]
Jobs: 32 (f=32): [r(32)][41.7%][r=624KiB/s,w=0KiB/s][r=156,w=0 IOPS][eta 00m:35s]
Jobs: 32 (f=32): [r(32)][51.7%][r=676KiB/s,w=0KiB/s][r=169,w=0 IOPS][eta 00m:29s]
Jobs: 32 (f=32): [r(32)][61.7%][r=412KiB/s,w=0KiB/s][r=103,w=0 IOPS][eta 00m:23s]
Jobs: 32 (f=32): [r(32)][71.7%][r=480KiB/s,w=0KiB/s][r=120,w=0 IOPS][eta 00m:17s]
Jobs: 32 (f=32): [r(32)][81.7%][r=616KiB/s,w=0KiB/s][r=154,w=0 IOPS][eta 00m:11s]
Jobs: 32 (f=32): [r(32)][91.7%][r=620KiB/s,w=0KiB/s][r=155,w=0 IOPS][eta 00m:05s]
Jobs: 32 (f=32): [r(32)][100.0%][r=560KiB/s,w=0KiB/s][r=140,w=0 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=32): err= 0: pid=6733: Thu Nov 16 21:05:31 2023
read: IOPS=105, BW=424KiB/s (434kB/s)(24.9MiB/60199msec)
slat (usec): min=7, max=267470, avg=2490.67, stdev=21228.41
clat (msec): min=2, max=8378, avg=298.93, stdev=506.16
lat (msec): min=2, max=8378, avg=301.42, stdev=505.58
clat percentiles (msec):
| 1.00th=[ 17], 5.00th=[ 31], 10.00th=[ 48], 20.00th=[ 83],
| 30.00th=[ 114], 40.00th=[ 155], 50.00th=[ 203], 60.00th=[ 255],
| 70.00th=[ 321], 80.00th=[ 414], 90.00th=[ 550], 95.00th=[ 693],
| 99.00th=[ 2299], 99.50th=[ 3775], 99.90th=[ 8154], 99.95th=[ 8154],
| 99.99th=[ 8356]
bw ( KiB/s): min= 7, max= 64, per=3.97%, avg=16.78, stdev= 8.89, samples=3013
iops : min= 1, max= 16, avg= 4.13, stdev= 2.24, samples=3013
lat (msec) : 4=0.08%, 10=0.16%, 20=1.74%, 50=8.67%, 100=14.83%
lat (msec) : 250=33.47%, 500=28.07%, 750=8.87%, 1000=1.86%
cpu : usr=0.01%, sys=0.03%, ctx=6547, majf=0, minf=1167
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6381,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=424KiB/s (434kB/s), 424KiB/s-424KiB/s (434kB/s-434kB/s), io=24.9MiB (26.1MB), run=60199-60199msec
Disk stats (read/write):
dm-0: ios=6382/2864, merge=0/0, ticks=1918969/5453392, in_queue=7375505, util=100.00%, aggrios=6381/2792, aggrmerge=1/73, aggrticks=1907132/5325725, aggrin_queue=7233461, aggrutil=99.98%
sda: ios=6381/2792, merge=1/73, ticks=1907132/5325725, in_queue=7233461, util=99.98%
Only 105 IOPS and a few hundred KiB/s of throughput, so it does not seem to be a 389 problem.
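One caveat worth noting: the run above is randread, while the hanging deletes
are bounded by fsync-heavy writes. A matching randwrite pass (a sketch; the
lower numjobs is an arbitrary choice) would show the commit latency that
slapd's transactions actually depend on:

```shell
# Sketch: same parameters as the read test, but random writes with fsync,
# which is closer to what slapd's transaction commits depend on.
fio --name TEST-W --eta-newline=5s --filename=temp.file --rw=randwrite \
    --size=2g --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 \
    --iodepth=1 --direct=1 --numjobs=4 --runtime=60 --group_reporting
```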
On 16.11.23 20:33, Harald Strack wrote:
>
> Hi,
>
> since we updated to the latest CentOS 7 Version
>
> # rpm -qa | grep 389
> 389-ds-base-libs-1.3.11.1-3.el7_9.x86_64
> 389-ds-base-devel-1.3.11.1-3.el7_9.x86_64
> 389-ds-base-snmp-1.3.11.1-3.el7_9.x86_64
> 389-adminutil-devel-1.1.22-2.el7.x86_64
> 389-adminutil-1.1.22-2.el7.x86_64
> 389-ds-base-1.3.11.1-3.el7_9.x86_64
> 389-admin-1.1.46-4.el7.x86_64
>
> # uname -r
>
> 3.10.0-1160.11.1.el7.x86_64
>
>
> We experience strange locking (?) behaviour: we have a
> synchronisation job that tries to delete about 1300 accounts, always
> 10 in parallel, using some simple forking perl / shell scripts. pstree
> looks like this:
>
>
>
> auto_sync.pl(21776)───bash(21777)───perl(21798)─┬─perl(21951)───sh(30340)───ldap_remove_use(30341)───ldapremove(30992)───ldapdelete(7489)
>                                                 ├─perl(22691)───sh(1015)───ldap_remove_use(1016)───ldapremove(1687)───ldapdelete(7474)
>                                                 ├─perl(23474)───sh(4344)───ldap_remove_use(4345)───ldapremove(5037)───ldapdelete(7453)
>                                                 ├─perl(24243)───sh(2113)───ldap_remove_use(2114)───ldapremove(2775)───ldapdelete(7528)
>                                                 ├─perl(24979)───sh(29293)───ldap_remove_use(29294)───ldapremove(29943)───ldapdelete(7514)
>                                                 ├─perl(25718)───sh(3190)───ldap_remove_use(3191)───ldapremove(3912)───ldapdelete(7539)
>                                                 ├─perl(26456)───sh(32437)───ldap_remove_use(32438)───ldapremove(624)───ldapdelete(7468)
>                                                 ├─perl(27193)───sh(5442)───ldap_remove_use(5443)───ldapremove(6154)───ldapdelete(7553)
>                                                 ├─perl(27943)───sh(7937)───ldap_remove_use(7938)───ldapremove(8598)───ldapmodify(8683)
>                                                 └─perl(28681)───sh(6549)───ldap_remove_use(6550)───ldapremove(7546)───ldapmodify(7637)
>
>
>
> So we run 10 ldapmodify / ldapdelete calls nearly at the same time
> and the server does not do anything. After 100s to 400s it returns an
> error:
>
>
> [16/Nov/2023:17:41:52.285305117 +0100] conn=512226 op=86 MOD dn="cn=group,ou=...."
> [16/Nov/2023:17:43:32.599278565 +0100] conn=512226 op=86 RESULT err=16 tag=103 nentries=0 wtime=0.000075793 optime=100.313978009 etime=100.314051783 csn=655646bb000517e90000
>
> [16/Nov/2023:17:11:43.331110511 +0100] conn=509941 op=2 DEL dn="uid=testuser,ou=People,dc=..."
> [16/Nov/2023:17:18:24.325913462 +0100] conn=509941 op=2 RESULT err=1 tag=107 nentries=0 wtime=0.000228257 optime=400.994827179 etime=400.995050827 csn=65564073000017e90000
> [16/Nov/2023:17:18:24.326834055 +0100] conn=509941 op=3 UNBIND
>
> causing the client
>
> ldap_delete: Operations error (1)
>
> Some modifies did work, but very slowly as well. Only killing and
> restarting ns-slapd helped. It is not strictly reproducible; it
> happens after a while...
>
> Since we have other problems (see "389 DS memory growth") with this
> version as well (the versions before worked perfectly for years!), I am
> thinking about upgrading the whole cluster to a Debian-based system
> with a more recent version. We also run some Debian 11 based 389
> servers and some IPAs in podman with Rocky and have no problems at all.
>
> Or are there any other hints we could try to work around this strange
> behaviour on 1.3.11.1-3?
>
> br
>
> Harald
>
>
--
Harald Strack
Geschäftsführer
ssystems GmbH
Kastanienallee 74
10435 Berlin
--
_______________________________________________
389-devel mailing list -- 389-devel(a)lists.fedoraproject.org
To unsubscribe send an email to 389-devel-leave(a)lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-devel@lists.fe...
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue