Hi Harald,

Can you please check the DS errors log when you start seeing err=1 on deletes?  I suspect it is a database retry count error, and the errors log should provide more info on the err=1.  The RI plugin could also be involved here.  Note that err=16 is "type or value exists": it means whatever was being added was already present in the entry (harmless).  It's the delete/err=1 that we need to investigate.
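
For example, something along these lines should surface the relevant entries (the instance name under /var/log/dirsrv/ is a placeholder, adjust it for your server):

    # find the failing deletes in the access log (tag=107 = delete response)
    grep 'err=1 ' /var/log/dirsrv/slapd-INSTANCE/access | grep 'tag=107'
    # then look for retry/deadlock messages around the same timestamps
    grep -iE 'retry|deadlock' /var/log/dirsrv/slapd-INSTANCE/errors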

Another useful piece of information would be to get several stack traces from 389-ds-base when the problem starts to happen:

https://www.port389.org/docs/389ds/FAQ/faq.html#sts=Debugging%C2%A0Hangs
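
If it's quicker than the full FAQ procedure, something like this should also work to grab a trace (assuming gdb and the 389-ds-base debuginfo packages are installed; run it a few times, ~30 seconds apart, while the deletes are hung, so we can see whether the threads are moving):

    gdb -ex 'set confirm off' -ex 'set pagination off' \
        -ex 'thread apply all bt full' -ex 'quit' \
        /usr/sbin/ns-slapd $(pidof ns-slapd) > stacktrace.$(date +%s).txt 2>&1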

Thanks,

Mark

On 11/16/23 3:32 PM, Harald Strack wrote:

Hi,

we checked IOPS and data throughput, and it seems that the hypervisor has some problems:

fio --name TEST --eta-newline=5s --filename=temp.file --rw=randread --size=2g --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=32 --runtime=60 --group_reporting
TEST: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 32 processes
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 32 (f=32): [r(32)][11.7%][r=16KiB/s,w=0KiB/s][r=4,w=0 IOPS][eta 00m:53s]   
Jobs: 32 (f=32): [r(32)][21.7%][r=76KiB/s,w=0KiB/s][r=19,w=0 IOPS][eta 00m:47s]  
Jobs: 32 (f=32): [r(32)][31.7%][r=504KiB/s,w=0KiB/s][r=126,w=0 IOPS][eta 00m:41s]
Jobs: 32 (f=32): [r(32)][41.7%][r=624KiB/s,w=0KiB/s][r=156,w=0 IOPS][eta 00m:35s]  
Jobs: 32 (f=32): [r(32)][51.7%][r=676KiB/s,w=0KiB/s][r=169,w=0 IOPS][eta 00m:29s]  
Jobs: 32 (f=32): [r(32)][61.7%][r=412KiB/s,w=0KiB/s][r=103,w=0 IOPS][eta 00m:23s]  
Jobs: 32 (f=32): [r(32)][71.7%][r=480KiB/s,w=0KiB/s][r=120,w=0 IOPS][eta 00m:17s]
Jobs: 32 (f=32): [r(32)][81.7%][r=616KiB/s,w=0KiB/s][r=154,w=0 IOPS][eta 00m:11s]  
Jobs: 32 (f=32): [r(32)][91.7%][r=620KiB/s,w=0KiB/s][r=155,w=0 IOPS][eta 00m:05s]  
Jobs: 32 (f=32): [r(32)][100.0%][r=560KiB/s,w=0KiB/s][r=140,w=0 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=32): err= 0: pid=6733: Thu Nov 16 21:05:31 2023
  read: IOPS=105, BW=424KiB/s (434kB/s)(24.9MiB/60199msec)
   slat (usec): min=7, max=267470, avg=2490.67, stdev=21228.41
   clat (msec): min=2, max=8378, avg=298.93, stdev=506.16
    lat (msec): min=2, max=8378, avg=301.42, stdev=505.58
   clat percentiles (msec):
    |  1.00th=[   17],  5.00th=[   31], 10.00th=[   48], 20.00th=[   83],
    | 30.00th=[  114], 40.00th=[  155], 50.00th=[  203], 60.00th=[  255],
    | 70.00th=[  321], 80.00th=[  414], 90.00th=[  550], 95.00th=[  693],
    | 99.00th=[ 2299], 99.50th=[ 3775], 99.90th=[ 8154], 99.95th=[ 8154],
    | 99.99th=[ 8356]
  bw (  KiB/s): min=    7, max=   64, per=3.97%, avg=16.78, stdev= 8.89, samples=3013
  iops        : min=    1, max=   16, avg= 4.13, stdev= 2.24, samples=3013
 lat (msec)   : 4=0.08%, 10=0.16%, 20=1.74%, 50=8.67%, 100=14.83%
 lat (msec)   : 250=33.47%, 500=28.07%, 750=8.87%, 1000=1.86%
 cpu          : usr=0.01%, sys=0.03%, ctx=6547, majf=0, minf=1167
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued rwts: total=6381,0,0,0 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  READ: bw=424KiB/s (434kB/s), 424KiB/s-424KiB/s (434kB/s-434kB/s), io=24.9MiB (26.1MB), run=60199-60199msec

Disk stats (read/write):
   dm-0: ios=6382/2864, merge=0/0, ticks=1918969/5453392, in_queue=7375505, util=100.00%, aggrios=6381/2792, aggrmerge=1/73, aggrticks=1907132/5325725, aggrin_queue=7233461, aggrutil=99.98%
 sda: ios=6381/2792, merge=1/73, ticks=1907132/5325725, in_queue=7233461, util=99.98%

Only 105 IOPS and a few hundred KiB/s of I/O, so it does not seem to be a 389 problem.
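
Since the delete workload is write-heavy on the DS side, the same test with random writes (just --rw=randwrite instead of --rw=randread, otherwise the same flags) would probably be even more representative, e.g.:

    fio --name TEST-WRITE --eta-newline=5s --filename=temp.file --rw=randwrite --size=2g --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=32 --runtime=60 --group_reporting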

On 16.11.23 20:33, Harald Strack wrote:

Hi,

Since we updated to the latest CentOS 7 version:

# rpm -qa | grep 389
389-ds-base-libs-1.3.11.1-3.el7_9.x86_64
389-ds-base-devel-1.3.11.1-3.el7_9.x86_64
389-ds-base-snmp-1.3.11.1-3.el7_9.x86_64
389-adminutil-devel-1.1.22-2.el7.x86_64
389-adminutil-1.1.22-2.el7.x86_64
389-ds-base-1.3.11.1-3.el7_9.x86_64
389-admin-1.1.46-4.el7.x86_64

# uname -r
3.10.0-1160.11.1.el7.x86_64


We are experiencing strange locking (?) behaviour: we have a synchronisation job that tried to delete about 1300 accounts, always 10 in parallel, using some simple forking perl/shell scripts. pstree looks like this:


auto_sync.pl(21776)───bash(21777)───perl(21798)─┬─perl(21951)───sh(30340)───ldap_remove_use(30341)───ldapremove(30992)───ldapdelete(7489)
                                                ├─perl(22691)───sh(1015)───ldap_remove_use(1016)───ldapremove(1687)───ldapdelete(7474)
                                                ├─perl(23474)───sh(4344)───ldap_remove_use(4345)───ldapremove(5037)───ldapdelete(7453)
                                                ├─perl(24243)───sh(2113)───ldap_remove_use(2114)───ldapremove(2775)───ldapdelete(7528)
                                                ├─perl(24979)───sh(29293)───ldap_remove_use(29294)───ldapremove(29943)───ldapdelete(7514)
                                                ├─perl(25718)───sh(3190)───ldap_remove_use(3191)───ldapremove(3912)───ldapdelete(7539)
                                                ├─perl(26456)───sh(32437)───ldap_remove_use(32438)───ldapremove(624)───ldapdelete(7468)
                                                ├─perl(27193)───sh(5442)───ldap_remove_use(5443)───ldapremove(6154)───ldapdelete(7553)
                                                ├─perl(27943)───sh(7937)───ldap_remove_use(7938)───ldapremove(8598)───ldapmodify(8683)
                                                └─perl(28681)───sh(6549)───ldap_remove_use(6550)───ldapremove(7546)───ldapmodify(7637)

So we run 10 ldapmodify/ldapdelete calls nearly at the same time, and the server does not do anything. After 100 to 400 seconds it returns an error:


[16/Nov/2023:17:41:52.285305117 +0100] conn=512226 op=86 MOD dn="cn=group,ou=...." 
[16/Nov/2023:17:43:32.599278565 +0100] conn=512226 op=86 RESULT err=16 tag=103 nentries=0 wtime=0.000075793 optime=100.313978009 etime=100.314051783 csn=655646bb000517e90000

[16/Nov/2023:17:11:43.331110511 +0100] conn=509941 op=2 DEL dn="uid=testuser,ou=People,dc=..." 
[16/Nov/2023:17:18:24.325913462 +0100] conn=509941 op=2 RESULT err=1 tag=107 nentries=0 wtime=0.000228257 optime=400.994827179 etime=400.995050827 csn=65564073000017e90000
[16/Nov/2023:17:18:24.326834055 +0100] conn=509941 op=3 UNBIND

causing the client to report:

ldap_delete: Operations error (1)

Some modifies did work, but they were very slow as well. Only killing and restarting ns-slapd helped. It is not strictly reproducible; it happens after a while...
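
For completeness, the removal job boils down to something like this simplified sketch (host, bind credentials, and DNs are placeholders, not our real setup):

    #!/bin/bash
    # Delete accounts with up to 10 ldapdelete calls in flight at a time.
    while read -r uid; do
        ldapdelete -x -H ldap://ldap.example.com \
            -D "cn=Directory Manager" -w "$LDAP_PASSWORD" \
            "uid=$uid,ou=People,dc=example,dc=com" &
        # throttle: wait until fewer than 10 background jobs are running
        while [ "$(jobs -rp | wc -l)" -ge 10 ]; do sleep 1; done
    done < uids_to_delete.txt
    wait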

Since we have other problems with this version as well (see "389 DS memory growth"; the versions before worked perfectly for years!), I am thinking about upgrading the whole cluster to a Debian-based system with a more recent version. We also run some Debian 11 based 389 instances and some IPAs in podman with Rocky, and have no problems at all.

Or are there any other hints we could try to work around this strange behaviour on 1.3.11.1-3?

br

Harald


-- 
Harald Strack
Geschäftsführer

ssystems GmbH
Kastanienallee 74
10435 Berlin

-- 
Identity Management Development Team