Can you please check the DS errors log when you start seeing the
err=1 on deletes? I suspect it is a database retry count error,
and the errors log should provide more information on the err=1.
The RI (referential integrity) plugin could also be involved here.
As for err=16, that is "no such attribute": it means whatever was
being removed was already absent from the entry (harmless). It's
the delete/err=1 that we need to investigate.
Another useful piece of information would be to get several stack
traces from 389-ds-base when the problem starts to happen:
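One common recipe for capturing those traces is a gdb loop like the following sketch. It assumes gdb and the matching 389-ds-base debuginfo packages are installed, and that ns-slapd lives in its default location; adjust paths for your system:

```shell
# Capture a few full stack traces of the running ns-slapd process,
# spaced a minute apart so stuck threads stand out across snapshots.
PID=$(pidof ns-slapd)
for i in 1 2 3; do
    gdb -ex 'set confirm off' -ex 'set pagination off' \
        -ex 'thread apply all bt full' -ex quit \
        /usr/sbin/ns-slapd "$PID" > "stacktrace.$i.$(date +%s).txt" 2>&1
    sleep 60
done
```

Run it while the deletes are hanging, not after a restart, so the traces show where the threads are actually blocked.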
Hi,
we checked IOPS and data throughput, and it seems the hypervisor
has some problems:
fio --name TEST --eta-newline=5s --filename=temp.file --rw=randread --size=2g --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 --direct=1 --numjobs=32 --runtime=60 --group_reporting
TEST: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 32 processes
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 32 (f=32): [r(32)][11.7%][r=16KiB/s,w=0KiB/s][r=4,w=0 IOPS][eta 00m:53s]
Jobs: 32 (f=32): [r(32)][21.7%][r=76KiB/s,w=0KiB/s][r=19,w=0 IOPS][eta 00m:47s]
Jobs: 32 (f=32): [r(32)][31.7%][r=504KiB/s,w=0KiB/s][r=126,w=0 IOPS][eta 00m:41s]
Jobs: 32 (f=32): [r(32)][41.7%][r=624KiB/s,w=0KiB/s][r=156,w=0 IOPS][eta 00m:35s]
Jobs: 32 (f=32): [r(32)][51.7%][r=676KiB/s,w=0KiB/s][r=169,w=0 IOPS][eta 00m:29s]
Jobs: 32 (f=32): [r(32)][61.7%][r=412KiB/s,w=0KiB/s][r=103,w=0 IOPS][eta 00m:23s]
Jobs: 32 (f=32): [r(32)][71.7%][r=480KiB/s,w=0KiB/s][r=120,w=0 IOPS][eta 00m:17s]
Jobs: 32 (f=32): [r(32)][81.7%][r=616KiB/s,w=0KiB/s][r=154,w=0 IOPS][eta 00m:11s]
Jobs: 32 (f=32): [r(32)][91.7%][r=620KiB/s,w=0KiB/s][r=155,w=0 IOPS][eta 00m:05s]
Jobs: 32 (f=32): [r(32)][100.0%][r=560KiB/s,w=0KiB/s][r=140,w=0 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=32): err= 0: pid=6733: Thu Nov 16 21:05:31 2023
read: IOPS=105, BW=424KiB/s (434kB/s)(24.9MiB/60199msec)
slat (usec): min=7, max=267470, avg=2490.67, stdev=21228.41
clat (msec): min=2, max=8378, avg=298.93, stdev=506.16
lat (msec): min=2, max=8378, avg=301.42, stdev=505.58
clat percentiles (msec):
| 1.00th=[ 17], 5.00th=[ 31], 10.00th=[ 48], 20.00th=[ 83],
| 30.00th=[ 114], 40.00th=[ 155], 50.00th=[ 203], 60.00th=[ 255],
| 70.00th=[ 321], 80.00th=[ 414], 90.00th=[ 550], 95.00th=[ 693],
| 99.00th=[ 2299], 99.50th=[ 3775], 99.90th=[ 8154], 99.95th=[ 8154],
| 99.99th=[ 8356]
bw ( KiB/s): min= 7, max= 64, per=3.97%, avg=16.78, stdev= 8.89, samples=3013
iops : min= 1, max= 16, avg= 4.13, stdev= 2.24, samples=3013
lat (msec) : 4=0.08%, 10=0.16%, 20=1.74%, 50=8.67%, 100=14.83%
lat (msec) : 250=33.47%, 500=28.07%, 750=8.87%, 1000=1.86%
cpu : usr=0.01%, sys=0.03%, ctx=6547, majf=0, minf=1167
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6381,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=424KiB/s (434kB/s), 424KiB/s-424KiB/s (434kB/s-434kB/s), io=24.9MiB (26.1MB), run=60199-60199msec
Disk stats (read/write):
dm-0: ios=6382/2864, merge=0/0, ticks=1918969/5453392, in_queue=7375505, util=100.00%, aggrios=6381/2792, aggrmerge=1/73, aggrticks=1907132/5325725, aggrin_queue=7233461, aggrutil=99.98%
sda: ios=6381/2792, merge=1/73, ticks=1907132/5325725, in_queue=7233461, util=99.98%
Only 105 IOPS and a few KB/s of I/O, so this does not seem to be
a 389 problem.
On 16.11.23 20:33, Harald Strack wrote:
Hi,
since we updated to the latest CentOS 7 Version
# rpm -qa | grep 389
389-ds-base-libs-1.3.11.1-3.el7_9.x86_64
389-ds-base-devel-1.3.11.1-3.el7_9.x86_64
389-ds-base-snmp-1.3.11.1-3.el7_9.x86_64
389-adminutil-devel-1.1.22-2.el7.x86_64
389-adminutil-1.1.22-2.el7.x86_64
389-ds-base-1.3.11.1-3.el7_9.x86_64
389-admin-1.1.46-4.el7.x86_64
# uname -r
3.10.0-1160.11.1.el7.x86_64
We are experiencing strange locking (?) behaviour: we have a
synchronisation job that tried to delete about 1300 accounts,
always 10 in parallel, using some simple forking Perl / shell
scripts. The pstree looks like this:
auto_sync.pl(21776)───bash(21777)───perl(21798)─┬─perl(21951)───sh(30340)───ldap_remove_use(30341)───ldapremove(30992)───ldapdelete(7489)
├─perl(22691)───sh(1015)───ldap_remove_use(1016)───ldapremove(1687)───ldapdelete(7474)
├─perl(23474)───sh(4344)───ldap_remove_use(4345)───ldapremove(5037)───ldapdelete(7453)
├─perl(24243)───sh(2113)───ldap_remove_use(2114)───ldapremove(2775)───ldapdelete(7528)
├─perl(24979)───sh(29293)───ldap_remove_use(29294)───ldapremove(29943)───ldapdelete(7514)
├─perl(25718)───sh(3190)───ldap_remove_use(3191)───ldapremove(3912)───ldapdelete(7539)
├─perl(26456)───sh(32437)───ldap_remove_use(32438)───ldapremove(624)───ldapdelete(7468)
├─perl(27193)───sh(5442)───ldap_remove_use(5443)───ldapremove(6154)───ldapdelete(7553)
├─perl(27943)───sh(7937)───ldap_remove_use(7938)───ldapremove(8598)───ldapmodify(8683)
└─perl(28681)───sh(6549)───ldap_remove_use(6550)───ldapremove(7546)───ldapmodify(7637)
So we run 10 ldapmodify / ldapdelete calls at nearly the same
time, and the server does not do anything. After 100 to 400
seconds it returns an error:
[16/Nov/2023:17:41:52.285305117 +0100] conn=512226 op=86 MOD dn="cn=group,ou=...."
[16/Nov/2023:17:43:32.599278565 +0100] conn=512226 op=86 RESULT err=16 tag=103 nentries=0 wtime=0.000075793 optime=100.313978009 etime=100.314051783 csn=655646bb000517e90000
[16/Nov/2023:17:11:43.331110511 +0100] conn=509941 op=2 DEL dn="uid=testuser,ou=People,dc=..."
[16/Nov/2023:17:18:24.325913462 +0100] conn=509941 op=2 RESULT err=1 tag=107 nentries=0 wtime=0.000228257 optime=400.994827179 etime=400.995050827 csn=65564073000017e90000
[16/Nov/2023:17:18:24.326834055 +0100] conn=509941 op=3 UNBIND
causing the client to report:
ldap_delete: Operations error (1)
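Slow operations like these can be pulled out of the access log with a short awk filter. A sketch, assuming the default log location (the instance path is a placeholder; adjust it for your setup):

```shell
# Print RESULT lines from the access log whose etime exceeds
# 10 seconds. The log path below is the default and may differ.
awk '/ RESULT / {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^etime=/ && substr($i, 7) + 0 > 10)
            print
}' /var/log/dirsrv/slapd-*/access
```

This makes it easy to see whether the long operations are isolated to deletes or spread across all operation types.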
Some modifies did work, but they were very slow as well. Only
killing and restarting ns-slapd helped. It is not strictly
reproducible; it happens after a while...
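For anyone trying to reproduce this, the client-side pattern boils down to roughly the following sketch (the server URI, bind credentials file, and the dns.txt DN list are placeholders, not our actual setup):

```shell
# Fan out up to 10 concurrent ldapdelete workers, one DN per
# invocation, reading DNs from dns.txt (one DN per line).
xargs -a dns.txt -d '\n' -n 1 -P 10 \
    ldapdelete -H ldap://ldap.example.com \
        -D "cn=Directory Manager" -y /root/pw.txt
```

The real jobs use forking Perl / shell wrappers, but the concurrency profile (10 parallel delete connections) is the same.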
Since we have other problems with this version as well (see
"389 DS memory growth"), and the versions before worked
perfectly for years, I am thinking about upgrading the whole
cluster to a Debian-based system with a more recent version.
We also run some Debian 11 based 389 servers and some IPAs in
podman with Rocky, and have no problems at all.
Or are there any other hints we could try to work around this
strange behaviour on 1.3.11.1-3?
br
Harald
--
Harald Strack
Geschäftsführer
ssystems GmbH
Kastanienallee 74
10435 Berlin
--
_______________________________________________
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue