Hi Harold,
Can you please check the DS errors log when you start seeing the err=1
on deletes? I suspect it is a database retry count error, and the
errors log should provide more information on the err=1. The RI
(referential integrity) plugin could also be involved here. Note that
err=16 is "no such attribute": the modify tried to remove a value that
was already absent from the entry, which is harmless. It's the
delete/err=1 that we need to investigate.
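As a sketch of what to look for, the errors log can be scanned for retry and
deadlock messages like this (the path assumes a default instance layout, so
adjust slapd-INSTANCE as needed):

```shell
# Sketch: look for database retry / deadlock messages in the DS errors log.
# The glob assumes the default /var/log/dirsrv/slapd-<instance>/ layout.
grep -iE 'retry|deadlock|DB_LOCK' /var/log/dirsrv/slapd-*/errors | tail -n 50
```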
Another useful piece of information would be to get several stack traces
from 389-ds-base when the problem starts to happen:
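A sketch of how such traces could be captured with gdb, run while the deletes
are hanging (assumes gdb and the matching 389-ds-base debuginfo packages are
installed; file names and the 10s interval are arbitrary choices):

```shell
# Sketch: capture three full-thread backtraces of ns-slapd, ~10s apart.
# Requires gdb and the 389-ds-base debuginfo packages.
PID=$(pidof ns-slapd)
for i in 1 2 3; do
    gdb -batch -ex 'thread apply all bt full' /usr/sbin/ns-slapd "$PID" \
        > "stacktrace.$i.txt" 2>&1
    sleep 10
done
```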
Hi,
we checked IOPS and data throughput and it seems the hypervisor
has some problems:
fio --name TEST --eta-newline=5s --filename=temp.file --rw=randread --size=2g \
    --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 \
    --direct=1 --numjobs=32 --runtime=60 --group_reporting
TEST: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B,
ioengine=libaio, iodepth=1
...
fio-3.7
Starting 32 processes
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 32 (f=32): [r(32)][11.7%][r=16KiB/s,w=0KiB/s][r=4,w=0 IOPS][eta 00m:53s]
Jobs: 32 (f=32): [r(32)][21.7%][r=76KiB/s,w=0KiB/s][r=19,w=0 IOPS][eta 00m:47s]
Jobs: 32 (f=32): [r(32)][31.7%][r=504KiB/s,w=0KiB/s][r=126,w=0 IOPS][eta 00m:41s]
Jobs: 32 (f=32): [r(32)][41.7%][r=624KiB/s,w=0KiB/s][r=156,w=0 IOPS][eta 00m:35s]
Jobs: 32 (f=32): [r(32)][51.7%][r=676KiB/s,w=0KiB/s][r=169,w=0 IOPS][eta 00m:29s]
Jobs: 32 (f=32): [r(32)][61.7%][r=412KiB/s,w=0KiB/s][r=103,w=0 IOPS][eta 00m:23s]
Jobs: 32 (f=32): [r(32)][71.7%][r=480KiB/s,w=0KiB/s][r=120,w=0 IOPS][eta 00m:17s]
Jobs: 32 (f=32): [r(32)][81.7%][r=616KiB/s,w=0KiB/s][r=154,w=0 IOPS][eta 00m:11s]
Jobs: 32 (f=32): [r(32)][91.7%][r=620KiB/s,w=0KiB/s][r=155,w=0 IOPS][eta 00m:05s]
Jobs: 32 (f=32): [r(32)][100.0%][r=560KiB/s,w=0KiB/s][r=140,w=0 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=32): err= 0: pid=6733: Thu Nov 16 21:05:31 2023
read: IOPS=105, BW=424KiB/s (434kB/s)(24.9MiB/60199msec)
slat (usec): min=7, max=267470, avg=2490.67, stdev=21228.41
clat (msec): min=2, max=8378, avg=298.93, stdev=506.16
lat (msec): min=2, max=8378, avg=301.42, stdev=505.58
clat percentiles (msec):
| 1.00th=[ 17], 5.00th=[ 31], 10.00th=[ 48], 20.00th=[ 83],
| 30.00th=[ 114], 40.00th=[ 155], 50.00th=[ 203], 60.00th=[ 255],
| 70.00th=[ 321], 80.00th=[ 414], 90.00th=[ 550], 95.00th=[ 693],
| 99.00th=[ 2299], 99.50th=[ 3775], 99.90th=[ 8154], 99.95th=[ 8154],
| 99.99th=[ 8356]
bw ( KiB/s): min= 7, max= 64, per=3.97%, avg=16.78, stdev= 8.89, samples=3013
iops : min= 1, max= 16, avg= 4.13, stdev= 2.24, samples=3013
lat (msec) : 4=0.08%, 10=0.16%, 20=1.74%, 50=8.67%, 100=14.83%
lat (msec) : 250=33.47%, 500=28.07%, 750=8.87%, 1000=1.86%
cpu : usr=0.01%, sys=0.03%, ctx=6547, majf=0, minf=1167
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6381,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=424KiB/s (434kB/s), 424KiB/s-424KiB/s (434kB/s-434kB/s), io=24.9MiB (26.1MB), run=60199-60199msec
Disk stats (read/write):
dm-0: ios=6382/2864, merge=0/0, ticks=1918969/5453392, in_queue=7375505, util=100.00%, aggrios=6381/2792, aggrmerge=1/73, aggrticks=1907132/5325725, aggrin_queue=7233461, aggrutil=99.98%
sda: ios=6381/2792, merge=1/73, ticks=1907132/5325725, in_queue=7233461, util=99.98%
Only 105 IOPS and a few hundred KiB/s of throughput, so it does not seem to be a 389 problem.
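One caveat worth noting: the run above is randread, while the hanging deletes
are bounded by fsync-heavy writes. A matching randwrite pass (a sketch; the
lower numjobs is an arbitrary choice) would show the commit latency that
slapd's transactions actually depend on:

```shell
# Sketch: same parameters as the read test, but random writes with fsync,
# which is closer to what slapd's transaction commits depend on.
fio --name TEST-W --eta-newline=5s --filename=temp.file --rw=randwrite \
    --size=2g --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 \
    --iodepth=1 --direct=1 --numjobs=4 --runtime=60 --group_reporting
```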
On 16.11.23 20:33, Harald Strack wrote:
>
> Hi,
>
> since we updated to the latest CentOS 7 Version
>
> # rpm -qa | grep 389
> 389-ds-base-libs-1.3.11.1-3.el7_9.x86_64
> 389-ds-base-devel-1.3.11.1-3.el7_9.x86_64
> 389-ds-base-snmp-1.3.11.1-3.el7_9.x86_64
> 389-adminutil-devel-1.1.22-2.el7.x86_64
> 389-adminutil-1.1.22-2.el7.x86_64
> 389-ds-base-1.3.11.1-3.el7_9.x86_64
> 389-admin-1.1.46-4.el7.x86_64
>
> # uname -r
>
> 3.10.0-1160.11.1.el7.x86_64
>
>
> We experience strange locking (?) behaviour: we have a
> synchronisation job that tries to delete about 1300 accounts, always
> 10 in parallel, using some simple forking perl / shell scripts. pstree
> looks like this:
>
>
>
> auto_sync.pl(21776)───bash(21777)───perl(21798)─┬─perl(21951)───sh(30340)───ldap_remove_use(30341)───ldapremove(30992)───ldapdelete(7489)
>                                                 ├─perl(22691)───sh(1015)───ldap_remove_use(1016)───ldapremove(1687)───ldapdelete(7474)
>                                                 ├─perl(23474)───sh(4344)───ldap_remove_use(4345)───ldapremove(5037)───ldapdelete(7453)
>                                                 ├─perl(24243)───sh(2113)───ldap_remove_use(2114)───ldapremove(2775)───ldapdelete(7528)
>                                                 ├─perl(24979)───sh(29293)───ldap_remove_use(29294)───ldapremove(29943)───ldapdelete(7514)
>                                                 ├─perl(25718)───sh(3190)───ldap_remove_use(3191)───ldapremove(3912)───ldapdelete(7539)
>                                                 ├─perl(26456)───sh(32437)───ldap_remove_use(32438)───ldapremove(624)───ldapdelete(7468)
>                                                 ├─perl(27193)───sh(5442)───ldap_remove_use(5443)───ldapremove(6154)───ldapdelete(7553)
>                                                 ├─perl(27943)───sh(7937)───ldap_remove_use(7938)───ldapremove(8598)───ldapmodify(8683)
>                                                 └─perl(28681)───sh(6549)───ldap_remove_use(6550)───ldapremove(7546)───ldapmodify(7637)
>
>
>
> So we run 10 ldapmodify / ldapdelete calls nearly at the same time
> and the server does not do anything. After 100s to 400s it returns an
> error:
>
>
> [16/Nov/2023:17:41:52.285305117 +0100] conn=512226 op=86 MOD dn="cn=group,ou=...."
> [16/Nov/2023:17:43:32.599278565 +0100] conn=512226 op=86 RESULT err=16 tag=103 nentries=0 wtime=0.000075793 optime=100.313978009 etime=100.314051783 csn=655646bb000517e90000
>
> [16/Nov/2023:17:11:43.331110511 +0100] conn=509941 op=2 DEL dn="uid=testuser,ou=People,dc=..."
> [16/Nov/2023:17:18:24.325913462 +0100] conn=509941 op=2 RESULT err=1 tag=107 nentries=0 wtime=0.000228257 optime=400.994827179 etime=400.995050827 csn=65564073000017e90000
> [16/Nov/2023:17:18:24.326834055 +0100] conn=509941 op=3 UNBIND
>
> causing the client
>
> ldap_delete: Operations error (1)
>
> Some modifies did work, but very slowly as well. Only killing and
> restarting ns-slapd helped. It is not strictly reproducible; it
> happens after a while...
>
> Since we have other problems (see "389 DS memory growth") with this
> version as well (the versions before worked perfectly for years!), I am
> thinking about upgrading the whole cluster to a Debian-based system
> with a more recent version. We also run some Debian 11 based 389
> servers and some IPAs in podman with Rocky and have no problems at all.
>
> Or are there any other hints we could try to work around this strange
> behaviour on 1.3.11.1-3?
>
> br
>
> Harald
>
>
--
Harald Strack
Geschäftsführer
ssystems GmbH
Kastanienallee 74
10435 Berlin
--
_______________________________________________
389-devel mailing list -- 389-devel(a)lists.fedoraproject.org
To unsubscribe send an email to 389-devel-leave(a)lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-devel@lists.fe...
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue