Hi,
we checked IOPS and data throughput, and it seems the hypervisor has
some problems:
fio --name TEST --eta-newline=5s --filename=temp.file --rw=randread --size=2g \
    --io_size=10g --blocksize=4k --ioengine=libaio --fsync=1 --iodepth=1 \
    --direct=1 --numjobs=32 --runtime=60 --group_reporting
TEST: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B,
ioengine=libaio, iodepth=1
...
fio-3.7
Starting 32 processes
TEST: Laying out IO file (1 file / 2048MiB)
Jobs: 32 (f=32): [r(32)][11.7%][r=16KiB/s,w=0KiB/s][r=4,w=0 IOPS][eta 00m:53s]
Jobs: 32 (f=32): [r(32)][21.7%][r=76KiB/s,w=0KiB/s][r=19,w=0 IOPS][eta 00m:47s]
Jobs: 32 (f=32): [r(32)][31.7%][r=504KiB/s,w=0KiB/s][r=126,w=0 IOPS][eta 00m:41s]
Jobs: 32 (f=32): [r(32)][41.7%][r=624KiB/s,w=0KiB/s][r=156,w=0 IOPS][eta 00m:35s]
Jobs: 32 (f=32): [r(32)][51.7%][r=676KiB/s,w=0KiB/s][r=169,w=0 IOPS][eta 00m:29s]
Jobs: 32 (f=32): [r(32)][61.7%][r=412KiB/s,w=0KiB/s][r=103,w=0 IOPS][eta 00m:23s]
Jobs: 32 (f=32): [r(32)][71.7%][r=480KiB/s,w=0KiB/s][r=120,w=0 IOPS][eta 00m:17s]
Jobs: 32 (f=32): [r(32)][81.7%][r=616KiB/s,w=0KiB/s][r=154,w=0 IOPS][eta 00m:11s]
Jobs: 32 (f=32): [r(32)][91.7%][r=620KiB/s,w=0KiB/s][r=155,w=0 IOPS][eta 00m:05s]
Jobs: 32 (f=32): [r(32)][100.0%][r=560KiB/s,w=0KiB/s][r=140,w=0 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=32): err= 0: pid=6733: Thu Nov 16 21:05:31 2023
read: IOPS=105, BW=424KiB/s (434kB/s)(24.9MiB/60199msec)
slat (usec): min=7, max=267470, avg=2490.67, stdev=21228.41
clat (msec): min=2, max=8378, avg=298.93, stdev=506.16
lat (msec): min=2, max=8378, avg=301.42, stdev=505.58
clat percentiles (msec):
| 1.00th=[ 17], 5.00th=[ 31], 10.00th=[ 48], 20.00th=[ 83],
| 30.00th=[ 114], 40.00th=[ 155], 50.00th=[ 203], 60.00th=[ 255],
| 70.00th=[ 321], 80.00th=[ 414], 90.00th=[ 550], 95.00th=[ 693],
| 99.00th=[ 2299], 99.50th=[ 3775], 99.90th=[ 8154], 99.95th=[ 8154],
| 99.99th=[ 8356]
bw ( KiB/s): min= 7, max= 64, per=3.97%, avg=16.78, stdev= 8.89, samples=3013
iops : min= 1, max= 16, avg= 4.13, stdev= 2.24, samples=3013
lat (msec) : 4=0.08%, 10=0.16%, 20=1.74%, 50=8.67%, 100=14.83%
lat (msec) : 250=33.47%, 500=28.07%, 750=8.87%, 1000=1.86%
cpu : usr=0.01%, sys=0.03%, ctx=6547, majf=0, minf=1167
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6381,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=424KiB/s (434kB/s), 424KiB/s-424KiB/s (434kB/s-434kB/s), io=24.9MiB (26.1MB), run=60199-60199msec
Disk stats (read/write):
dm-0: ios=6382/2864, merge=0/0, ticks=1918969/5453392, in_queue=7375505, util=100.00%, aggrios=6381/2792, aggrmerge=1/73, aggrticks=1907132/5325725, aggrin_queue=7233461, aggrutil=99.98%
sda: ios=6381/2792, merge=1/73, ticks=1907132/5325725, in_queue=7233461, util=99.98%
Only 105 IOPS and a few hundred KiB/s of throughput. So this does not seem to be a 389 problem.
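As a quick cross-check (a minimal sketch; the file path is a placeholder, and dd's synchronous 4 KiB writes are only a rough stand-in for fio's randread workload), something like this on the affected filesystem should also crawl if the hypervisor storage is the bottleneck:

```shell
# rough per-I/O latency check: 256 synchronous 4 KiB writes.
# On healthy storage this finishes in well under a second; with
# latencies like the fio run above it should take much longer.
# /tmp/dd-latency-test.bin is a placeholder path.
dd if=/dev/zero of=/tmp/dd-latency-test.bin bs=4k count=256 oflag=dsync
rm -f /tmp/dd-latency-test.bin
```

dd reports the effective throughput on stderr, which gives a second data point independent of fio.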
On 16.11.23 20:33, Harald Strack wrote:
Hi,
since we updated to the latest CentOS 7 version:
# rpm -qa | grep 389
389-ds-base-libs-1.3.11.1-3.el7_9.x86_64
389-ds-base-devel-1.3.11.1-3.el7_9.x86_64
389-ds-base-snmp-1.3.11.1-3.el7_9.x86_64
389-adminutil-devel-1.1.22-2.el7.x86_64
389-adminutil-1.1.22-2.el7.x86_64
389-ds-base-1.3.11.1-3.el7_9.x86_64
389-admin-1.1.46-4.el7.x86_64
# uname -r
3.10.0-1160.11.1.el7.x86_64
We are seeing strange locking (?) behaviour: we have a synchronisation
job that tried to delete about 1300 accounts, always 10 in parallel,
using some simple forking Perl / shell scripts. The pstree looks like this:
auto_sync.pl(21776)───bash(21777)───perl(21798)─┬─perl(21951)───sh(30340)───ldap_remove_use(30341)───ldapremove(30992)───ldapdelete(7489)
├─perl(22691)───sh(1015)───ldap_remove_use(1016)───ldapremove(1687)───ldapdelete(7474)
├─perl(23474)───sh(4344)───ldap_remove_use(4345)───ldapremove(5037)───ldapdelete(7453)
├─perl(24243)───sh(2113)───ldap_remove_use(2114)───ldapremove(2775)───ldapdelete(7528)
├─perl(24979)───sh(29293)───ldap_remove_use(29294)───ldapremove(29943)───ldapdelete(7514)
├─perl(25718)───sh(3190)───ldap_remove_use(3191)───ldapremove(3912)───ldapdelete(7539)
├─perl(26456)───sh(32437)───ldap_remove_use(32438)───ldapremove(624)───ldapdelete(7468)
├─perl(27193)───sh(5442)───ldap_remove_use(5443)───ldapremove(6154)───ldapdelete(7553)
├─perl(27943)───sh(7937)───ldap_remove_use(7938)───ldapremove(8598)───ldapmodify(8683)
└─perl(28681)───sh(6549)───ldap_remove_use(6550)───ldapremove(7546)───ldapmodify(7637)
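For reference, the parallel part of the sync job amounts to something like the following sketch (the uid list, the DNs, and the use of echo in place of the real ldapdelete call are all placeholders so the snippet runs anywhere):

```shell
# placeholder uid list standing in for the real sync data
printf 'user%03d\n' $(seq 1 13) > /tmp/uids.txt

# run up to 10 deletes in parallel; echo stands in for the actual
# ldapdelete invocation against the directory server
xargs -a /tmp/uids.txt -P 10 -I{} \
    echo ldapdelete "uid={},ou=People,dc=example,dc=com" > /tmp/deletes.out

wc -l < /tmp/deletes.out   # one line per simulated delete
```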
So we run 10 ldapmodify / ldapdelete calls nearly at the same time and
the server does not do anything. After 100s to 400s it returns an error:
[16/Nov/2023:17:41:52.285305117 +0100] conn=512226 op=86 MOD
dn="cn=group,ou=...."
[16/Nov/2023:17:43:32.599278565 +0100] conn=512226 op=86 RESULT err=16 tag=103 nentries=0
wtime=0.000075793 optime=100.313978009 etime=100.314051783 csn=655646bb000517e90000
[16/Nov/2023:17:11:43.331110511 +0100] conn=509941 op=2 DEL
dn="uid=testuser,ou=People,dc=..."
[16/Nov/2023:17:18:24.325913462 +0100] conn=509941 op=2 RESULT err=1 tag=107 nentries=0
wtime=0.000228257 optime=400.994827179 etime=400.995050827 csn=65564073000017e90000
[16/Nov/2023:17:18:24.326834055 +0100] conn=509941 op=3 UNBIND
causing the client to report:
ldap_delete: Operations error (1)
Some modifies did work, but they were very slow too. Only killing and
restarting ns-slapd helped. It's not strictly reproducible; it happens
after a while...
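To get an overview of how widespread the stalls are, one could filter the access log for slow operations by their etime, e.g. (the real log lives under /var/log/dirsrv/slapd-&lt;instance&gt;/access on CentOS 7; the sample here reuses the two RESULT lines above so the snippet is self-contained):

```shell
# sample input: the two RESULT lines from the log excerpt above
cat > /tmp/access.sample <<'EOF'
[16/Nov/2023:17:43:32.599278565 +0100] conn=512226 op=86 RESULT err=16 tag=103 nentries=0 wtime=0.000075793 optime=100.313978009 etime=100.314051783 csn=655646bb000517e90000
[16/Nov/2023:17:18:24.325913462 +0100] conn=509941 op=2 RESULT err=1 tag=107 nentries=0 wtime=0.000228257 optime=400.994827179 etime=400.995050827 csn=65564073000017e90000
EOF

# print RESULT lines whose etime exceeds 10 seconds
awk -F'etime=' '/RESULT/ && ($2 + 0) > 10 {print}' /tmp/access.sample
```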
Since we have other problems with this version as well (see "389 DS
memory growth"; the versions before worked perfectly for years!), I am
thinking about migrating the whole cluster to a Debian-based system
with a more recent 389 version. We also run some Debian 11 based 389
instances and some IPAs in podman on Rocky, with no problems at all.
Or are there any other hints we could try to work around this
strange behaviour on 1.3.11.1-3?
br
Harald
--
Harald Strack
Geschäftsführer
ssystems GmbH
Kastanienallee 74
10435 Berlin