Hi, I need assistance in understanding what kind of processes happen behind the scenes of
a 389ds server in my case, so that I can come up with better approaches for troubleshooting
further problems.
In our environment we are running FreeIPA instances in Docker, but neither the FreeIPA part
nor the Docker part actually matters that much, because the ns-slapd service is the main
actor in the "problem" I faced.
Prerequisites
-------------
Host has 4 CPUs and 8 GB RAM
FreeIPA v.4.9.8
ns-slapd -v
389 Project
389-Directory/1.4.3.28 B2022.074.1358
-------------
Most of the time the host does not experience any kind of load, though RAM usage (talking
about "used") can grow linearly over time, which by itself makes no odds. After some moment
(now I can say with some certainty that it is because buff/cache grows) the host begins to
experience heavy disk write load, which I notice through CPU iowait alerts.
cpu iowait
https://imgur.com/ifXPX5N
disk IO and latency
https://imgur.com/HyT0cNR
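For reference, the paging and write activity can also be observed directly on the host. A
minimal sketch of what I used to correlate the alerts (assuming the sysstat package is
installed):
# memory/swap/io summary every 5 seconds; the "wa" column is iowait
vmstat 5
# per-device write throughput, latency (await) and utilization
iostat -x 5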
The increase in load is not due to external causes (an increase in LDAP queries). Actually,
"page out" operations were increasing.
https://imgur.com/OxofSD9
and the process exporter also showed increased major page faults, which served as an
indicator of memory problems (a memory shortage, though? there was plenty of RAM)
https://imgur.com/xt2p933
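The fault counters can be cross-checked on the host without the exporter; a rough sketch
(sar is part of sysstat, the ps format specifiers are from procps):
# system-wide paging statistics, including majflt/s, every 5 seconds
sar -B 5
# cumulative minor/major page fault counters of the ns-slapd process
ps -o pid,min_flt,maj_flt,cmd -p $(pidof ns-slapd)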
Memory status during the problem
--------------------------------
Tasks: 175 total, 1 running, 174 sleeping, 0 stopped, 0 zombie
%CPU(s): 0.7 us, 0.7 sy, 0.0 ni, 29.1 id, 69.4 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7957.3 total, 414.3 free, 2332.7 used, 5210.3 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 5300.7 avail Mem
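For completeness, the snapshot above is plain top output; the per-process memory of
ns-slapd can be pulled like this if it is useful:
# resident set size and swap usage of the directory server process
grep -E 'VmRSS|VmSwap' /proc/$(pidof ns-slapd)/status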
I tried some other tools in the hope that they would point at what was wrong. I didn't
succeed, but I will still include their output here for the full picture.
Here is the output of ext4slower
https://github.com/iovisor/bcc/blob/master/tools/ext4slower_example.txt
which shows that fsync syscalls issued by ns-slapd are taking too long:
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
07:00:02 ns-slapd 4017 S 0 0 1455.28 log.0000006661
07:00:10 ns-slapd 4017 S 0 0 1133.62 log.0000006661
07:00:11 ns-slapd 4017 S 0 0 1248.36 log.0000006661
07:01:02 ns-slapd 4017 S 0 0 736.46 log.0000006661
07:01:05 ns-slapd 4017 S 0 0 1541.29 access
07:01:07 ns-slapd 4017 S 0 0 1220.32 log.0000006661
07:01:08 ns-slapd 4017 S 0 0 832.87 log.0000006661
07:01:08 ns-slapd 4017 S 0 0 93.07 entryusn.db
07:01:09 ns-slapd 4017 S 0 0 418.88 description.db
07:01:10 ns-slapd 4017 S 0 0 822.64 nsuniqueid.db
07:01:10 ns-slapd 4017 S 0 0 417.25 displayname.db
07:01:10 ns-slapd 4017 S 0 0 1493.60 log.0000006661
07:01:11 ns-slapd 4017 S 0 0 617.68 memberHost.db
07:01:12 ns-slapd 4017 S 0 0 1684.12 log.0000006661
07:01:12 ns-slapd 4017 S 0 0 434.58 parentid.db
07:01:12 ns-slapd 4017 S 0 0 601.09 accessRuleType.db
07:01:13 ns-slapd 4017 S 0 0 1051.10 memberUser.db
07:01:14 ns-slapd 4017 S 0 0 1056.28 ipakrbprincipalalias.db
07:01:14 ns-slapd 4017 S 0 0 1803.00 log.0000006661
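(As far as I understand, the log.NNNNNNNNNN files here are Berkeley DB transaction logs
living in the 389ds database directory.) For reproducibility, the trace above was collected
roughly like this; the tools path is the usual bcc-tools location on my distro and may
differ on yours:
# trace ext4 reads/writes/opens/fsyncs slower than 100 ms
/usr/share/bcc/tools/ext4slower 100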
And here is another output, from cachetop
https://github.com/iovisor/bcc/blob/master/tools/cachetop_example.txt
I am honestly not sure how to interpret these results, but maybe they will be helpful:
07:15:30 Buffers MB: 97 / Cached MB: 4212 / Sort: HITS / Order: descending
PID UID CMD HITS MISSES DIRTIES READ_HIT% WRITE_HIT%
4017 UNKNOWN( ns-slapd 7259 7177 7177 0.6% 0.0%
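For reference, cachetop is interactive (top-style) and was run roughly like this, with the
same caveat about the path:
# per-process page cache hit/miss counters, refreshed every 5 seconds
/usr/share/bcc/tools/cachetop 5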
So, in my attempts to figure out a solution, I tried restarting the FreeIPA container, but
it had no effect. In the end, I stopped the container, dropped all memory caches (sync;
echo 3 > /proc/sys/vm/drop_caches), and started the container again, and only after that
did the excessive paging out (and disk IO) go away.
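For completeness, the exact sequence that finally helped was roughly the following (the
container name is illustrative):
docker stop freeipa
# flush dirty pages to disk, then drop page cache, dentries and inodes
sync; echo 3 > /proc/sys/vm/drop_caches
docker start freeipa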
So my questions are:
1) What kind of processes behind the scenes of a 389ds server could cause such behavior?
2) Is it (kind of) normal?
3) What actions would you suggest to mitigate such behavior? Would the LDAP metrics in
cn=monitor have helped to determine that something was wrong? (Unfortunately they are not
currently monitored, but they will be; see the query sketch at the end of this message.)
4) Generally, what else would you suggest monitoring? Perhaps some metrics that would be
of help but are not very obvious.
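For question 3, this is roughly how I plan to query the monitor entries once monitoring is
in place. The DNs are the standard 389ds server and ldbm database monitor entries; the host
and bind DN are illustrative, and the cache attributes assume the BDB backend:
# server-level counters: connections, operations, threads
ldapsearch -H ldap://localhost -D "cn=Directory Manager" -W -b "cn=monitor" -s base
# database cache counters, including page-in/page-out activity
ldapsearch -H ldap://localhost -D "cn=Directory Manager" -W \
  -b "cn=monitor,cn=ldbm database,cn=plugins,cn=config" -s base \
  dbcachehitratio dbcachepagein dbcachepageout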