Hi, I need assistance in understanding what kind of processes happen behind the scenes of
a 389ds server in my case, so that I can come up with better approaches for troubleshooting
further problems.
In our environment we are running FreeIPA instances in Docker, but neither the FreeIPA part
nor the Docker part actually matters that much, because the ns-slapd service is the main
actor in the "problem" I faced.
Prerequisites
-------------
Host has 4 CPUs and 8 GB RAM
FreeIPA v.4.9.8
ns-slapd -v
389 Project
389-Directory/1.4.3.28 B2022.074.1358
-------------
Most of the time the host does not experience any kind of load, though RAM usage (talking
about "used") can grow linearly over time, which by itself makes no odds. After some moment
(now I can say with some certainty that it is because buff/cache grows) the host begins to
experience heavy disk write load, which I notice through CPU iowait alerts.
cpu iowait
https://imgur.com/ifXPX5N
disk IO and latency
https://imgur.com/HyT0cNR
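For reference, the paging and write activity can also be observed directly on the host. A
minimal sketch of what I used to correlate the alerts (assuming the sysstat package is
installed):
# memory/swap/io summary every 5 seconds; the "wa" column is iowait
vmstat 5
# per-device write throughput, latency (await) and utilization
iostat -x 5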
The increase in load is not due to external causes (an increase in LDAP queries). Actually,
"page out" operations were increasing.
https://imgur.com/OxofSD9
and the process exporter also showed increased major page faults, which served as an
indicator of memory problems (a memory shortage, though? there was plenty of RAM)
https://imgur.com/xt2p933
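The fault counters can be cross-checked on the host without the exporter; a rough sketch
(sar is part of sysstat, the ps format specifiers are from procps):
# system-wide paging statistics, including majflt/s, every 5 seconds
sar -B 5
# cumulative minor/major page fault counters of the ns-slapd process
ps -o pid,min_flt,maj_flt,cmd -p $(pidof ns-slapd)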
Memory status during the problem
--------------------------------
Tasks: 175 total, 1 running, 174 sleeping, 0 stopped, 0 zombie
%CPU(s): 0.7 us, 0.7 sy, 0.0 ni, 29.1 id, 69.4 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7957.3 total, 414.3 free, 2332.7 used, 5210.3 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 5300.7 avail Mem
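For completeness, the snapshot above is plain top output; the per-process memory of
ns-slapd can be pulled like this if it is useful:
# resident set size and swap usage of the directory server process
grep -E 'VmRSS|VmSwap' /proc/$(pidof ns-slapd)/status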
I tried some other tools in the hope that they would point at what was wrong. I didn't
succeed, but I will still include their output here for the full picture.
Here is the output of ext4slower
https://github.com/iovisor/bcc/blob/master/tools/ext4slower_example.txt
which shows that fsync syscalls issued by ns-slapd are taking too long:
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
07:00:02 ns-slapd 4017 S 0 0 1455.28 log.0000006661
07:00:10 ns-slapd 4017 S 0 0 1133.62 log.0000006661
07:00:11 ns-slapd 4017 S 0 0 1248.36 log.0000006661
07:01:02 ns-slapd 4017 S 0 0 736.46 log.0000006661
07:01:05 ns-slapd 4017 S 0 0 1541.29 access
07:01:07 ns-slapd 4017 S 0 0 1220.32 log.0000006661
07:01:08 ns-slapd 4017 S 0 0 832.87 log.0000006661
07:01:08 ns-slapd 4017 S 0 0 93.07 entryusn.db
07:01:09 ns-slapd 4017 S 0 0 418.88 description.db
07:01:10 ns-slapd 4017 S 0 0 822.64 nsuniqueid.db
07:01:10 ns-slapd 4017 S 0 0 417.25 displayname.db
07:01:10 ns-slapd 4017 S 0 0 1493.60 log.0000006661
07:01:11 ns-slapd 4017 S 0 0 617.68 memberHost.db
07:01:12 ns-slapd 4017 S 0 0 1684.12 log.0000006661
07:01:12 ns-slapd 4017 S 0 0 434.58 parentid.db
07:01:12 ns-slapd 4017 S 0 0 601.09 accessRuleType.db
07:01:13 ns-slapd 4017 S 0 0 1051.10 memberUser.db
07:01:14 ns-slapd 4017 S 0 0 1056.28 ipakrbprincipalalias.db
07:01:14 ns-slapd 4017 S 0 0 1803.00 log.0000006661
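(As far as I understand, the log.NNNNNNNNNN files here are Berkeley DB transaction logs
living in the 389ds database directory.) For reproducibility, the trace above was collected
roughly like this; the tools path is the usual bcc-tools location on my distro and may
differ on yours:
# trace ext4 reads/writes/opens/fsyncs slower than 100 ms
/usr/share/bcc/tools/ext4slower 100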
And here is another output, from cachetop
https://github.com/iovisor/bcc/blob/master/tools/cachetop_example.txt
I am honestly not sure how to interpret these results, but maybe they will be helpful:
07:15:30 Buffers MB: 97 / Cached MB: 4212 / Sort: HITS / Order: descending
PID UID CMD HITS MISSES DIRTIES READ_HIT% WRITE_HIT%
4017 UNKNOWN( ns-slapd 7259 7177 7177 0.6% 0.0%
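For reference, cachetop is interactive (top-style) and was run roughly like this, with the
same caveat about the path:
# per-process page cache hit/miss counters, refreshed every 5 seconds
/usr/share/bcc/tools/cachetop 5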
So, in my attempts to figure out a solution, I tried restarting the FreeIPA container, but
it had no effect. In the end, I stopped the container, dropped all memory caches (sync;
echo 3 > /proc/sys/vm/drop_caches), and started the container again, and only after that
did the excessive paging out (and disk IO) go away.
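For completeness, the exact sequence that finally helped was roughly the following (the
container name is illustrative):
docker stop freeipa
# flush dirty pages to disk, then drop page cache, dentries and inodes
sync; echo 3 > /proc/sys/vm/drop_caches
docker start freeipa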
So my questions are:
1) What kind of processes behind the scenes of a 389ds server could cause such behavior?
2) Is it (kind of) normal?
3) What actions would you suggest to mitigate such behavior? Would the LDAP metrics in
cn=monitor have helped to determine that something was wrong? (Unfortunately they are not
currently monitored, but they will be; see the query sketch at the end of this message.)
4) Generally, what else would you suggest monitoring? Perhaps some metrics that would be
of help but are not very obvious.
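For question 3, this is roughly how I plan to query the monitor entries once monitoring is
in place. The DNs are the standard 389ds server and ldbm database monitor entries; the host
and bind DN are illustrative, and the cache attributes assume the BDB backend:
# server-level counters: connections, operations, threads
ldapsearch -H ldap://localhost -D "cn=Directory Manager" -W -b "cn=monitor" -s base
# database cache counters, including page-in/page-out activity
ldapsearch -H ldap://localhost -D "cn=Directory Manager" -W \
  -b "cn=monitor,cn=ldbm database,cn=plugins,cn=config" -s base \
  dbcachehitratio dbcachepagein dbcachepageout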