Hello gurus,
We have been running a 3-node FreeIPA cluster for some time without major
trouble. One server may go stale from time to time, but restarting it has
never been a real problem.
A few days ago, we had to migrate the VMs between two clouds (disk images
copied from one to the other), and they were renumbered from the old to the
new IPv4 address space. Not that easy, but we finally got it done with all
DNS entries in sync. Yet, since the migration, the ns-slapd process hangs
randomly far more often than before (it went from once every few months to
several times a day) and is especially hard to restart on any node.
While it starts up, the netstat output looks like this:
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address      Foreign Address      State        PID/Program name
tcp6  184527      0 10.217.151.3:389   10.217.151.2:52314   ESTABLISHED  29948/ns-slapd
Netstat and tcpdump show that it drains the receive queue very slowly
(sometimes around 79 bytes every 1-2 seconds). At some point it just stops
processing it and hangs (only kill -9 can take it down). When stale, strace
shows the process looping only on:
getpeername(8, 0x7ffe62c49fd0, 0x7ffe62c49f94) = -1 ENOTCONN (Transport
endpoint is not connected)
poll([{fd=50, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN},
{fd=9, events=POLLIN}, {fd=117, events=POLLIN}, {fd=116, events=POLLIN},
{fd=115, events=POLLIN}, {fd=114, events=POLLIN}, {fd=89, events=POLLIN},
{fd=85, events=POLLIN}, {fd=83, events=POLLIN}, {fd=82, events=POLLIN},
{fd=81, events=POLLIN}, {fd=80, events=POLLIN}, {fd=79, events=POLLIN},
{fd=78, events=POLLIN}, {fd=77, events=POLLIN}, {fd=76, events=POLLIN},
{fd=67, events=POLLIN}, {fd=72, events=POLLIN}, {fd=69, events=POLLIN},
{fd=64, events=POLLIN}, {fd=66, events=POLLIN}], 23, 250) = 0 (Timeout)
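If full thread stacks would help, I can capture them the next time it hangs.
gdb is available on the nodes, so something along these lines (29948 being
the ns-slapd PID from the netstat output above) should do it:

# gdb -p 29948 -batch -ex 'thread apply all bt full' > /tmp/ns-slapd-stacks.txt 2>&1

I can post that output if anyone thinks it would be useful.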
If it does manage to get through startup replication, one of the servers will
hang a little later, freezing the whole cluster and forcing us to restart the
faulty node to unlock things.
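I can also provide the replication agreement status from each node if that
helps; I would pull it roughly like this (the hostname is just an example,
each node would be queried for its own agreements):

# ipa-replica-manage list -v ipa1.example.com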
When stale, the dirsrv access log contains only entries like:
[20/Oct/2019:17:52:46.950029525 +0100] conn=86 fd=131 slot=131 connection
from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:52:51.280412883 +0100] conn=87 fd=132 slot=132 SSL
connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:52:54.956204031 +0100] conn=88 fd=133 slot=133 connection
from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:53:04.966542441 +0100] conn=89 fd=134 slot=134 connection
from 10.217.151.2 to 10.217.151.4
[20/Oct/2019:17:53:22.659053020 +0100] conn=90 fd=135 slot=135 SSL
connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:51.006707605 +0100] conn=91 fd=136 slot=136 connection
from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:53:54.514162543 +0100] conn=92 fd=137 slot=137 SSL
connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:59.011602776 +0100] conn=93 fd=138 slot=138 connection
from 10.217.151.3 to 10.217.151.4
[20/Oct/2019:17:54:09.019296900 +0100] conn=94 fd=139 slot=139 connection
from 10.217.151.4 to 10.217.151.4
And netstat lists tens of accepted network connections that are stale, like:
tcp6     286      0 10.217.151.4:389   10.217.151.10:32512  ESTABLISHED  29948/ns-slapd
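For what it's worth, I can also dump the connection and thread counters from
cn=monitor while a node is stuck, e.g. something like this (binding as
Directory Manager on the local server):

# ldapsearch -x -D "cn=Directory Manager" -W -H ldap://localhost -b "cn=monitor" -s base currentconnections readwaiters opsinitiated opscompleted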
The underlying network seems clean and uses jumbo frames. tcpdump and ping
show zero packet loss and no retransmits. Fearing it could be a jumbo-frame
issue, we even forced the MTU down to 1500, without success.
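If anyone still suspects path MTU despite the 1500 test, I can re-run
don't-fragment pings at jumbo size between the nodes, along the lines of
(8972 = 9000 minus 28 bytes of IP/ICMP headers; the address is one of the
other nodes):

# ping -M do -s 8972 -c 5 10.217.151.2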
Entropy seems fine as well:
# cat /proc/sys/kernel/random/entropy_avail
3138
Versions running on all servers:
ipa-client-4.6.5-11.el7.centos.x86_64
ipa-client-common-4.6.5-11.el7.centos.noarch
ipa-common-4.6.5-11.el7.centos.noarch
ipa-server-4.6.5-11.el7.centos.x86_64
ipa-server-common-4.6.5-11.el7.centos.noarch
ipa-server-dns-4.6.5-11.el7.centos.noarch
I'd gladly take any hints regarding this critical problem.
/Sylvain.