On Fri, Aug 01, 2014 at 02:00:32PM -0700, Daniel Jung wrote:
Hi,
There have been a few cases where SSSD seemed to stop working and required
a restart when the server's load average got high (~80) on a 24-thread
(24-processor) platform.
Running CentOS 6.5 with sssd 1.9.2-129 (x86_64).
On 6.5, sssd_nss.log shows the following:
(Thu Jul 31 20:28:38 2014) [sssd[nss]] [sss_dp_init] (0x0010): Failed to
connect to monitor services.
(Thu Jul 31 20:28:38 2014) [sssd[nss]] [sss_process_init] (0x0010): fatal
error setting up backend connector
(Thu Jul 31 20:30:19 2014) [sssd[nss]] [nss_dp_reconnect_init] (0x0010):
Could not reconnect to LDAP provider.
(Thu Jul 31 20:44:16 2014) [sssd[nss]] [nss_dp_reconnect_init] (0x0010):
Could not reconnect to LDAP provider.
(Thu Jul 31 20:44:46 2014) [sssd[nss]] [nss_dp_reconnect_init] (0x0010):
Could not reconnect to LDAP provider.
(Thu Jul 31 20:45:16 2014) [sssd[nss]] [nss_dp_reconnect_init] (0x0010):
Could not reconnect to LDAP provider.
(Thu Jul 31 20:45:46 2014) [sssd[nss]] [nss_dp_reconnect_init] (0x0010):
Could not reconnect to LDAP provider.
<snip>
(Fri Aug 1 19:55:18 2014) [sssd[nss]] [nss_dp_reconnect_init] (0x0010):
Could not reconnect to LDAP provider.
(Fri Aug 1 19:55:48 2014) [sssd[nss]] [nss_dp_reconnect_init] (0x0010):
Could not reconnect to LDAP provider.
(Fri Aug 1 19:56:18 2014) [sssd[nss]] [nss_dp_reconnect_init] (0x0010):
Could not reconnect to LDAP provider.
Also from /var/log/messages:
Jul 31 20:26:22 sssd[nss]: Shutting down
Jul 31 20:26:22 sssd[be[LDAP]]: Shutting down
Jul 31 20:26:24 sssd[be[LDAP]]: Starting up
Jul 31 20:26:24 sssd[nss]: Starting up
Jul 31 20:28:34 sssd[be[LDAP]]: Shutting down
Jul 31 20:28:34 sssd[nss]: Shutting down
Jul 31 20:28:38 sssd[nss]: Starting up
Jul 31 20:28:38 sssd[be[LDAP]]: Starting up
Jul 31 20:28:40 sssd[nss]: Starting up
Jul 31 20:30:05 sssd[be[LDAP]]: Shutting down
Jul 31 20:30:25 sssd[be[LDAP]]: Starting up
It seems the restart at 20:30:25 didn't properly restart the sssd daemon?
I believe 1.9.2-129 is the latest available?
For RHEL-6, yes.
enumerate is false, ldap_network_timeout is 5, and we have multiple
ldap_uri settings where the hosts are separated by ",".
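For reference, the settings described above would look roughly like this in
sssd.conf. This is only a sketch: the domain name LDAP is inferred from the
sssd[be[LDAP]] log entries, and the host names are placeholders, not taken
from the actual configuration.

```ini
[domain/LDAP]
enumerate = false
ldap_network_timeout = 5
# Placeholder hosts (assumption); multiple servers are separated by ","
ldap_uri = ldap://ldap1.example.com, ldap://ldap2.example.com
```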
Would appreciate if you can shed some light on this. Thanks.
The sssd_be process stopped or was killed for one reason or another. Can
you enable more verbose debugging and attach the logs, please?
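A minimal sketch of what enabling more verbose debugging could look like in
sssd.conf, assuming the domain section is named LDAP (as the sssd[be[LDAP]]
log entries suggest). The value 6 is only an illustrative choice; a higher
debug_level produces more detail.

```ini
[nss]
# Illustrative level (assumption); raise it for more detail
debug_level = 6

[domain/LDAP]
debug_level = 6
```

After restarting sssd, the more detailed output should appear in the
per-process log files under /var/log/sssd/, such as the sssd_nss.log quoted
above.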
Do you use enumeration? If so, can you try disabling it?