I'm encountering an interesting issue on one of our production systems. Because it's production, I'm trying to avoid rebooting the box unless it's absolutely necessary. The core issue is that RDS LDAP users are currently unable to log into this box at all because SSSD is in some sort of wedged state. The /var/log/secure file shows a lot of pam_sss(crond:session): Request to sssd failed. Connection refused errors.
A status check shows the service as running, but attempting to stop it fails:
[root@server~]# service sssd status
sssd (pid 13504) is running...
[root@server~]# service sssd stop
Stopping sssd: [FAILED]
ps shows the following, and attempting to kill -9 any of them doesn't work. No error is reported; the processes simply persist after the kill -9.
[root@server~]# ps aux | grep sss | grep -v grep
root 7532 0.0 0.0 150828 2168 ? D Jan03 0:00 /usr/libexec/sssd/sssd_nss -d 0 --debug-to-files
root 13504 0.1 0.0 95808 2928 ? D Jan05 1:44 /usr/sbin/sssd -f -D
root 16390 0.0 0.0 150784 2172 ? D Jan03 0:00 /usr/libexec/sssd/sssd_pam -d 0 --debug-to-files
root 24156 0.0 0.0 179396 5428 ? D 2014 8:21 /usr/libexec/sssd/sssd_be -d 0 --debug-to-files --domain default
root 29972 0.0 0.0 58988 544 ? D Jan05 0:00 rm -f /var/lib/sss/db/cache_default.ldb /var/lib/sss/db/config.ldb /var/lib/sss/db/sssd.ldb
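If I'm reading the ps output right, the D in the STAT column means all of these processes are in uninterruptible sleep inside the kernel (usually blocked on disk or network I/O), which would explain why kill -9 has no visible effect: the signal can't be acted on until the task returns from the kernel. This is a rough sketch of how I've been confirming the state and what the kernel is waiting on (the PID is a placeholder for one of the stuck processes):

```shell
# Show the scheduler state (STAT) and the kernel function the task is
# blocked in (WCHAN); a wedged task shows "D" plus an fs/IO wait symbol.
pid=$$   # placeholder: substitute a stuck PID such as 13504
ps -o pid,stat,wchan:32,cmd -p "$pid"

# On kernels that expose /proc/<pid>/stack, the kernel stack names the
# exact code path the task is blocked in (readable by root only):
cat /proc/"$pid"/stack 2>/dev/null || true
```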
Also, whenever I try to do anything with the /var/lib/sss/db/ directory or the files within it, the console becomes unresponsive. Even a simple ls of the path hangs, Ctrl+C won't interrupt commands issued against that location, and the session locks up.
I'm still new to strace, but I ran it while trying to ls /var/lib/sss/db/, and the output stopped with getdents(3, as the final entry, followed by the same unresponsive console as before. I also set up strace to monitor the PIDs of the sssd-related processes with commands similar to the one below, but the output files are never populated with anything.
strace -s 2000 -o /path/to/file/out.txt -fp 12345 &
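My understanding (which may be wrong) is that strace attaches via ptrace, and a task in uninterruptible sleep never stops for the tracer, so the attach silently stalls and the output file stays empty, which matches what I'm seeing. The kernel's view of the process can still be read straight from /proc without attaching anything (PID is again a placeholder):

```shell
pid=$$   # placeholder: substitute an sssd PID such as 13504
# "State: D (disk sleep)" here would confirm uninterruptible sleep
# without needing a tracer at all.
grep '^State:' /proc/"$pid"/status
```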
/var/log/messages doesn't really seem to include anything that relates to this issue.
/var/log/sssd/sssd_nss.log and /var/log/sssd/sssd_pam.log are both empty, with no archived copies.
/var/log/sssd/sssd_[domain].log has a timestamp of Jan 4th but is empty. The previous archived log shows the following entries, which occurred on the same day that ps shows some of the sssd processes started:
[root@eh01db01 ~]# zcat /var/log/sssd/sssd.log.1.gz
(Sat Jan 3 05:19:16 2015) [sssd] [mt_svc_sigkill] (0): [default] is not responding to SIGTERM. Sending SIGKILL.
(Sat Jan 3 05:22:56 2015) [sssd] [mt_svc_sigkill] (0): [nss] is not responding to SIGTERM. Sending SIGKILL.
(Sat Jan 3 05:34:16 2015) [sssd] [mt_svc_sigkill] (0): [pam] is not responding to SIGTERM. Sending SIGKILL.
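In case it helps, these are roughly the checks I'd planned to run next before resorting to a reboot, looking for a storage-level fault that could leave tasks pinned in D state (a sketch; the grep patterns and paths are examples, not output from this box):

```shell
# Kernel hung-task warnings and block-device errors usually accompany
# processes wedged in uninterruptible sleep:
dmesg 2>/dev/null | grep -iE 'hung_task|blocked for more than|I/O error' | tail -20

# Identify which filesystem/device backs the directory that hangs;
# a stale NFS mount or a failing disk there would fit the symptoms:
df -P /var/lib/sss/db 2>/dev/null || df -P /
```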
This is a new issue that we haven't seen on any of our systems before. I'm hoping to find the root cause so that if it occurs again we'll know how to react or, better yet, can preemptively fix other systems that may be affected. I'm at a loss on where to go from here. Rebooting may well clear the issue if sssd then starts cleanly, which would be great, but it still leaves me wondering what the actual problem is. Any advice is greatly appreciated!