Hello, Everyone!
I'm encountering an interesting issue on one of our production systems. Because it's production, I'd like to avoid rebooting the system unless it's necessary. The core issue is that RDS ldap users are currently unable to log into this box at all because SSSD is stuck in some sort of funky state. The /var/log/secure file shows a lot of "pam_sss(crond:session): Request to sssd failed. Connection refused" errors.
A status check against the service shows that it is running, but attempting to stop it fails.
[root@server~]# service sssd status
sssd (pid 13504) is running...
[root@server~]# service sssd stop
Stopping sssd: [FAILED]
ps shows the following processes, and attempting to kill -9 any of them doesn't work. No error is reported; the processes just persist past the kill -9.
[root@server~]# ps aux | grep sss | grep -v grep
root 7532 0.0 0.0 150828 2168 ? D Jan03 0:00 /usr/libexec/sssd/sssd_nss -d 0 --debug-to-files
root 13504 0.1 0.0 95808 2928 ? D Jan05 1:44 /usr/sbin/sssd -f -D
root 16390 0.0 0.0 150784 2172 ? D Jan03 0:00 /usr/libexec/sssd/sssd_pam -d 0 --debug-to-files
root 24156 0.0 0.0 179396 5428 ? D 2014 8:21 /usr/libexec/sssd/sssd_be -d 0 --debug-to-files --domain default
root 29972 0.0 0.0 58988 544 ? D Jan05 0:00 rm -f /var/lib/sss/db/cache_default.ldb /var/lib/sss/db/config.ldb /var/lib/sss/db/sssd.ldb
Also, whenever I try to do anything with the /var/lib/sss/db/ directory or the files within it, the console becomes unresponsive. This happens even if I just ls the path; ctrl+c won't interrupt commands issued against this location, and the console locks up.
I'm still new to strace, but I ran it while trying to ls /var/lib/sss/db/, and the output stopped with getdents(3, as the final entry, followed by the same unresponsive console as before. I also set up strace to monitor the pids of the sssd-related processes with commands similar to the one below, but the output files aren't being populated with anything.
strace -s 2000 -o /path/to/file/out.txt -fp 12345 &
/var/log/messages doesn't really seem to include anything that relates to this issue.
/var/log/sssd/sssd_nss.log and /var/log/sssd/sssd_pam.log are both empty with no archives
/var/log/sssd/sssd_[domain].log has a timestamp of Jan 4th, but is empty. The previous archived log shows the following entries, which were logged on the same day that ps reports as the start date for some of the sssd processes.
[root@eh01db01 ~]# zcat /var/log/sssd/sssd.log.1.gz
(Sat Jan 3 05:19:16 2015) [sssd] [mt_svc_sigkill] (0): [default][24156] is not responding to SIGTERM. Sending SIGKILL.
(Sat Jan 3 05:22:56 2015) [sssd] [mt_svc_sigkill] (0): [nss][24159] is not responding to SIGTERM. Sending SIGKILL.
(Sat Jan 3 05:34:16 2015) [sssd] [mt_svc_sigkill] (0): [pam][24160] is not responding to SIGTERM. Sending SIGKILL.
This is a new issue that hasn't been seen on any of our systems before. I'm hoping to find the root cause so that if it occurs again we'll know how to react or, better yet, can preemptively fix other systems that may be affected. I'm at a loss as to how to proceed from here. Rebooting may well fix the issue if sssd then starts cleanly, which would be great, but it still leaves me wondering what the actual problem is. Any advice is greatly appreciated!
Thanks,
On (06/01/15 14:17), Patrick Mayo wrote:
> ps shows that the following results and attempting to kill -9 any of them doesn't seem to work. No error is reported, the processes just persist past the kill -9 order.
That's weird.
> [root@server~]# ps aux | grep sss | grep -v grep
> root 7532 0.0 0.0 150828 2168 ? D Jan03 0:00 /usr/libexec/sssd/sssd_nss -d 0 --debug-to-files
> root 13504 0.1 0.0 95808 2928 ? D Jan05 1:44 /usr/sbin/sssd -f -D
> root 16390 0.0 0.0 150784 2172 ? D Jan03 0:00 /usr/libexec/sssd/sssd_pam -d 0 --debug-to-files
> root 24156 0.0 0.0 179396 5428 ? D 2014 8:21
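^^^ The D in the STAT column means uninterruptible sleep (usually I/O). A process in that state does not act on signals, SIGKILL included, until the blocking call returns, which is why kill -9 appears to do nothing.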
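To see at a glance which processes are stuck this way, something like the sketch below may help. It is only an illustration: filter_dstate and list_dstate are made-up helper names, and it assumes a procps-style ps that accepts the wchan:32 and args output columns. The WCHAN column shows the kernel function the process is sleeping in, which can hint at the layer (for example a filesystem) that is blocking.

```shell
# Hypothetical helpers, not part of sssd or procps.

# Keep the header line plus every row whose STAT column starts with "D".
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List PID, state, kernel wait channel and command line of D-state processes.
# Assumes a procps-style ps that supports these -o column names.
list_dstate() {
    ps -eo pid,stat,wchan:32,args | filter_dstate
}
```

Running list_dstate as root on the affected box would print one row per uninterruptible process; if they all share a wait channel, that is a good place to start looking.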
> Also, whenever I try to do anything with the /var/lib/sss/db/ directory or the files within it, the console becomes unresponsive. This even happens if I ls the path and ctrl+c won't break commands issued to this location and the console locks up.
> I'm still new to strace, but I used that while trying to ls /var/lib/sss/db/ and the results stopped with getdents(3, being the final entry and the same unresponsive console issue as before. I also setup strace to monitor the pids of the sssd related processes with commands similar to the one below, but the output files aren't being populated with anything.
It could be a problem with the filesystem where /var/lib/sss/db/ is stored. getdents is the "get directory entries" system call, and it is not related to sssd.
> strace -s 2000 -o /path/to/file/out.txt -fp 12345 &
> /var/log/messages doesn't really seem to include anything that relates to this issue.
> /var/log/sssd/sssd_nss.log and /var/log/sssd/sssd_pam.log are both empty with no archives
> /var/log/sssd/sssd_[domain].log has a timestamp of Jan 4th, but is empty. The previous archived log shows the following entries which occurred the same day that ps shows some of the sssd processes started on.
Debugging is not enabled by default. You can change the debug level on the fly with the sss_debuglevel command-line utility, or you can set debug_level in the sssd configuration file, sssd.conf (see man sssd.conf -> debug_level).
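As a rough illustration of the sssd.conf route: the sketch below prints a copy of a config file with debug_level set in one section. set_debug_level is a made-up helper, the section name domain/default is only a guess based on the "--domain default" in the ps output, and 6 is an arbitrary example level. The on-the-fly variant would be along the lines of "sss_debuglevel 6", assuming the daemon is responsive enough to pick it up.

```shell
# Hypothetical helper: print FILE with "debug_level = LEVEL" set in [SECTION].
# Usage: set_debug_level FILE SECTION LEVEL
# The file itself is not modified; review the output before installing it.
set_debug_level() {
    awk -v section="[$2]" -v level="$3" '
        # At each section header, finish the target section if we are leaving it.
        /^\[/ {
            if (in_section && !done) { print "debug_level = " level; done = 1 }
            in_section = ($0 == section)
        }
        # Replace an existing debug_level line inside the target section.
        in_section && /^[ \t]*debug_level[ \t]*=/ {
            print "debug_level = " level; done = 1; next
        }
        { print }
        # The target section was the last one in the file and had no debug_level.
        END { if (in_section && !done) print "debug_level = " level }
    ' "$1"
}
```

For example, "set_debug_level /etc/sssd/sssd.conf domain/default 6 > /tmp/sssd.conf.new" would leave the live file alone so the result can be checked first. sssd needs a restart to pick up changes made in sssd.conf. If the named section does not exist, the output is an unchanged copy.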
LS
On 01/06/2015 05:35 PM, Lukas Slebodnik wrote:
> On (06/01/15 14:17), Patrick Mayo wrote:
>> ps shows that the following results and attempting to kill -9 any of them doesn't seem to work. No error is reported, the processes just persist past the kill -9 order.
> That's weird.
I saw something like this many years ago on Solaris, where the process was actually gone but the parent and the OS thought it was still there. I think it was caused by a memory violation inside an atexit() callback or something like that.
_______________________________________________
sssd-users mailing list
sssd-users@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/sssd-users
On Tue, Jan 06, 2015 at 11:35:36PM +0100, Lukas Slebodnik wrote:
>> I'm still new to strace, but I used that while trying to ls /var/lib/sss/db/ and the results stopped with getdents(3, being the final entry and the same unresponsive console issue as before. I also setup strace to monitor the pids of the sssd related processes with commands similar to the one below, but the output files aren't being populated with anything.
> It can be some problem with filesystem where /var/lib/sss/db/ is stored. getdents is "get directory entries" and it is not related to sssd.
+1, I suspect a FS issue as well.
Please note that keeping the cache on an NFS share is not supported, due to issues with the underlying tdb database.
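One way to check which filesystem backs the cache directory without touching the directory itself (anything that stats or lists the path may hang the same way ls does) is to read the mount table. The sketch below is only an illustration: mount_for_path is a made-up helper, it assumes a Linux-style /proc/mounts, and it does not handle mount points containing spaces.

```shell
# Hypothetical helper: print "mountpoint fstype" for the mount containing PATH.
# Usage: mount_for_path PATH [MOUNTS_FILE]
# Only the mount table is read (default /proc/mounts); PATH is never accessed.
mount_for_path() {
    awk -v path="$1" '
        {
            mp = $2
            # A mount point matches if it is "/", equals the path, or is a
            # parent directory of the path.
            if (mp == "/" || path == mp || index(path, mp "/") == 1) {
                # Prefer the longest match; on ties the later entry wins,
                # since later mounts shadow earlier ones.
                if (length(mp) >= length(best)) { best = mp; fstype = $3 }
            }
        }
        END { if (best != "") print best, fstype }
    ' "${2:-/proc/mounts}"
}
```

Calling "mount_for_path /var/lib/sss/db" prints the mount point and filesystem type. A type of nfs or nfs4 there would match the unsupported setup described above, and any other type at least narrows down which filesystem to investigate.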