Running Red Hat Enterprise Linux Server release 6.5 (Santiago) - 2.6.32-431.el6.x86_64
SSSD version: sssd-1.13.3-22.el6_8.4.x86_64
I'm seeing (seemingly random?) shutdown/termination of sssd across multiple nodes, all with the same configuration. To my knowledge there is no process going around killing things, we even have a scheduled job to check sssd status and restart every 5 minutes if unavailable:
/var/log/sssd/sssd.log:284469:(Mon Sep 26 12:21:29 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
/var/log/sssd/sssd.log:318707:(Mon Sep 26 16:19:19 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
/var/log/sssd/sssd.log:321889:(Mon Sep 26 16:43:12 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
/var/log/sssd/sssd.log:474327:(Tue Sep 27 10:29:39 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
/var/log/sssd/sssd.log:475205:(Tue Sep 27 10:34:36 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
Right before each shutdown, there are lots of the following nss_cmd_getbynam and sss_ncache_check_str entries for 'root' in sssd_nss.log:
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0400): Running command [38][SSS_NSS_INITGR] with input [root].
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [sss_parse_name_for_domains] (0x0200): name 'root' matched without domain, user is root
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100): Requesting info for [root] from [<ALL>]
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [sss_ncache_check_str] (0x2000): Checking negative cache for [NCE/USER/MYDOMAIN/root]
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0400): User [root] does not exist in [MYDOMAIN]! (negative cache)
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0080): No matching domain found for [root], fail!
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0xf7e120][24]
(Mon Sep 26 16:43:12 2016) [sssd[nss]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Mon Sep 26 16:43:12 2016) [sssd[nss]] [client_destructor] (0x2000): Terminated client [0xf7e120][24]
(Mon Sep 26 16:43:12 2016) [sssd[nss]] [client_destructor] (0x2000): Terminated client [0xf840e0][23]
(Mon Sep 26 16:43:12 2016) [sssd[nss]] [client_destructor] (0x2000): Terminated client [0xf7b500][22]
Corresponding AD log for same period:
(Mon Sep 26 16:43:10 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 0x142aa90
(Mon Sep 26 16:43:10 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): Dispatching.
(Mon Sep 26 16:43:10 2016) [sssd[be[MYDOMAIN]]] [sbus_message_handler] (0x2000): Received SBUS method org.freedesktop.sssd.service.ping on path /org/freedesktop/sssd/service
(Mon Sep 26 16:43:10 2016) [sssd[be[MYDOMAIN]]] [sbus_get_sender_id_send] (0x2000): Not a sysbus message, quit
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x1440c50/0x143e080
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x1440c50/0x143e030
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 0x143eb00
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_client_destructor] (0x0400): Removed SUDO client
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x1444030/0x14420b0
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x1444030/0x1442060
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 0x1443250
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_client_destructor] (0x0400): Removed PAM client
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x143d070/0x142c0d0
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x143d070/0x142aeb0
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 0x143c570
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_client_destructor] (0x0400): Removed NSS client
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Smart Refresh]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Full Refresh]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sdap_handle_release] (0x2000): Trace: sh[0x14f9ff0], connected[1], ops[(nil)], ldap[0x1449c10], destructor_lock[0], release_memory[0]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [remove_connection_callback] (0x4000): Successfully removed connection callback.
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x142f250/0x1417480
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [remove_socket_symlink] (0x4000): The symlink points to [/var/lib/sss/pipes/private/sbus-dp_MYDOMAIN.11328]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [remove_socket_symlink] (0x4000): The path including our pid is [/var/lib/sss/pipes/private/sbus-dp_MYDOMAIN.11328]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [remove_socket_symlink] (0x4000): Removed the symlink
AD controllers are WIN2012R2
SSSD is configured with a single domain (MYDOMAIN)
######begin sssd.conf (redacted)#####
[sssd]
config_file_version = 2
services = nss, pam, sudo
domains = MYDOMAIN
debug_level = 9
[nss]
default_shell = /bin/bash
debug_level = 9
filter_users = root
filter_groups = root
[pam]
debug_level = 9
[sudo]
debug_level = 9
[domain/MYDOMAIN]
id_provider = ldap
access_provider = simple
cache_credentials = false
debug_level = 9
ldap_server = _srv_
ldap_search_base = #########
ldap_id_use_start_tls = true
ldap_tls_reqcert = allow
ldap_default_bind_dn = #########
ldap_default_authtok_type = password
ldap_default_authtok = #########
ldap_user_search_base = ou=BusinessUnits,dc=mydomain
ldap_user_object_class = user
ldap_id_mapping = true
ldap_schema = ad
ldap_group_search_base = #########
ldap_group_object_class = group
ldap_referrals = false
enumerate = false
override_homedir = /export/home/%u
ldap_group_nesting_level = 5
ldap_use_tokengroups = false
simple_allow_groups = sasi,sasadmin,sasmgt ldap_access_order = expire ldap_account_expire_policy = ad
######end sssd.conf#####
This document is strictly confidential and is intended for use by the addressee unless otherwise indicated. Allied Irish Banks AIB and AIB Group are registered business names of Allied Irish Banks p.l.c. Allied Irish Banks, p.l.c. is regulated by the Central Bank of Ireland. Registered Office: Bankcentre, Ballsbridge, Dublin 4. Tel: + 353 1 6600311; Registered in Ireland: Registered No. 24173. ~~~~~~~Please consider the environment before printing this Email~~~~~~~~ This email has been scanned by an external Email Security System. This Disclaimer has been generated by CMDis
On 09/27/2016 07:38 AM, Richard Collins wrote:
Running Red Hat Enterprise Linux Server release 6.5 (Santiago) - 2.6.32-431.el6.x86_64
SSSD version: sssd-1.13.3-22.el6_8.4.x86_64
I'm seeing (seemingly random?) shutdown/termination of sssd across multiple nodes, all with the same configuration. To my knowledge there is no process going around killing things, we even have a scheduled job to check sssd status and restart every 5 minutes if unavailable:
/var/log/sssd/sssd.log:284469:(Mon Sep 26 12:21:29 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
/var/log/sssd/sssd.log:318707:(Mon Sep 26 16:19:19 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
/var/log/sssd/sssd.log:321889:(Mon Sep 26 16:43:12 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
/var/log/sssd/sssd.log:474327:(Tue Sep 27 10:29:39 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
/var/log/sssd/sssd.log:475205:(Tue Sep 27 10:34:36 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
The monitor_quit_signal function should only be called when the SSSD monitor process receives SIGINT or SIGTERM. It looks like you already have debug_level = 9 in the monitor section of sssd.conf, so I would hope to see some more useful messages in /var/log/sssd/sssd.log around the same timeframe as above.
If that is not the case, you could try running a systemtap script like the one here to determine if there is an unexpected script or process sending these signals:
https://sourceware.org/systemtap/examples/process/sigkill.stp
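To line up the shutdown times with other logs (cron, audit, or the systemtap output), the monitor_quit_signal entries can be pulled out of sssd.log with a few lines of Python. This is only a sketch: the regular expression is based on the log lines quoted in this thread, and the function name shutdown_times is made up for illustration.

```python
import re
from datetime import datetime

# Matches the monitor lines quoted in this thread, for example:
# (Mon Sep 26 12:21:29 2016) [sssd] [monitor_quit_signal] (0x2000): Received shutdown command
SHUTDOWN_RE = re.compile(
    r"\((?P<ts>[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\) "
    r"\[sssd\] \[monitor_quit_signal\]"
)


def shutdown_times(lines):
    """Return the timestamp of every monitor_quit_signal entry, in order."""
    times = []
    for line in lines:
        match = SHUTDOWN_RE.search(line)
        if match:
            # Collapse the double space used for single-digit days ("Sep  6").
            stamp = " ".join(match.group("ts").split())
            times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times


sample = [
    "/var/log/sssd/sssd.log:321889:(Mon Sep 26 16:43:12 2016) [sssd] "
    "[monitor_quit_signal] (0x2000): Received shutdown command",
    "(Mon Sep 26 16:43:11 2016) [sssd[nss]] [reset_idle_timer] (0x4000): "
    "Idle timer re-set for client [0xf7e120][24]",
]
for when in shutdown_times(sample):
    print(when.isoformat())
```

Only the first sample line is a monitor shutdown entry, so only its timestamp is returned; the grep-style file prefix does not get in the way because the pattern is searched for anywhere in the line.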
Right before each shutdown, there are lots of the following nss_cmd_getbynam and sss_ncache_check_str entries for 'root' in sssd_nss.log:
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0400): Running command [38][SSS_NSS_INITGR] with input [root].
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [sss_parse_name_for_domains] (0x0200): name 'root' matched without domain, user is root
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100): Requesting info for [root] from [<ALL>]
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [sss_ncache_check_str] (0x2000): Checking negative cache for [NCE/USER/MYDOMAIN/root]
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0400): User [root] does not exist in [MYDOMAIN]! (negative cache)
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0080): No matching domain found for [root], fail!
(Mon Sep 26 16:43:11 2016) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0xf7e120][24]
(Mon Sep 26 16:43:12 2016) [sssd[nss]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Mon Sep 26 16:43:12 2016) [sssd[nss]] [client_destructor] (0x2000): Terminated client [0xf7e120][24]
(Mon Sep 26 16:43:12 2016) [sssd[nss]] [client_destructor] (0x2000): Terminated client [0xf840e0][23]
(Mon Sep 26 16:43:12 2016) [sssd[nss]] [client_destructor] (0x2000): Terminated client [0xf7b500][22]
You have 'filter_users = root' in the sssd.conf so these messages about 'root' should be expected. When the monitor shutdown is called it will terminate child processes which is why the NSS Responder gets shut down here.
Corresponding AD log for same period:
(Mon Sep 26 16:43:10 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 0x142aa90
(Mon Sep 26 16:43:10 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): Dispatching.
(Mon Sep 26 16:43:10 2016) [sssd[be[MYDOMAIN]]] [sbus_message_handler] (0x2000): Received SBUS method org.freedesktop.sssd.service.ping on path /org/freedesktop/sssd/service
(Mon Sep 26 16:43:10 2016) [sssd[be[MYDOMAIN]]] [sbus_get_sender_id_send] (0x2000): Not a sysbus message, quit
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x1440c50/0x143e080
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x1440c50/0x143e030
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 0x143eb00
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_client_destructor] (0x0400): Removed SUDO client
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x1444030/0x14420b0
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x1444030/0x1442060
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 0x1443250
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_client_destructor] (0x0400): Removed PAM client
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x143d070/0x142c0d0
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x143d070/0x142aeb0
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 0x143c570
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_client_destructor] (0x0400): Removed NSS client
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Smart Refresh]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [be_ptask_destructor] (0x0400): Terminating periodic task [SUDO Full Refresh]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sdap_handle_release] (0x2000): Trace: sh[0x14f9ff0], connected[1], ops[(nil)], ldap[0x1449c10], destructor_lock[0], release_memory[0]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [remove_connection_callback] (0x4000): Successfully removed connection callback.
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [sbus_remove_watch] (0x2000): 0x142f250/0x1417480
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [remove_socket_symlink] (0x4000): The symlink points to [/var/lib/sss/pipes/private/sbus-dp_MYDOMAIN.11328]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [remove_socket_symlink] (0x4000): The path including our pid is [/var/lib/sss/pipes/private/sbus-dp_MYDOMAIN.11328]
(Mon Sep 26 16:43:12 2016) [sssd[be[MYDOMAIN]]] [remove_socket_symlink] (0x4000): Removed the symlink
AD controllers are WIN2012R2
SSSD is configured with a single domain (MYDOMAIN)
######begin sssd.conf (redacted)#####
[sssd]
config_file_version = 2
services = nss, pam, sudo
domains = MYDOMAIN
debug_level = 9
[nss]
default_shell = /bin/bash
debug_level = 9
filter_users = root
filter_groups = root
[pam]
debug_level = 9
[sudo]
debug_level = 9
[domain/MYDOMAIN]
id_provider = ldap
access_provider = simple
cache_credentials = false
debug_level = 9
ldap_server = _srv_
ldap_search_base = #########
ldap_id_use_start_tls = true
ldap_tls_reqcert = allow
ldap_default_bind_dn = #########
ldap_default_authtok_type = password
ldap_default_authtok = #########
ldap_user_search_base = ou=BusinessUnits,dc=mydomain
ldap_user_object_class = user
ldap_id_mapping = true
ldap_schema = ad
ldap_group_search_base = #########
ldap_group_object_class = group
ldap_referrals = false
enumerate = false
override_homedir = /export/home/%u
ldap_group_nesting_level = 5
ldap_use_tokengroups = false
simple_allow_groups = sasi,sasadmin,sasmgt ldap_access_order = expire ldap_account_expire_policy = ad
######end sssd.conf#####
For the most part this sssd.conf looks okay to me except for
ldap_server = _srv_
I could not find this option in the man page, it looks to be invalid or deprecated.
simple_allow_groups = sasi,sasadmin,sasmgt ldap_access_order = expire ldap_account_expire_policy = ad
Are these three options each defined on the same line, or is it the email formatting that may have appended these to one line?
I would fix these and see if that helps.
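Whether several options really share one line in the file on disk can be checked mechanically rather than by eye. The following is a rough heuristic sketch, not a real sssd.conf parser: it assumes options are written as "name = value" with spaces around the equals sign, and the function name lines_with_multiple_options is hypothetical.

```python
import re

# An option name followed by " = ". Requiring whitespace around "=" keeps
# LDAP values such as "ou=BusinessUnits,dc=mydomain" from being counted.
ASSIGNMENT_RE = re.compile(r"(?:^|\s)([A-Za-z_][A-Za-z0-9_]*)\s+=\s")


def lines_with_multiple_options(text):
    """Return (line_number, option_names) for lines holding more than one option."""
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines, comments and section headers.
        if not stripped or stripped.startswith(("#", ";", "[")):
            continue
        names = ASSIGNMENT_RE.findall(stripped)
        if len(names) > 1:
            suspects.append((number, names))
    return suspects


conf = (
    "[domain/MYDOMAIN]\n"
    "ldap_user_search_base = ou=BusinessUnits,dc=mydomain\n"
    "simple_allow_groups = sasi,sasadmin,sasmgt ldap_access_order = expire "
    "ldap_account_expire_policy = ad\n"
)
for number, names in lines_with_multiple_options(conf):
    print(number, names)
```

Run against the real file (for example by reading /etc/sssd/sssd.conf as text), an empty result would suggest the joined line is only an artifact of the email formatting.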
sssd-users mailing list -- sssd-users@lists.fedorahosted.org To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org
Hi thanks for responding....
The monitor_quit_signal function should only be called when the SSSD monitor process receives SIGINT or SIGTERM. It looks like you already have debug_level = 9 in the monitor section of sssd.conf, so I would hope to see some more useful messages in /var/log/sssd/sssd.log around the same timeframe as above.
There's not a lot in /var/log/sssd/sssd.log around the time of the termination, just the termination notifications. However I'll post the relevant excerpts when I get back into the office tomorrow.
If that is not the case, you could try running a systemtap script like the one here to determine if there is an unexpected script or process sending these signals:
https://sourceware.org/systemtap/examples/process/sigkill.stp
Thanks for that - I was wondering how I would trace the sigkill
You have 'filter_users = root' in the sssd.conf so these messages about 'root' should be expected. When the monitor shutdown is called it will terminate child processes which is why the NSS Responder gets shut down here.
I added the filter_users in the hope that it would ignore the root user requests - not sure why there are so many requests for root? Adding this setting didn't change the occurrence of the entries in the log so maybe doesn't do what I expected.
For the most part this sssd.conf looks okay to me except for
ldap_server = _srv_
I could not find this option in the man page, it looks to be invalid or deprecated.
This was in the config as I found it. It was originally configured by a third party and I've picked up support for it. If this is unsupported then I'll remove it and see if it has any impact.
simple_allow_groups = sasi,sasadmin,sasmgt ldap_access_order = expire ldap_account_expire_policy = ad
Are these three options each defined on the same line, or is it the email formatting that may have appended these to one line?
Email formatting - they are set correctly one per line in the config
I'll remove the ldap_server option and see how it goes
On 09/27/2016 06:47 PM, Richard Collins wrote:
Hi thanks for responding....
The monitor_quit_signal function should only be called when the SSSD monitor process receives SIGINT or SIGTERM. It looks like you already have debug_level = 9 in the monitor section of sssd.conf, so I would hope to see some more useful messages in /var/log/sssd/sssd.log around the same timeframe as above.
There's not a lot in /var/log/sssd/sssd.log around the time of the termination, just the termination notifications. However I'll post the relevant excerpts when I get back into the office tomorrow.
If that is not the case, you could try running a systemtap script like the one here to determine if there is an unexpected script or process sending these signals:
https://sourceware.org/systemtap/examples/process/sigkill.stp
Thanks for that - I was wondering how I would trace the sigkill
You have 'filter_users = root' in the sssd.conf so these messages about 'root' should be expected. When the monitor shutdown is called it will terminate child processes which is why the NSS Responder gets shut down here.
I added the filter_users in the hope that it would ignore the root user requests - not sure why there are so many requests for root? Adding this setting didn't change the occurrence of the entries in the log so maybe doesn't do what I expected.
I believe this is inherent to the glibc initgroups library call, which consults every source listed for the database in nsswitch.conf, so a root login is also passed to 'sss' and not just to 'files'.
The 'filter_users = root' option will cut off processing this request early in the NSS responder and keep it in the negative cache.
For the most part this sssd.conf looks okay to me except for
ldap_server = _srv_
I could not find this option in the man page, it looks to be invalid or deprecated.
This was in the config as I found it. It was originally configured by a third party and I've picked up support for it. If this is unsupported then I'll remove it and see if it has any impact.
A basic template for configuring sssd.conf with the LDAP provider is at the following link (if the LDAP server is Active Directory, we recommend using the AD provider):
https://fedorahosted.org/sssd/wiki/HOWTO_Configure
simple_allow_groups = sasi,sasadmin,sasmgt ldap_access_order = expire ldap_account_expire_policy = ad
Are these three options each defined on the same line, or is it the email formatting that may have appended these to one line?
Email formatting - they are set correctly one per line in the config
I'll remove the ldap_server option and see how it goes
Yes, let us know how it goes.
We are still seeing random intermittent stoppages of the SSSD service.
Following Justin's advice, I set up an stap script to catch what was killing sssd and its related processes.
It turns out sssd is killing itself. See stap output below. Would there be any reason for this?
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd (pid:13831) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_sudo (pid:13835) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_sudo (pid:13835) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_pam (pid:13834) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_pam (pid:13834) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_nss (pid:13833) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_nss (pid:13833) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_be (pid:13832) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_be (pid:13832) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd (pid:13831) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to oddjobd (pid:10391) by oddjobd uid:0
[Wed Nov 16 10:39:35 2016] SIGTERM was sent to oddjobd (pid:22422) by oddjobd uid:0
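When the stap output gets longer, it can help to summarise which process names signalled which targets. The snippet below is a small sketch that assumes the exact line format shown above; the function name senders_by_target is invented for this example.

```python
import re
from collections import defaultdict

# Format of the stap lines in this thread:
# [Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_nss (pid:13833) by sssd uid:0
SIGNAL_RE = re.compile(
    r"\[(?P<when>[^\]]+)\] (?P<sig>SIG\w+) was sent to (?P<target>\S+) "
    r"\(pid:(?P<pid>\d+)\) by (?P<sender>\S+) uid:(?P<uid>\d+)"
)


def senders_by_target(text):
    """Map each (target name, target pid) to the set of sender process names."""
    result = defaultdict(set)
    for match in SIGNAL_RE.finditer(text):
        key = (match.group("target"), int(match.group("pid")))
        result[key].add(match.group("sender"))
    return dict(result)


stap_output = (
    "[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd (pid:13831) by sssd uid:0\n"
    "[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_nss (pid:13833) by sssd uid:0\n"
    "[Wed Nov 16 10:39:34 2016] SIGTERM was sent to oddjobd (pid:10391) by oddjobd uid:0\n"
)
for target, senders in senders_by_target(stap_output).items():
    print(target, sorted(senders))
```

Note that this output format records the sender's process name but not its PID, so the summary cannot tell two processes with the same name apart.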
On 11/16/2016 06:19 AM, richard.y.collins@aib.ie wrote:
We are still seeing random intermittent stoppages of the SSSD service.
Following Justin's advice, I set up an stap script to catch what was killing sssd and its related processes.
It turns out sssd is killing itself. See stap output below. Would there be any reason for this?
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd (pid:13831) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_sudo (pid:13835) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_sudo (pid:13835) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_pam (pid:13834) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_pam (pid:13834) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_nss (pid:13833) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_nss (pid:13833) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_be (pid:13832) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_be (pid:13832) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd (pid:13831) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to oddjobd (pid:10391) by oddjobd uid:0
[Wed Nov 16 10:39:35 2016] SIGTERM was sent to oddjobd (pid:22422) by oddjobd uid:0
The SSSD debug logs should give some indication of what's happening around the Nov 16 10:39:34 timeframe.
The primary SSSD service sends heartbeat pings to the other SSSD services; if a service does not respond to 3 pings, SSSD will attempt to send it a SIGTERM.
Note: https://fedorahosted.org/sssd/wiki/DevelTips#WhenIdebuganSSSDprocessinadebug...
The 'timeout' value in sssd.conf configures the time interval between pings, which defaults to 10 seconds but can be increased (it can be added to each section of sssd.conf).
For example:
[sssd]
timeout = 60
[nss]
timeout = 60
[pam]
timeout = 60
[sudo]
timeout = 60
[domain/MYDOMAIN]
timeout = 60
Kind regards, Justin Stephenson
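The ping rule described in the message above can be modelled in a few lines. This is an illustrative sketch only, not SSSD code: it assumes the 3 missed pings must be consecutive and that any answer resets the count, neither of which is stated in the thread, and the function name pings_until_sigterm is hypothetical.

```python
def pings_until_sigterm(responses, max_missed=3):
    """Return the 1-based ping number at which SIGTERM would be sent, or None.

    `responses` holds one boolean per ping: True if the service answered,
    False if it did not. Assumption: misses must be consecutive.
    """
    missed = 0
    for number, answered in enumerate(responses, start=1):
        missed = 0 if answered else missed + 1
        if missed >= max_missed:
            return number
    return None


# One answered ping followed by three unanswered ones.
print(pings_until_sigterm([True, False, False, False]))
# An answer in between resets the count, so no SIGTERM is sent here.
print(pings_until_sigterm([False, False, True, False]))
```

Under these assumptions, a service that stops answering survives for roughly three ping intervals, so raising 'timeout' from 10 to 60 seconds gives a busy service correspondingly longer before it is terminated.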
Thanks Justin.....interesting - I did have a suspicion about this timeout setting but ignored it a while back based on some other posts I found. I've set the timeout to 60 seconds and removed any debug setting in case this is causing problems.
stap should catch any SIGTERM/SIGKILL
On Wed, Nov 16, 2016 at 09:33:44AM -0500, Justin Stephenson wrote:
On 11/16/2016 06:19 AM, richard.y.collins@aib.ie wrote:
We are still seeing random intermittent stoppages of the SSSD service.
Following Justin's advice, I set up an stap script to catch what was killing sssd and its related processes.
It turns out sssd is killing itself. See stap output below. Would there be any reason for this?
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd (pid:13831) by sssd uid:0
I found it a bit strange that sssd sends a signal to itself here. It almost looks like a graceful shutdown...
I agree with Justin that sssd debug logs would provide a bit more insight here as well.
Knowing what version you run might help, too.
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_sudo (pid:13835) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_sudo (pid:13835) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_pam (pid:13834) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_pam (pid:13834) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_nss (pid:13833) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_nss (pid:13833) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_be (pid:13832) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd_be (pid:13832) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to sssd (pid:13831) by sssd uid:0
[Wed Nov 16 10:39:34 2016] SIGTERM was sent to oddjobd (pid:10391) by oddjobd uid:0
[Wed Nov 16 10:39:35 2016] SIGTERM was sent to oddjobd (pid:22422) by oddjobd uid:0
The SSSD debug logs should give some indication of what's happening around the Nov 16 10:39:34 timeframe.
The primary SSSD service sends heartbeat pings to the other SSSD services; if a service does not respond to 3 pings, SSSD will attempt to send it a SIGTERM.
Note: https://fedorahosted.org/sssd/wiki/DevelTips#WhenIdebuganSSSDprocessinadebug...
The 'timeout' value in sssd.conf configures the time interval between pings, which defaults to 10 seconds but can be increased (it can be added to each section of sssd.conf).
For example:
[sssd]
timeout = 60
[nss]
timeout = 60
[pam]
timeout = 60
[sudo]
timeout = 60
[domain/MYDOMAIN]
timeout = 60
Kind regards, Justin Stephenson
This is correct and useful information but only valid up to and including sssd-1.13. In 1.14, we switched to 'watchdog' which means the services are no longer watched using ping-pongs from the monitor, but have a built-in timer that resets periodically. If the timer doesn't reset within the 'timeout' interval, the service kills itself.
I amended the DevelTips page to include this information.
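For comparison with the ping model, the watchdog idea described above (a timer inside the service that must be reset within the 'timeout' interval) can be sketched as follows. This is a toy model written for this thread, not SSSD's implementation; the class name Watchdog and its methods are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class Watchdog:
    """Toy watchdog: reports expiry if it is not reset within `timeout` seconds."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_reset = now

    def reset(self, now):
        # Called periodically by a healthy service.
        self.last_reset = now

    def expired(self, now):
        # True once more than `timeout` seconds have passed since the last reset.
        return now - self.last_reset > self.timeout


dog = Watchdog(timeout=10, now=0)
dog.reset(now=8)             # the service is still making progress
print(dog.expired(now=18))   # exactly 10 seconds since the reset
print(dog.expired(now=19))   # 11 seconds since the reset
```

The difference from the ping scheme is where the check lives: here the service watches itself, so a stuck service is the one that notices the missed reset.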
Hi Jakub - thanks for your input.
FYI: Red Hat Enterprise Linux Server release 6.5 (Santiago) - 2.6.32-431.el6.x86_64
SSSD version: sssd-1.13.3-22.el6_8.4.x86_64
So I assume the timeout setting is still valid for this version.
There have been no terminations in the last 24 hours; however, we sometimes went that long without one under the previous configuration too. As I mentioned, I have disabled debugging for now in case it was causing problems. If I hit more terminations, I will re-enable debugging and send over the logs.
Thanks for your help
Just an update on this.
After a lot of debugging, signal tapping and head scratching it has been discovered that there is a password reset script run from an external system (bladelogic) which for some unknown reason decides to run "authconfig --update" after doing whatever it does. This stops the SSSD service and doesn't automatically restart it - is this normal behaviour?
A big thank you to Justin and Jakub for your help
On Thu, Nov 24, 2016 at 11:42:52AM -0000, richard.y.collins@aib.ie wrote:
Just an update on this.
After a lot of debugging, signal tapping and head scratching it has been discovered that there is a password reset script run from an external system (bladelogic) which for some unknown reason decides to run "authconfig --update" after doing whatever it does. This stops the SSSD service and doesn't automatically restart it - is this normal behaviour?
This sounds like an authconfig bug to me, but I haven't been able to find a related bugzilla. You should consider filing one.