Hi,

What is in sssd_example.com.log?

It seems like most (all?) of the errors you quote are due to the sssd_be[example.com] process hanging or crashing.
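
If that log is empty or not very verbose, it may be worth bumping debug_level in the [domain/example.com] section and restarting sssd so the backend hang/crash shows up there, e.g.:

[domain/example.com]
debug_level = 9

If systemd-coredump is in use on those hosts, coredumpctl may also tell you whether sssd_be actually crashed rather than hung.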


On Wed, Nov 4, 2020 at 8:56 PM Josh Sonstroem <jsonstro@ucsc.edu> wrote:

Hi SSSD users,

I need some help debugging a strange failure we are having with sssd. We've got an issue on a small CentOS Linux cluster in our data center that uses GSSAPI and Active Directory for auth. On the head nodes, sssd periodically enters an error state and begins restarting every few minutes, and while it is restarting, host authentication fails. From the logs it appears that sssd is stuck in a loop, restarting itself roughly every two minutes. Numerous logs are written while this issue is occurring; I've attached snippets below.

I've been on quite a journey over the past few weeks trying to get sssd+AD working again in our environment. A few months ago our AD admins added a new site to our domain that is unreachable from on premises. This led to a bunch of strange issues until we set up our configs to limit which DCs we talk to, but despite pinning the hosts to the proper AD site and servers we continue to have periodic errors on these busy head nodes. While in the error state, the sssctl domain-status command usually fails with the following error:

# sssctl domain-status example.com
Unable to get online status [3]: Communication error
org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Check that SSSD is running and the InfoPipe responder is enabled. Make sure 'ifp' is listed in the 'services' option in sssd.conf.
Unable to get online status
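
For reference, the DC/site pinning mentioned above looks roughly like this in our sssd.conf (server and site names are placeholders):

[sssd]
services = nss, pam, ifp
domains = example.com

[domain/example.com]
id_provider = ad
access_provider = ad
ad_site = OnPremSite
ad_server = dc1.example.com, dc2.example.com
ad_backup_server = dc3.example.com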

ifp is enabled in the services list, and on working nodes this command succeeds. To describe what happens in the failed state, I'll walk through the logs. The first indication of a problem is that /var/log/messages shows the following entries when the restart loop starts:

12245:Nov  4 10:53:00 host-0 sssd[be[example.com]]: Shutting down
12246:Nov  4 10:53:00 host-0 sssd[be[example.com]]: Starting up

These entries then repeat every few minutes.

The krb5_child log shows errors like this when the ifp backend is not working:

(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] (0x0100): cmd [241] uid [266812] gid [286812] validate [true] enterprise principal [true] offline [false] UPN [user@EXAMPLE.COM]
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] (0x0100): ccname: [KEYRING:persistent:286812] old_ccname: [not set] keytab: [/etc/krb5.keytab]
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [check_use_fast] (0x0100): Not using FAST.
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [privileged_krb5_setup] (0x0080): Cannot open the PAC responder socket
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] (0x0100): No specific renewable lifetime requested.
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] (0x0100): No specific lifetime requested.
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_canonicalize_option] (0x0100): Canonicalization is set to [true]
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [sss_send_pac] (0x0040): sss_pac_make_request failed [-1][2].
(Wed Nov  4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [validate_tgt] (0x0040): sss_send_pac failed, group membership for user with principal [user\@EXAMPLE.COM] might not be correct.

The nss log shows repeated failures like "[cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error: 3, 5, (null)" followed by an auto-reconnect; see below:

(Wed Nov  4 10:52:28 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error: 3, 5, (null)
(Wed Nov  4 10:52:58 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1174: Data Provider Error: 3, 5, (null)
(Wed Nov  4 10:53:00 2020) [sssd[nss]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov  4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:53:01 2020) [sssd[nss]] [nss_dp_reconnect_init] (0x0020): Reconnected to the Data Provider.
(Wed Nov  4 10:53:01 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1178: Data Provider Error: 3, 5, (null)

The ifp log shows the following; note the "killing children" message at the bottom:

(Wed Nov  4 10:53:00 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov  4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:55:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov  4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:57:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov  4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:59:10 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov  4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 11:01:11 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov  4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov  4 11:02:01 2020) [sssd[ifp]] [orderly_shutdown] (0x0010): SIGTERM: killing children

Usually clearing the cache and restarting sssd is enough to fix this, but not always. It's hard to diagnose because it affects production and people's ability to even use the cluster, and I can't reproduce it on demand.
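
By "clearing the cache" I mean something along these lines (the exact steps vary a bit):

# sss_cache -E
or, for a full reset:
# systemctl stop sssd
# rm -f /var/lib/sss/db/*.ldb
# systemctl start sssd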

Any idea of what might be causing these failures?

Thanks in advance,
Josh