Hi,
What is in sssd_example.com.log? It seems like most (all?) of the errors you quote are due to the sssd_be[example.com] process hanging or crashing.
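To get useful detail into that backend log, the domain's debug level can be raised in sssd.conf. A minimal sketch, assuming the domain section is named after the thread's example.com (adjust to the real config):

```ini
# /etc/sssd/sssd.conf -- raise backend logging for the domain
# (section name and level are illustrative; keep your existing options)
[domain/example.com]
debug_level = 9
```

After restarting sssd, the backend writes its detailed output to /var/log/sssd/sssd_example.com.log.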
On Wed, Nov 4, 2020 at 8:56 PM Josh Sonstroem <jsonstro(a)ucsc.edu> wrote:
Hi SSSD users,
I need some help debugging a strange failure we are having with sssd.
We've got an issue on a small CentOS Linux cluster in our data center that
uses GSSAPI and Active Directory for auth. On the head nodes the sssd
service periodically enters an error state and begins restarting every
few minutes, and while it is restarting, host authentication fails. From the
logs it appears that sssd is in a loop, restarting itself every two minutes
or so. Numerous logs are written while this issue is occurring; I've
attached snippets below.
I've been on quite a journey over the past few weeks trying to get
sssd+AD working again in our environment. A few months ago our AD admins
added a new site to our domain that is unreachable from on premise. This
led to a bunch of strange issues until we set up our configs to limit which
DCs we talk to. Despite limiting the hosts to the proper AD site and
servers, we continue to have periodic errors on these busy head nodes. While
in the error state, the sssctl domain-status command will usually fail with
the following error:
# sssctl domain-status
example.com
Unable to get online status [3]: Communication error
org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible
causes include: the remote application did not send a reply, the message
bus security policy blocked the reply, the reply timeout expired, or the
network connection was broken.
Check that SSSD is running and the InfoPipe responder is enabled. Make
sure 'ifp' is listed in the 'services' option in sssd.conf.
Unable to get online status
'ifp' is listed in the services option, and on working nodes this command succeeds.
To describe what happens in the failed state, I'll walk through the logs.
The first indication of a problem is the following entries in
/var/log/messages once the restart loop starts:
Nov 4 10:53:00 host-0 sssd[be[example.com]]: Shutting down
Nov 4 10:53:00 host-0 sssd[be[example.com]]: Starting up

These then repeat every few minutes.
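A quick way to gauge how often the backend is cycling is to count the "Starting up" entries. The sketch below builds a small inline sample for illustration; on a real host, point grep at /var/log/messages instead:

```shell
# Build a small sample of the messages log for illustration
cat > /tmp/messages.sample <<'EOF'
Nov 4 10:53:00 host-0 sssd[be[example.com]]: Shutting down
Nov 4 10:53:00 host-0 sssd[be[example.com]]: Starting up
Nov 4 10:55:07 host-0 sssd[be[example.com]]: Shutting down
Nov 4 10:55:07 host-0 sssd[be[example.com]]: Starting up
EOF

# Count how many times the example.com backend restarted
grep -c 'sssd\[be\[example.com\]\]: Starting up' /tmp/messages.sample
```

Comparing the count across hour-sized windows of the real log shows whether the restarts correlate with load or with a fixed interval.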
The krb5_child log shows errors like this while the backend is down:
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] (0x0100): cmd [241] uid [266812] gid [286812] validate [true] enterprise principal [true] offline [false] UPN [user(a)EXAMPLE.COM]
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] (0x0100): ccname: [KEYRING:persistent:286812] old_ccname: [not set] keytab: [/etc/krb5.keytab]
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [check_use_fast] (0x0100): Not using FAST.
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [privileged_krb5_setup] (0x0080): Cannot open the PAC responder socket
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] (0x0100): No specific renewable lifetime requested.
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] (0x0100): No specific lifetime requested.
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_canonicalize_option] (0x0100): Canonicalization is set to [true]
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [sss_send_pac] (0x0040): sss_pac_make_request failed [-1][2].
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [validate_tgt] (0x0040): sss_send_pac failed, group membership for user with principal [user(a)EXAMPLE.COM] might not be correct.
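The "Cannot open the PAC responder socket" and sss_pac_make_request failures mean krb5_child could not reach the PAC responder over its socket under /var/lib/sss/pipes/. One thing worth checking is whether 'pac' appears in the services list; a sketch of the relevant fragment, assuming an explicit services line (some setups enable the responder differently):

```ini
# /etc/sssd/sssd.conf -- ensure the PAC responder is enabled
# (service list is illustrative; keep whatever else you already run)
[sssd]
services = nss, pam, ifp, pac
```

If the PAC responder is enabled but the socket still cannot be opened, that points back at the responders being torn down during the backend restarts.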
The nss log shows these failures, followed by an auto-reconnect, after entries like "[cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error: 3, 5, (null)"; see below:
(Wed Nov 4 10:52:28 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error: 3, 5, (null)
(Wed Nov 4 10:52:58 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1174: Data Provider Error: 3, 5, (null)
(Wed Nov 4 10:53:00 2020) [sssd[nss]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:53:01 2020) [sssd[nss]] [nss_dp_reconnect_init] (0x0020): Reconnected to the Data Provider.
(Wed Nov 4 10:53:01 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1178: Data Provider Error: 3, 5, (null)
The ifp log shows the following; note the "killing children" entry at the bottom:
(Wed Nov 4 10:53:00 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:55:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:57:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:59:10 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 11:01:11 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 11:02:01 2020) [sssd[ifp]] [orderly_shutdown] (0x0010): SIGTERM: killing children
Usually clearing the cache and restarting sssd is enough to fix this
issue, but not always. It's hard to diagnose because it affects production
and people's ability to even use the cluster, and I can't recreate it on
demand.
Any idea of what might be causing these failures?
Thanks in advance,
Josh
_______________________________________________
sssd-users mailing list -- sssd-users(a)lists.fedorahosted.org