Hi,
What is in sssd_example.com.log? It seems like most (all?) of the errors you quote are due to the sssd_be[example.com] process hanging or crashing.
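To get useful detail into that backend log, the domain's debug level can be raised in sssd.conf. A minimal sketch, assuming the domain section is named after the thread's example.com (adjust to the real config):

```ini
# /etc/sssd/sssd.conf -- raise backend logging for the domain
# (section name and level are illustrative; keep your existing options)
[domain/example.com]
debug_level = 9
```

After restarting sssd, the backend writes its detailed output to /var/log/sssd/sssd_example.com.log.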
On Wed, Nov 4, 2020 at 8:56 PM Josh Sonstroem <jsonstro(a)ucsc.edu> wrote:
Hi SSSD users,
I need some help debugging a strange failure we are having with sssd.
We've got an issue on a small CentOS Linux cluster in our data center that
uses GSSAPI and Active Directory for auth. On the head nodes the sssd
service periodically enters an error state and begins restarting every
few minutes, and while it is restarting, host authentication fails. From the
logs it appears that sssd is in a loop, restarting itself every two minutes
or so. Numerous logs are written while this issue is occurring; I've
attached snippets below.
I've been on quite a journey over the past few weeks trying to get
sssd+AD working again in our environment. A few months ago our AD admins
added a new site to our domain that is unreachable from on premise. This
led to a bunch of strange issues until we set up our configs to limit which
DCs we talk to. Despite limiting the hosts to the proper AD site and
servers, we continue to have periodic errors on these busy head nodes. While
in the error state, the sssctl domain-status command will usually fail with
the following error:
# sssctl domain-status
example.com
Unable to get online status [3]: Communication error
org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible
causes include: the remote application did not send a reply, the message
bus security policy blocked the reply, the reply timeout expired, or the
network connection was broken.
Check that SSSD is running and the InfoPipe responder is enabled. Make
sure 'ifp' is listed in the 'services' option in sssd.conf.
Unable to get online status
'ifp' is listed in the services option, and on working nodes this command succeeds.
To describe what happens in the failed state, I'll walk through the logs.
The first indication of a problem is the following entries in
/var/log/messages once the restart loop starts:
Nov 4 10:53:00 host-0 sssd[be[example.com]]: Shutting down
Nov 4 10:53:00 host-0 sssd[be[example.com]]: Starting up

These then repeat every few minutes.
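A quick way to gauge how often the backend is cycling is to count the "Starting up" entries. The sketch below builds a small inline sample for illustration; on a real host, point grep at /var/log/messages instead:

```shell
# Build a small sample of the messages log for illustration
cat > /tmp/messages.sample <<'EOF'
Nov 4 10:53:00 host-0 sssd[be[example.com]]: Shutting down
Nov 4 10:53:00 host-0 sssd[be[example.com]]: Starting up
Nov 4 10:55:07 host-0 sssd[be[example.com]]: Shutting down
Nov 4 10:55:07 host-0 sssd[be[example.com]]: Starting up
EOF

# Count how many times the example.com backend restarted
grep -c 'sssd\[be\[example.com\]\]: Starting up' /tmp/messages.sample
```

Comparing the count across hour-sized windows of the real log shows whether the restarts correlate with load or with a fixed interval.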
The krb5_child log shows errors like this while the backend is down:
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] (0x0100): cmd [241] uid [266812] gid [286812] validate [true] enterprise principal [true] offline [false] UPN [user(a)EXAMPLE.COM]
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [unpack_buffer] (0x0100): ccname: [KEYRING:persistent:286812] old_ccname: [not set] keytab: [/etc/krb5.keytab]
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [check_use_fast] (0x0100): Not using FAST.
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [privileged_krb5_setup] (0x0080): Cannot open the PAC responder socket
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] (0x0100): No specific renewable lifetime requested.
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_lifetime_options] (0x0100): No specific lifetime requested.
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [set_canonicalize_option] (0x0100): Canonicalization is set to [true]
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [sss_send_pac] (0x0040): sss_pac_make_request failed [-1][2].
(Wed Nov 4 10:51:30 2020) [[sssd[krb5_child[263295]]]] [validate_tgt] (0x0040): sss_send_pac failed, group membership for user with principal [user(a)EXAMPLE.COM] might not be correct.
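The "Cannot open the PAC responder socket" and sss_pac_make_request failures mean krb5_child could not reach the PAC responder over its socket under /var/lib/sss/pipes/. One thing worth checking is whether 'pac' appears in the services list; a sketch of the relevant fragment, assuming an explicit services line (some setups enable the responder differently):

```ini
# /etc/sssd/sssd.conf -- ensure the PAC responder is enabled
# (service list is illustrative; keep whatever else you already run)
[sssd]
services = nss, pam, ifp, pac
```

If the PAC responder is enabled but the socket still cannot be opened, that points back at the responders being torn down during the backend restarts.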
The nss log shows these failures, followed by an auto-reconnect, after entries like "[cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error: 3, 5, (null)"; see below:
(Wed Nov 4 10:52:28 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1171: Data Provider Error: 3, 5, (null)
(Wed Nov 4 10:52:58 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1174: Data Provider Error: 3, 5, (null)
(Wed Nov 4 10:53:00 2020) [sssd[nss]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:53:01 2020) [sssd[nss]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:53:01 2020) [sssd[nss]] [nss_dp_reconnect_init] (0x0020): Reconnected to the Data Provider.
(Wed Nov 4 10:53:01 2020) [sssd[nss]] [cache_req_common_dp_recv] (0x0040): CR #1178: Data Provider Error: 3, 5, (null)
The ifp log shows the following; note the "killing children" entry at the bottom:
(Wed Nov 4 10:53:00 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:53:01 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:55:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:55:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:57:06 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:57:07 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:59:10 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 10:59:11 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 11:01:11 2020) [sssd[ifp]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Wed Nov 4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Making reconnection attempt 1 to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 11:01:12 2020) [sssd[ifp]] [sbus_reconnect] (0x0080): Reconnected to [unix:path=/var/lib/sss/pipes/private/sbus-dp_example.com]
(Wed Nov 4 11:02:01 2020) [sssd[ifp]] [orderly_shutdown] (0x0010): SIGTERM: killing children
Usually clearing the cache and restarting sssd is enough to fix this
issue, but not always. It's hard to diagnose because it affects production
and people's ability to even use the cluster, and I can't recreate it on
demand.
Any idea of what might be causing these failures?
Thanks in advance,
Josh
_______________________________________________
sssd-users mailing list -- sssd-users(a)lists.fedorahosted.org