What log level are you running? I'm also seeing random sssd halts, but without a backtrace dump. It looks like systemd recycles sssd every 15 minutes. It usually restarts within 0.1 seconds, but occasionally it doesn't come back until the next 15-minute cycle, and in the meantime logins and cron jobs fail. As soon as I raised the log level to 9, the system with the highest failure count settled down and ran with no problems. Engineer's paradox.
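
For reference, here is a rough sketch of how the log level could be raised in every section of sssd.conf in one go. It only works on a config held in a string; the sample config below is hypothetical (the service list and id_provider are my assumptions, the domain name is borrowed from your log), and the helper name with_debug_level is made up for illustration. debug_level is the sssd.conf option I changed by hand.

```python
import configparser
import io

# Hypothetical sssd.conf contents, for illustration only.
SAMPLE_CONF = """\
[sssd]
services = nss, pam, pac
domains = abba.xx.priv.yy

[nss]

[domain/abba.xx.priv.yy]
id_provider = ipa
"""


def with_debug_level(conf_text, level=9):
    """Return conf_text with debug_level set in every section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    for section in parser.sections():
        parser[section]["debug_level"] = str(level)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(with_debug_level(SAMPLE_CONF))
```

Note that configparser drops comments when it rewrites a file, so on a real system I would edit a copy and compare before replacing anything, then restart sssd.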

On July 21, 2022 2:41:42 AM EDT, lejeczek via FreeIPA-users <freeipa-users@lists.fedorahosted.org> wrote:
Hi guys.

One of the masters recently started finding SSSD dead, and the log
says the killer is the WATCHDOG, but I'm not sure about that.
From sssd.log:
...
********************** BACKTRACE DUMP ENDS HERE
*********************************

(2022-07-21  7:11:01): [sssd] [svc_child_info] (0x0020):
Child [991] ('pac':'pac') was terminated by own WATCHDOG
   *  ... skipping repetitive backtrace ...
(2022-07-21  7:11:14): [sssd] [svc_child_info] (0x0020):
Child [984] ('abba.xx.priv.yy':'%BE_abba.xx.priv.yy') was
terminated by own WATCHDOG
   *  ... skipping repetitive backtrace ...
(2022-07-21  7:11:14): [sssd] [svc_child_info] (0x0040):
Child [9744] ('nss':'nss') exited with code [3]
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE
FOLLOWING BACKTRACE:
   *  (2022-07-21  7:11:14): [sssd]
[sbus_dispatch_reconnect] (0x0400): Connection lost.
Terminating active requests.
   *  (2022-07-21  7:11:14): [sssd]
[sbus_dispatch_reconnect] (0x4000): Remote client terminated
the connection. Releasing data...
   *  (2022-07-21  7:11:14): [sssd] [sbus_connection_free]
(0x4000): Connection 0x5576314d9180 will be freed during
next loop!
   *  (2022-07-21  7:11:14): [sssd] [mt_svc_restart]
(0x0400): Scheduling service abba.xx.priv.yy for restart 1
   *  (2022-07-21  7:11:14): [sssd] [get_provider_config]
(0x0100): Formed command '/usr/libexec/sssd/sssd_be --domain
abba.xx.priv.yy --uid 0 --gid 0 --logger=files' for provider
'%BE_abba.xx.priv.yy'
   *  (2022-07-21  7:11:14): [sssd] [start_service]
(0x0100): Queueing service abba.xx.priv.yy for startup
   *  (2022-07-21  7:11:14): [sssd] [mt_svc_exit_handler]
(0x1000): SIGCHLD handler of service nss called
   *  (2022-07-21  7:11:14): [sssd] [svc_child_info]
(0x0040): Child [9744] ('nss':'nss') exited with code [3]
********************** BACKTRACE DUMP ENDS HERE
*********************************

(2022-07-21  7:11:14): [sssd] [svc_child_info] (0x0040):
Child [9758] ('pac':'pac') exited with code [3]
   *  ... skipping repetitive backtrace ...
(2022-07-21  7:11:16): [sssd] [svc_child_info] (0x0040):
Child [9876] ('nss':'nss') exited with code [3]
   *  ... skipping repetitive backtrace ...
(2022-07-21  7:11:16): [sssd] [svc_child_info] (0x0040):
Child [9877] ('pac':'pac') exited with code [3]
   *  ... skipping repetitive backtrace ...
(2022-07-21  7:11:20): [sssd] [svc_child_info] (0x0040):
Child [9903] ('nss':'nss') exited with code [3]
   *  ... skipping repetitive backtrace ...
(2022-07-21  7:11:20): [sssd] [monitor_restart_service]
(0x0010): Process [nss], definitely stopped!
(2022-07-21  7:11:20): [sssd] [monitor_quit] (0x3f7c0):
Returned with: 1
(2022-07-21  7:11:20): [sssd] [monitor_quit] (0x3f7c0):
Terminating [pac][9904]
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Child [pac] terminated with a signal
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Terminating [abba.xx.priv.yy][9875]
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Child [abba.xx.priv.yy] exited gracefully
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Terminating [sudo][990]
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Child [sudo] exited gracefully
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Terminating [ssh][989]
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Child [ssh] exited gracefully
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Terminating [ifp][988]
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Child [ifp] exited gracefully
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Terminating [pam][987]
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Child [pam] exited gracefully
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Terminating [implicit_files][983]
(2022-07-21  7:11:21): [sssd] [monitor_quit] (0x3f7c0):
Child [implicit_files] exited gracefully

This "death" happens randomly, or at least it looks random to me.
It can be just after a reboot or after several hours of uptime.
There is more in the log files in /var/log/sssd, but before I
clutter the list with more log snippets I was hoping some
expert could share some thoughts.
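
One way to summarize such logs instead of pasting them: a minimal sketch that tallies the svc_child_info events per service, assuming the message wording is exactly as in the excerpt above. The function name tally_child_events and the embedded sample are only for illustration; the sample lines are copied from the excerpt, including one line that the backtrace dump repeats.

```python
import re
from collections import Counter

# Sample copied from the excerpt above (mail-wrapped, with one repeated line).
SAMPLE = """\
(2022-07-21  7:11:01): [sssd] [svc_child_info] (0x0020):
Child [991] ('pac':'pac') was terminated by own WATCHDOG
(2022-07-21  7:11:14): [sssd] [svc_child_info] (0x0020):
Child [984] ('abba.xx.priv.yy':'%BE_abba.xx.priv.yy') was
terminated by own WATCHDOG
(2022-07-21  7:11:14): [sssd] [svc_child_info] (0x0040):
Child [9744] ('nss':'nss') exited with code [3]
   *  (2022-07-21  7:11:14): [sssd] [svc_child_info]
(0x0040): Child [9744] ('nss':'nss') exited with code [3]
"""

CHILD_RE = re.compile(
    r"Child \[(?P<pid>\d+)\] \('(?P<name>[^']+)':'[^']*'\) "
    r"(?:was terminated by own WATCHDOG|exited with code \[(?P<code>\d+)\])"
)


def tally_child_events(log_text):
    """Count watchdog kills and exit codes per service name."""
    # Collapse all whitespace so mail-wrapped messages become one line.
    flat = " ".join(log_text.split())
    seen = set()
    counts = Counter()
    for match in CHILD_RE.finditer(flat):
        code = match.group("code")
        kind = "watchdog" if code is None else f"exit {code}"
        key = (match.group("pid"), kind)
        if key in seen:
            continue  # same event repeated inside a backtrace dump
        seen.add(key)
        counts[(match.group("name"), kind)] += 1
    return counts


if __name__ == "__main__":
    for (name, kind), n in sorted(tally_child_events(SAMPLE).items()):
        print(f"{name}: {kind} x{n}")
```

Deduplicating on PID and event type is a simplification: it assumes a PID is not reused for the same service within one log file.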

many thanks, L.

--
Computers amplify human error
Super computers are really cool