OK, we have a query we run in AD for machine account passwords of a certain age; in today's run, 31 to 32 days. Then we verify that each matching machine is pingable.
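For anyone wanting to reproduce the age check, here is a minimal sketch. It assumes the query reads the raw pwdLastSet attribute, which AD stores as a count of 100-nanosecond intervals since 1601-01-01 UTC. The function names and the host names in the usage note are hypothetical, not part of our actual tooling.

```python
from datetime import datetime, timedelta, timezone

# AD timestamps such as pwdLastSet count 100-ns intervals from this epoch (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def pwd_last_set_to_datetime(pwd_last_set: int) -> datetime:
    """Convert a raw pwdLastSet value to an aware UTC datetime."""
    # 10 intervals of 100 ns make one microsecond.
    return FILETIME_EPOCH + timedelta(microseconds=pwd_last_set // 10)


def password_age_days(pwd_last_set: int, now: datetime) -> int:
    """Whole days elapsed since the machine account password was last set."""
    return (now - pwd_last_set_to_datetime(pwd_last_set)).days


def candidates(accounts: dict, now: datetime, min_days: int = 31, max_days: int = 32) -> list:
    """Return host names whose password age falls within [min_days, max_days].

    `accounts` maps a (hypothetical) host name to its raw pwdLastSet value.
    """
    return sorted(
        host
        for host, raw in accounts.items()
        if min_days <= password_age_days(raw, now) <= max_days
    )
```

Usage would look like `candidates({"hostA": raw_value_a, "hostB": raw_value_b}, datetime.now(timezone.utc))`. Because the age is truncated to whole days, a password set late on one day and checked early on another can land a day lower than a simple calendar-date difference suggests.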
We found one such suspicious candidate today (two actually, but the other Linux server is quite sick). So one good research candidate. According to both AD and the /etc/krb5.keytab file, the machine account password was last set on 7/29. Today is 8/31, so that would be 32 days. This 'automatic machine account keytab renewal' background task should trigger again today.
The sssd service was last started 2 weeks ago and, by all appearances, is healthy. 'sssctl domain-status <domain>' shows online, connected to AD servers (both domain and GC servers). All logins and group enumerations are working as expected.
Just now, we dynamically set the debug level to 9 with 'sssctl debug-level 9'. This particular server is Oracle Linux 8.4, running sssd-*-2.4.0-9.0.1.el8_4.1.x86_64. Installed July 13th, 2021. So -- very recent sssd version. (This problem occurs with both RHEL & OL 6/7/8; today's candidate just happens to be OL8.)
We can't keep debug level 9 up for a great many days; it swamps the /var/log filesystem. But we can leave it up for a few days. We purposely did not restart the sssd service, as we know that would trigger a machine account renewal.
Speaking of that -- from Sumit's sssd source code in ad_provider/ad_machine_pw_renewal.c, it appears that sssd is creating a back-end task to call external program /usr/sbin/adcli with certain args. What string can I look for in which sssd log file (now that I have debug level 9 enabled) to tell me when this 'adcli update' task (aka 'automatic machine account keytab renewal') is triggered?
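Until someone can point to the exact message, one fallback is a broad keyword scan of the per-domain back-end log (/var/log/sssd/sssd_<domain>.log), since the renewal task lives in the AD provider. The sketch below is only that. The search terms "adcli" and "renew" are guesses at what a relevant line might contain, not confirmed sssd log strings, and the function names are hypothetical.

```python
import re

# ASSUMPTION: these keywords are guesses, not verified sssd debug messages.
KEYWORDS = re.compile(r"adcli|renew", re.IGNORECASE)


def matching_lines(lines):
    """Return the lines (without trailing newline) that mention any keyword."""
    return [line.rstrip("\n") for line in lines if KEYWORDS.search(line)]


def scan_log(path):
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as fh:
        return matching_lines(fh)
```

For example, `scan_log("/var/log/sssd/sssd_<domain>.log")` with the real domain name substituted. At debug level 9 the log is large, so the scan reads the file line by line rather than loading it whole.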
Now that we've surveyed our env, I'm less certain that this background 'adcli update' task is the reason behind 70 - 80 servers / month dropping off the domain. It might be a slight contributor, but I find only a very few pingable servers with a machine account last-renewal date between 30 and 40 days ago.
Yes, I can disable this default 30-day automatic update and roll my own 'adcli update' cron. But that's a mass deployment, to fix what might not be the problem. I want to verify this is the actual culprit before I take those drastic steps.