On Tue, Aug 31, 2021 at 6:47 PM Spike White <spikewhitetx(a)gmail.com> wrote:
All,
OK, we have a query we run in AD for machine account passwords of a
certain age (in today's run, 31 - 32 days). Then we verify each host is pingable.
We found one such suspicious candidate today (two actually, but
the other Linux server is quite sick). So one good research candidate.
According to both AD and /etc/krb5.keytab file, the machine account
password was last set on 7/29. Today is 8/31, so that would be 32 days.
This 'automatic machine account keytab renewal' background task should
trigger again today.
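For anyone wanting to reproduce the age check, here is a minimal sketch in Python. It assumes the AD side of the comparison reads the computer object's pwdLastSet attribute, which AD stores as a Windows FILETIME (100-nanosecond intervals since 1601-01-01 UTC). The function names and the 31 - 32 day window defaults are illustrative only, not our actual query.

```python
from datetime import datetime, timedelta, timezone

# Windows FILETIME counts 100-nanosecond ticks from this instant (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a FILETIME value (e.g. pwdLastSet) to an aware UTC datetime."""
    # 10 ticks of 100 ns make one microsecond.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def password_age_days(pwd_last_set: int, now: datetime) -> int:
    """Whole days elapsed between pwdLastSet and 'now' (both in UTC)."""
    return (now - filetime_to_datetime(pwd_last_set)).days


def in_age_window(pwd_last_set: int, now: datetime,
                  low: int = 31, high: int = 32) -> bool:
    """True if the password age falls inside the (hypothetical) report window."""
    return low <= password_age_days(pwd_last_set, now) <= high
```

Passing datetime.now(timezone.utc) as 'now' gives the current age. Keeping everything in UTC avoids off-by-one-day surprises when the AD timestamp and the local keytab timestamp are compared across time zones.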
The sssd service was last started 2 weeks ago and, by all appearances, is
healthy. sssctl domain-status <domain> shows online, connected to AD
servers (both domain and GC servers). All logins and group enumerations are
working as expected.
Just now, we dynamically set the debug level to 9 with 'sssctl debug-level
9'. This particular server is Oracle Linux 8.4,
running sssd-*-2.4.0-9.0.1.el8_4.1.x86_64. Installed July 13th, 2021. So
-- very recent sssd version. (This problem occurs with both RHEL & OL
6/7/8; today's candidate just happens to be OL8.)
We can't keep debug level 9 up for many days; it swamps the
/var/log filesystem. But we can leave it up for a few days. We purposely did
not restart the sssd service, as we know that would trigger a machine account
renewal.
Speaking of that -- from Sumit's sssd source code in
ad_provider/ad_machine_pw_renewal.c, it appears that sssd is creating a
back-end task to call external program /usr/sbin/adcli with certain args.
What string can I look for in which sssd log file (now that I have debug
level 9 enabled) to tell me when this 'adcli update' task (aka 'automatic
machine account keytab renewal') is triggered?
It seems SSSD itself only logs in case of errors. I didn't find any
explicit logs around `ad_machine_account_password_renewal_send()`.
But perhaps there will be something like "[be_ptask_execute] (0x0400): Task
[AD machine account password renewal]: executing task" from generic
be_ptask_* helpers in the sssd_$domain.log (I'm not sure).
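A small Python sketch for scanning the domain log for such lines follows. The task name is copied from the guessed message above and is not confirmed, so treat both it and the "be_ptask" marker as assumptions to adjust once a real log line is seen. The function name is made up for illustration.

```python
from typing import Iterable, Iterator

# Assumed strings, taken from the guessed log message; adjust as needed.
PTASK_MARKER = "be_ptask"
TASK_NAME = "AD machine account password renewal"


def find_renewal_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines that mention both the ptask helper and the renewal task."""
    for line in lines:
        if PTASK_MARKER in line and TASK_NAME in line:
            yield line.rstrip("\n")
```

It accepts any iterable of lines, so an open file object for the sssd_$domain.log file can be passed in directly and the matches printed in a loop.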
Also, at this verbosity level `--verbose` should be supplied to adcli itself,
and I guess its output should be captured in sssd_$domain.log as well. I'm not
familiar with `adcli` internals; you can take a look at
https://gitlab.freedesktop.org/realmd/adcli to find its log messages.
I'm less certain now that we've surveyed our env that this background
'adcli update' task is the reason behind 70 - 80 servers / month dropping
off the domain. It might be a slight contributor, but I find only a very
few pingable servers whose machine account password was last renewed between
30 and 40 days ago.
Yes, I can disable this default 30 day automatic update and roll my own
'adcli update' cron. But that's a mass deployment, to fix what might not
be the problem. I want to verify this is the actual culprit before I take
those drastic steps.
Spike