Sssd experts,
*Short summary:* How can we troubleshoot sssd’s ‘Automatic Kerberos Host Keytab Renewal’ process? We have ~0.4% of our Linux servers dropping off the AD domain monthly.
*Longer explanation:*
Over the past two years, we have on-boarded sssd as our Linux AD integration component, largely displacing a former commercial product that did the same.
We have ~20K Linux servers that are sssd-enabled: a mix of RHEL6, RHEL7, RHEL8 and Oracle Linux 6, 7 and 8. We have ~7K Linux servers still on the old commercial product. (For certain edge-case scenarios, such as DMZs, the commercial product works better.)
Our AD forest is a single forest with 4 regional child domains, all with transitive trust. Sssd auto-discovers the parent domain and all 4 child domains without problems whenever a server is adcli-joined to its local regional domain.
Why am I writing this?
Because we are researching an ongoing problem reported by L1 server ops. About 70 – 80 sssd-enabled Linux servers / month drop off the domain. Out of our current sssd-enabled population of ~20K servers, that’s not horrible. But still it should be better. (Our former commercial product did better.)
It’s not limited to one particular OS, OS version, build location or region. We have surveyed; it seems to occur randomly among all OS versions, regions and locations.
To be clear, it’s extremely likely that this behavior arises from some subtle misconfiguration on our part – not from any sssd or adcli or Kerberos bug. We have a couple of configuration improvements we’re pursuing. (Kerberos max ticket lifetime mismatch between AD and /etc/krb5.conf file, for instance.)
We are taking sssd’s default settings for ad_maximum_machine_account_password_age and ad_machine_account_password_renewal_opts. So after 30 days, sssd will attempt daily to renew the host Kerberos keytab file. It should re-attempt daily if not renewed. By company policy, our AD disables any machine accounts that have not renewed their credentials in 40 days. So when we find servers that have dropped off the domain, it’s because they have not renewed their AD machine accounts in 40 days.
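To make the timing in the paragraph above concrete, here is a minimal Python sketch of the arithmetic. It assumes the documented default of `86400:750` (interval and initial delay, both in seconds) for ad_machine_account_password_renewal_opts, and it assumes that one renewal attempt is made per interval once the password is older than the maximum age. The helper names are hypothetical and are not part of sssd.

```python
from typing import Tuple


def parse_renewal_opts(opts: str) -> Tuple[int, int]:
    """Split an 'interval:initial_delay' string (both values in seconds)."""
    interval_s, initial_delay_s = (int(part) for part in opts.split(":"))
    return interval_s, initial_delay_s


def attempts_before_disable(max_age_days: int, disable_after_days: int,
                            interval_s: int) -> int:
    """Count renewal attempts that fit between the renewal threshold
    (max_age_days) and the AD cutoff (disable_after_days)."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    window_s = max(0, disable_after_days - max_age_days) * 86400
    return window_s // interval_s


# Assumed default value of ad_machine_account_password_renewal_opts.
interval_s, initial_delay_s = parse_renewal_opts("86400:750")

# 30-day maximum password age, 40-day AD disable policy (from the post).
print(attempts_before_disable(30, 40, interval_s))  # prints 10
```

Under these assumptions a server gets roughly ten daily chances to renew before AD disables the account, so a server that drops off has most likely failed on every one of them rather than missed a single attempt.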
We have had SRs open with our OS vendors (Red Hat and Oracle respectively) for months now, to no great help. (They gave a few suggestions, but none panned out.)
We thought we were hitting this bug:
https://github.com/SSSD/sssd/issues/4762
But packet captures proved that adcli update is using TCP on RHEL7/8. Thus, this might be a potential problem, but only on RHEL6. (BTW ‘udp_preference_limit = 0’ doesn’t force use of TCP for the kpasswd invocation in RHEL6 – it still uses UDP. Thus, the recommended work-around for this bug doesn’t work.)
So that isn’t our underlying problem.
We’re at a loss now – as you can see, we’re grasping at straws.
How can we troubleshoot sssd’s ‘automatic Kerberos Host keytab renewal’ process? Whenever we inspect a particular server it works. We can’t run all sssd clients at debug level 9; it fills up /var/log filesystem after a few days of that. We’re interested in troubleshooting that one particular sssd process on all clients; not all parts of sssd.
Other than a steep learning curve (on our part), obscure situations (like DMZ auto-discovery of AD controllers) and exotic scenarios (like above), we’re quite happy with our 2 yr journey of direct AD integration with sssd. Obviously, the troubleshooting tools on RHEL6 are very minimal. But certainly, overall the quality of sssd on RHEL7/8 is excellent. AD integration has innumerable devils in the details; I’m amazed that sssd performs as well as it does against our multi-domain forest.
Spike
PS the problem with sssd auto-discovery of AD controllers in DMZs has been fixed in a recent sssd release. A better discovery algorithm was implemented – the same one used by Windows clients and commercial products. It’s just that this recent sssd version is not available on RHEL7 or 8.
On 8/25/21 8:32 AM, Spike White wrote:
Because we are researching an ongoing problem reported by L1 server ops. About 70 – 80 sssd-enabled Linux servers / month drop off the domain. Out of our current sssd-enabled population of ~20K servers, that’s not horrible. But still it should be better. (Our former commercial product did better.)
It’s not limited to one particular OS, OS version, build location or region. We have surveyed; it seems to occur randomly among all OS versions, regions and locations. ...
We are taking sssd’s default settings for ad_maximum_machine_account_password_age and ad_machine_account_password_renewal_opts. ... So when we find servers that have dropped off the domain, it’s because they have not renewed their AD machine accounts in 40 days.
We had similar symptoms on CentOS systems at my previous employer, however, I'm mostly sure that they were resolved by an sssd update sometime in the last year or two. Are all of your systems fully patched?
If you're seeing the same issue that we saw, one indication would be that running "klist -kt /etc/krb5.keytab" would print a list including two KVNOs on systems after they'd dropped off the domain.
I'd also look up machines in AD to find systems that haven't changed their password in > 40 days, and compare the PasswordLastSet date for a system you're examining to the dates in the klist output:
https://pipe2text.com/?page_id=121
Import-Module ActiveDirectory
$date = [DateTime]::Today.AddDays(-40)
Get-ADComputer -Filter 'PasswordLastSet -le $date' -SearchBase "OU=WhereIStoreComputers,DC=pipe2,DC=Text,DC=com" -Properties PasswordLastSet
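On the Linux side, the keytab check described above can be scripted. The following is a minimal Python sketch that parses the text printed by `klist -kt /etc/krb5.keytab` and collects the KVNOs seen per principal. It assumes the usual MIT layout of KVNO, date, time and principal as whitespace-separated fields; the function names and the sample host are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def kvnos_by_principal(klist_output: str) -> Dict[str, Set[int]]:
    """Map each principal to the set of KVNOs listed for it.

    Assumes rows of the form 'KVNO DATE TIME PRINCIPAL'; header and
    separator lines are skipped because they do not start with a number.
    """
    result = defaultdict(set)
    for line in klist_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].isdigit():
            result[fields[3]].add(int(fields[0]))
    return dict(result)


def highest_kvno(klist_output: str) -> int:
    """Return the highest KVNO found anywhere in the keytab listing."""
    all_kvnos = set()
    for kvnos in kvnos_by_principal(klist_output).values():
        all_kvnos |= kvnos
    if not all_kvnos:
        raise ValueError("no keytab entries found")
    return max(all_kvnos)


# Hypothetical sample in the layout assumed above.
SAMPLE = """\
Keytab name: FILE:/etc/krb5.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
  16 08/29/2021 03:40:11 host/server01.example.com@EXAMPLE.COM
  17 09/28/2021 03:44:23 host/server01.example.com@EXAMPLE.COM
"""

print(highest_kvno(SAMPLE))  # prints 17
```

The timestamps are deliberately treated as opaque text, since their format can vary between systems; comparing them to PasswordLastSet is left to the reader.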
But packet captures proved that adcli update is using TCP on RHEL7/8.
I *think* that our problems went away after an update changed machine password renewal to TCP only. You might have a different problem, but I'd still start with the keytab, and I'd increase the logging level for sssd because the default logging level didn't give us a lot of information to go on when we were troubleshooting password renewal. It doesn't need to be 9, but should be increased to a level that won't cause you operational difficulty.
On Wed, 2021-08-25 at 21:00 -0700, Gordon uynt wrote:
We had similar symptoms on CentOS systems at my previous employer, however, I'm mostly sure that they were resolved by an sssd update sometime in the last year or two. Are all of your systems fully patched?
At the risk of derailing the OP's search for a technical solution to his issue of ensuring Linux hosts consistently refresh their machine account passwords and keytabs, I'd like to ask the community their thoughts on the necessity of doing so.
According to Microsoft:
Significantly increasing the password change interval (or disabling password changes) gives an attacker more time to undertake a brute-force password-guessing attack against one of the machine accounts.[1]
This is an example of an adcli generated machine password:
7I*1>71IFxo%H8OlP<.^#sWI7iMgUA6E[aL*tc9t7M4oWSw+18FcjJC- FJ#Z.#wm@%X6]AgW7*7v,@J3vMGLdGu^(tzqMV+O%Foe50//Gf.=Z9wA)Q+er*K>
...good luck with the brute force attempt.
The MS policy also seems to contradict NIST best practice with their guidance to "eliminate periodic resets"[2] of passwords.
While I recognize you _can_ create your own machine account passwords, in practice I suspect this is rare, and that most people use the tools MS or the opensource community provide. Assuming this is true, why bother with updating the machine account password? To mitigate the compromise of a stolen host keytab or is it to protect against admins who create machine account passwords that can be cracked? If these are the reasons for the policy I'm thinking you have bigger issues in your environment. Or is this a somewhat outdated policy that has historically benefited windows systems and for those of us trying to integrate with Active Directory, it's best to just go along?
I am genuinely curious what the community thinks about this policy, and look forward to learning of the security implications I've failed to consider.
Thank you, Mark
[1]: https://docs.microsoft.com/en-us/windows/security/threat-protection/security...
[2]: https://pages.nist.gov/800-63-3/sp800-63-3.html
On Thu, Aug 26, 2021 at 8:11 PM Christian, Mark mark.christian@intel.com wrote:
[W]hy bother with updating the machine account password?
For sites that have a lot of machine churn, where machine accounts aren't reliably purged from AD when the underlying host is decommissioned, disabling and/or purging machine accounts with old passwords is essentially a garbage collection activity, to prevent stale machine accounts from continuing to exist in AD in perpetuity.
Also, some sites must conform with security guidelines that *require* frequent changes of machine account passwords:
https://www.stigviewer.com/stig/microsoft_windows_server_2016/2021-03-05/fin...
Granted, that STIG rule applies to Windows machine accounts, not Linux machine accounts, but disabling any machine account in AD whose password is older than 30 days is one way to detect any Windows clients that are nonconforming with the STIG. And in many cases it's easier to apply that rule globally than on a per-OU basis (to exempt non-Windows machine accounts).
Sumit and Gordon,
You have given me much to think on and digest. Thanks.
Gordon, we religiously patch monthly. Except for sssd in July, where a new update sssd*-2.4.0-9.0.1.el8_4.1.x86_64 broke our env and we had to roll back the update to previous version sssd*-2.4.0-9.0.1.el8.x86_64 . (We pushed a work-around and rolled out the new sssd version in Aug.) This problem with automatic machine account renewal has been a problem for many months, server ops informs us.
I see Sumit is the author of SOURCES/sssd-x.x.x/src/providers/ad/ad_machine_pw_renewal.c. I see it’s doing a be_ptask_create() to do the AD Machine PW renewal.
Sumit, we have dozens and dozens of dropped-off servers that we can survey. All it takes is some slight time to find a machine account in our OU where 'passwordLastSet' > 40 days. For instance, using the powershell invocation that Gordon calls out. (If it’s too old, chances are it’s a server decomm, where AD was not successfully cleaned up. That happens – rarely.)
Yesterday, we surveyed one such server that had dropped off the domain. We see that the msDS-KeyVersionNumber (in AD) is one higher than the latest KVNO (in the local /etc/krb5.keytab file). This indicates (to me) that AD was able to successfully update the machine account password, but the local be_ptask did not think AD successfully updated the entry, so it did not update the local /etc/krb5.keytab file with the new entry.
Again, this communication failure must be very infrequent, as it’s occurring only 0.3% of the time or so. On random Linux servers – no discernible geographic or build location pattern.
BTW, we are not the AD admins. We’re trying to get the content of the AD DC logs for the specified time period, but these logs are considered highly sensitive. Also, I don’t know how frequently these logs cycle.
On a random test server, we set debug level == 5 and set ad_maximum_machine_account_password_age to less than passwordLastSet duration. I.e., to trigger a machine account password renewal. Then we restarted sssd. Sure enough, 15 mins after sssd service restart, there was a machine account password renewal. We saw it in the /etc/krb5.keytab entries. But we didn’t see any indications of this in the sssd logs when we reviewed them. (Of course, our password renewal was successful – so don’t expect much logging with successful be_ptasks).
Debug level 5 does not seem to fill up the /var/log filesystem to 100% like debug level 9 does. We’ll keep monitoring disk space on this test server. Until we track this down, we might set debug level 5 across all servers.
Spike
_______________________________________________
sssd-users mailing list -- sssd-users@lists.fedorahosted.org
To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.o...
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
As a follow-on to that: to keep themselves clear of debris, configuration management tools also cull objects whose passwordLastSet attribute is older than XX days. We had similar issues when we first implemented SSSD several years ago. We ultimately decided to deploy a cron job with the install that runs periodically (at an interval shorter than the renewal period) to keep the keytab fresh (kinit -R -k $(hostname -s)). We haven't had computers falling off the domain since we implemented that.
Isn’t the issue that the keytab expires and is not renewed in time, so the computer can't change its password because the domain can't verify the keytab? Also, aren't machine passwords different from Kerberos keytabs? They are related, because a machine can't change its password if it can't prove who it is via Kerberos and the keytab. I just want to make sure those aren't being conflated. Similarly, a computer can't update its DNS record in the domain without Kerberos, so if the keytab is not kept fresh, little domain object maintenance can happen, because Kerberos is stale.
Computer account password changes are always initiated by clients, not the domain, even on Windows.
Todd
-----Original Message----- From: James Ralston ralston@pobox.com Sent: Thursday, August 26, 2021 10:31 PM To: End-user discussions about the System Security Services Daemon sssd-users@lists.fedorahosted.org Subject: [SSSD-users]Re: Trouble-shooting sssd’s ‘Automatic Kerberos Host Keytab Renewal’ with AD back-end….
On Wed, Aug 25, 2021 at 10:32:58AM -0500, Spike White wrote:
We are taking sssd’s default settings for ad_maximum_machine_account_password_age and ad_machine_account_password_renewal_opts. So after 30 days, sssd will attempt daily to renew the host Kerberos keytab file. It should re-attempt daily if not renewed. By company policy, our AD disables any machine accounts that have not renewed their credentials in 40 days. So when we find servers that have dropped off the domain, it’s because they have not renewed their AD machine accounts in 40 days.
Hi,
if this happens again, it would be good to check the highest KVNO from the local keytab and the one stored in AD for the affected computer. The LDAP attribute in AD is called 'msDS-KeyVersionNumber' and can be looked up with 'adcli show-computer'.
As long as the KVNO is not reset by your disabling mechanism in AD, this would help to understand which case applies. If SSSD/adcli just didn't update the key, the KVNOs should be the same. If there was a failed update earlier and as a result the client wasn't able to update the key again, the AD KVNO should be 1 higher than the one from the keytab.
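The two cases described above reduce to a small comparison. Here is a minimal Python sketch of that decision, assuming the highest keytab KVNO and the msDS-KeyVersionNumber value have already been collected by other means. The function name and the returned labels are hypothetical.

```python
def classify_kvno_gap(keytab_kvno: int, ad_kvno: int) -> str:
    """Interpret the relationship between the local and the AD KVNO."""
    if ad_kvno == keytab_kvno:
        # Same KVNO on both sides: the key was simply never updated.
        return "not-updated"
    if ad_kvno == keytab_kvno + 1:
        # AD is one ahead: an earlier update changed the password in AD
        # but the new key never reached the local keytab.
        return "ad-updated-keytab-stale"
    # Anything else is outside the two cases described in the thread.
    return "unexpected"


print(classify_kvno_gap(17, 17))  # prints not-updated
print(classify_kvno_gap(17, 18))  # prints ad-updated-keytab-stale
```

Running this over a survey of disabled machine accounts would show quickly whether one case dominates, provided the disabling mechanism does not reset the KVNO.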
Would the users recognise if SSSD on the computer is offline, i.e. if they do not get a fresh Kerberos ticket when logging in? I'm asking because if SSSD is offline 'adcli update' is not called.
bye, Sumit
All,
We took Sumit’s advice and enabled sssd’s debug level 7 on the “domain” section of sssd.conf, on about 2300 non-prod Linux servers.
FYI – beware if you do this! We found occurrences where that sssd_amer.company.com_log was 8 GB after 24 hrs. So you’ll likely have to fine-tune your sssd logrotate file or take even more drastic action.
RECAP: Randomly on 0.24% of our Linux servers, after 30 days sssd will drop off the AD domain. We find this occurs during the automatic Kerberos Host Keytab renewal. The KVNO in AD is one more than the latest KVNO in the /etc/krb5.keytab file.
Due to sssd debug level 7, we now have verbose ‘adcli update’ output in our sssd_<domain>.company.com_log files for two such culprits. Both show the same error condition. Here is example output:
(2021-09-28 3:44:23): [be[amer.company.com]] [ad_machine_account_password_renewal_done] (0x1000): --- adcli output start---
* Found realm in keytab: AMER.COMPANY.COM
* Found computer name in keytab: KEWNLR2CU2APP01
* Found service principal in keytab: host/kewnlr2cu2app01.amer.company.com
* Found host qualified name in keytab: kewnlr2cu2app01.amer.company.com
* Found service principal in keytab: host/KEWNLR2CU2APP01
* Found service principal in keytab: RestrictedKrbHost/KEWNLR2CU2APP01
* Found service principal in keytab: RestrictedKrbHost/kewnlr2cu2app01.amer.company.com
* Using fully qualified name: kewnlr2cu2app01.amer.company.com
* Using domain name: amer.company.com
* Using computer account name: KEWNLR2CU2APP01
* Using domain realm: amer.company.com
* Sending NetLogon ping to domain controller: AUSDC16AMER23.amer.company.com
* Received NetLogon info from: AUSDC16AMER23.amer.company.com
* Wrote out krb5.conf snippet to /tmp/adcli-krb5-HRsQ9K/krb5.d/adcli-krb5-conf-yBNrRI
* Authenticated as default/reset computer account: KEWNLR2CU2APP01
* Using GSS-SPNEGO for SASL bind
* Looked up short domain name: AMERICAS
* Looked up domain SID: S-1-5-21-1802859667-647903414-1863928812
* Using fully qualified name: kewnlr2cu2app01.amer.company.com
* Using domain name: amer.company.com
* Using computer account name: KEWNLR2CU2APP01
* Using domain realm: amer.company.com
* Using fully qualified name: kewnlr2cu2app01.amer.company.com
* Enrolling computer name: KEWNLR2CU2APP01
* Generated 120 character computer password
* Using keytab: FILE:/etc/krb5.keytab
* Found computer account for KEWNLR2CU2APP01$ at: CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com
* Retrieved kvno '17' for computer account in directory: CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com
* Sending NetLogon ping to domain controller: AUSDC16AMER23.amer.company.com
* Received NetLogon info from: AUSDC16AMER23.amer.company.com
! Cannot change computer password: Authentication error
adcli: updating membership with domain amer.company.com failed: Cannot change computer password: Authentication error
---adcli output end---
Within 1.5 mins of the above, we receive errors in /var/log/messages as below:
Sep 28 03:45:51 kewnlr2cu2app01 sssd[ldap_child[288005]][288005]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Preauthentication failed. Unable to create GSSAPI-encrypted LDAP connection.
We verify in the /etc/krb5.keytab file that the latest KVNO is still 17, while in AD the KVNO is now 18. Also, the time of the last password change in AD exactly corresponds to the above:
PS C:\Users\spike_white> get-adcomputer kewnlr2cu2app01 -Property 'PasswordLastSet'
DistinguishedName : CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com
DNSHostName : kewnlr2cu2app01.amer.company.com
...
Name : KEWNLR2CU2APP01
ObjectClass : computer
...
PasswordLastSet : 9/28/2021 3:44:23 AM
SamAccountName : KEWNLR2CU2APP01$
...
UserPrincipalName : host/kewnlr2cu2app01.amer.company.com@AMER.COMPANY.COM
PS C:\Users\spike_white> get-adcomputer kewnlr2cu2app01 -property msDS-KeyVersionNumber
DistinguishedName : CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com
DNSHostName : kewnlr2cu2app01.amer.company.com
...
msDS-KeyVersionNumber : 18
Of course, after this, the adcli output in the sssd_<domain>.company.com_log file will continue to show Kerberos pre-authentication errors, because now adcli update is using the old machine account password, while AD has the new machine account password:
(2021-09-28 4:13:42): [be[amer.company.com]] [ad_machine_account_password_renewal_done] (0x1000): --- adcli output start---
* Found realm in keytab: AMER.COMPANY.COM
* Found computer name in keytab: KEWNLR2CU2APP01
* Found service principal in keytab: host/kewnlr2cu2app01.amer.company.com
* Found host qualified name in keytab: kewnlr2cu2app01.amer.company.com
* Found service principal in keytab: host/KEWNLR2CU2APP01
* Found service principal in keytab: RestrictedKrbHost/KEWNLR2CU2APP01
* Found service principal in keytab: RestrictedKrbHost/kewnlr2cu2app01.amer.company.com
* Using fully qualified name: kewnlr2cu2app01.amer.company.com
* Using domain name: amer.company.com
* Using computer account name: KEWNLR2CU2APP01
* Using domain realm: amer.company.com
* Discovering domain controllers: _ldap._tcp.amer.company.com
* Sending NetLogon ping to domain controller: RDUDC16AMER04.amer.company.com
* Received NetLogon info from: RDUDC16AMER04.amer.company.com
* Discovering site domain controllers: _ldap._tcp.AMERAustin._sites.dc._msdcs.amer.company.com
* Sending NetLogon ping to domain controller: AUSDC16AMER34.amer.company.com
* Received NetLogon info from: AUSDC16AMER34.amer.company.com
* Wrote out krb5.conf snippet to /tmp/adcli-krb5-i7P6zR/krb5.d/adcli-krb5-conf-vkBoqT
! Couldn't authenticate as machine account: KEWNLR2CU2APP01: Preauthentication failed
adcli: couldn't connect to amer.company.com domain: Couldn't authenticate as machine account: KEWNLR2CU2APP01: Preauthentication failed
---adcli output end---
In summary, for some reason adcli update, after attempting to set the machine account password, thinks that it’s receiving a Kerberos authentication error. Actually, it has successfully set the machine account password in AD. Because it thinks that it did not successfully update the machine account password, it does not update the entries in the local /etc/krb5.keytab file.
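With domain logs reaching several GB at debug level 7, it can help to pull out only the adcli output blocks that contain an error. Below is a minimal Python sketch that does this on log text, relying on the start and end markers and the leading "!" on error lines as they appear in the excerpts above. The function names are hypothetical, and the sample is abbreviated from the output in this message.

```python
from typing import List

START_MARKER = "adcli output start---"
END_MARKER = "---adcli output end---"


def adcli_blocks(log_text: str) -> List[List[str]]:
    """Return each adcli output block as a list of lines, markers included."""
    blocks = []
    current = None
    for line in log_text.splitlines():
        if START_MARKER in line:
            current = [line]
        elif current is not None:
            current.append(line)
            if END_MARKER in line:
                blocks.append(current)
                current = None
    return blocks


def failed_blocks(log_text: str) -> List[List[str]]:
    """Keep only blocks in which adcli reported an error ('!' prefix)."""
    return [
        block for block in adcli_blocks(log_text)
        if any(line.lstrip().startswith("!") for line in block)
    ]


# Abbreviated from the failing run shown in this message.
SAMPLE_LOG = """\
(2021-09-28 3:44:23): [be[amer.company.com]] [ad_machine_account_password_renewal_done] (0x1000): --- adcli output start---
 * Generated 120 character computer password
 ! Cannot change computer password: Authentication error
---adcli output end---
"""

print(len(failed_blocks(SAMPLE_LOG)))  # prints 1
```

The first line of each returned block carries the sssd timestamp, which is the value to hand to the AD admins together with the DC name from the NetLogon lines.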
We have our AD admins examining the AD domain controller logs now (since we have an exact DC name, exact time and exact client FQDN above).
At this point, we’re unsure whether this is an adcli problem or an AD problem.
Does adcli update attempt to authenticate back to the same AD DC with the new password? Or does it randomly pick an AD DC to authenticate back to with the new password?
Spike White
On Tue, Sep 28, 2021 at 03:18:06PM -0500, Spike White wrote:
All,
We took Sumit’s advice and enabled sssd’s debug level 7 on the “domain” section of sssd.conf. On about 2300 non-prod Linux servers.
FYI – beware if you do this! We found occurrences where that sssd_amer.company.com_log was 8 GB after 24 hrs. So you’ll likely have to fine-tune your sssd logrotate file or even more drastic actions.
RECAP: Randomly on 0.24% of our Linux servers, after 30 days sssd will drop off the AD domain. We find this occurs during the automatic Kerberos Host Keytab renewal. The KVNO number in AD is one more than the latest KVNO number in /etc/krb5.keytab file.
Due to sssd debug level 7, we now have verbose ‘adcli update’ output in our sssd_<domain>.company.com_log files for two such culprits. Both show the same error condition. Here is example output:
(2021-09-28 3:44:23): [be[amer.company.com]] [ad_machine_account_password_renewal_done] (0x1000): --- adcli output start---
Found realm in keytab: AMER.COMPANY.COM
Found computer name in keytab: KEWNLR2CU2APP01
Found service principal in keytab: host/kewnlr2cu2app01.amer.company.com
Found host qualified name in keytab: kewnlr2cu2app01.amer.company.com
Found service principal in keytab: host/KEWNLR2CU2APP01
Found service principal in keytab: RestrictedKrbHost/KEWNLR2CU2APP01
Found service principal in keytab: RestrictedKrbHost/kewnlr2cu2app01.amer.company.com
Using fully qualified name: kewnlr2cu2app01.amer.company.com
Using domain name: amer.company.com
Using computer account name: KEWNLR2CU2APP01
Using domain realm: amer.company.com
Sending NetLogon ping to domain controller: AUSDC16AMER23.amer.company.com
Received NetLogon info from: AUSDC16AMER23.amer.company.com
Wrote out krb5.conf snippet to /tmp/adcli-krb5-HRsQ9K/krb5.d/adcli-krb5-conf-yBNrRI
Authenticated as default/reset computer account: KEWNLR2CU2APP01
Using GSS-SPNEGO for SASL bind
Looked up short domain name: AMERICAS
Looked up domain SID: S-1-5-21-1802859667-647903414-1863928812
Using fully qualified name: kewnlr2cu2app01.amer.company.com
Using domain name: amer.company.com
Using computer account name: KEWNLR2CU2APP01
Using domain realm: amer.company.com
Using fully qualified name: kewnlr2cu2app01.amer.company.com
Enrolling computer name: KEWNLR2CU2APP01
Generated 120 character computer password
Using keytab: FILE:/etc/krb5.keytab
Found computer account for KEWNLR2CU2APP01$ at: CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com
- Retrieved kvno '17' for computer account in directory: CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com
- Sending NetLogon ping to domain controller: AUSDC16AMER23.amer.company.com
- Received NetLogon info from: AUSDC16AMER23.amer.company.com
! Cannot change computer password: Authentication error
adcli: updating membership with domain amer.company.com failed: Cannot change computer password: Authentication error
---adcli output end---
Within 1.5 mins of the above, we receive errors in /var/log/messages as below:
Sep 28 03:45:51 kewnlr2cu2app01 sssd[ldap_child[288005]][288005]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Preauthentication failed. Unable to create GSSAPI-encrypted LDAP connection.
We verify in /etc/krb5.keytab file that the latest KVNO is still 17, while in AD the KVNO is now 18. Also, the time of the last password change in AD exactly corresponds to the above:
PS C:\Users\spike_white> get-adcomputer kewnlr2cu2app01 -Property 'PasswordLastSet'
DistinguishedName : CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com
DNSHostName : kewnlr2cu2app01.amer.company.com
...
Name : KEWNLR2CU2APP01
ObjectClass : computer
...
PasswordLastSet : 9/28/2021 3:44:23 AM
SamAccountName : KEWNLR2CU2APP01$
...
UserPrincipalName : host/kewnlr2cu2app01.amer.company.com@AMER.COMPANY.COM
PS C:\Users\spike_white> get-adcomputer kewnlr2cu2app01 -property msDS-KeyVersionNumber
DistinguishedName : CN=KEWNLR2CU2APP01,OU=Servers,OU=UNIX,DC=amer,DC=company,DC=com
DNSHostName : kewnlr2cu2app01.amer.company.com
...
msDS-KeyVersionNumber : 18
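The KVNO comparison above can be scripted on the client side. This is only a sketch: it assumes the plain `klist -k` layout in which the KVNO is the first column, and it takes the AD value (msDS-KeyVersionNumber) as an argument because how you fetch that from Linux depends on your environment. The function names are made up for illustration.

```shell
#!/bin/sh
# max_keytab_kvno: read `klist -k` output on stdin, print the highest KVNO.
# Header lines ("Keytab name:", "KVNO Principal", "----") are skipped
# because their first field is not a plain number.
max_keytab_kvno() {
    awk '$1 ~ /^[0-9]+$/ { if ($1 + 0 > max) max = $1 + 0 } END { print max + 0 }'
}

# compare_kvno KEYTAB_KVNO AD_KVNO -> in-sync | keytab-behind | keytab-ahead
compare_kvno() {
    if [ "$1" -eq "$2" ]; then
        echo "in-sync"
    elif [ "$1" -lt "$2" ]; then
        echo "keytab-behind"
    else
        echo "keytab-ahead"
    fi
}
```

With the values from this host, `compare_kvno "$(klist -k /etc/krb5.keytab | max_keytab_kvno)" 18` would print `keytab-behind`, which is the signature of the failure described here.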
Of course, after this, the adcli output in the sssd_<domain>.company.com_log file continues to show Kerberos pre-authentication errors, because adcli update is now using the old machine account password while AD has the new one:
(2021-09-28 4:13:42): [be[amer.company.com]] [ad_machine_account_password_renewal_done] (0x1000): --- adcli output start---
Found realm in keytab: AMER.COMPANY.COM
Found computer name in keytab: KEWNLR2CU2APP01
Found service principal in keytab: host/kewnlr2cu2app01.amer.company.com
Found host qualified name in keytab: kewnlr2cu2app01.amer.company.com
Found service principal in keytab: host/KEWNLR2CU2APP01
Found service principal in keytab: RestrictedKrbHost/KEWNLR2CU2APP01
Found service principal in keytab: RestrictedKrbHost/kewnlr2cu2app01.amer.company.com
Using fully qualified name: kewnlr2cu2app01.amer.company.com
Using domain name: amer.company.com
Using computer account name: KEWNLR2CU2APP01
Using domain realm: amer.company.com
Discovering domain controllers: _ldap._tcp.amer.company.com
Sending NetLogon ping to domain controller: RDUDC16AMER04.amer.company.com
Received NetLogon info from: RDUDC16AMER04.amer.company.com
Discovering site domain controllers: _ldap._tcp.AMERAustin._sites.dc._msdcs.amer.company.com
- Sending NetLogon ping to domain controller: AUSDC16AMER34.amer.company.com
Received NetLogon info from: AUSDC16AMER34.amer.company.com
Wrote out krb5.conf snippet to /tmp/adcli-krb5-i7P6zR/krb5.d/adcli-krb5-conf-vkBoqT
! Couldn't authenticate as machine account: KEWNLR2CU2APP01: Preauthentication failed
adcli: couldn't connect to amer.company.com domain: Couldn't authenticate as machine account: KEWNLR2CU2APP01: Preauthentication failed
---adcli output end---
In summary, for some reason adcli update believes it received a Kerberos authentication error after attempting to set the machine account password, when in fact it has successfully set the password in AD. Because it believes the update failed, it does not update the entries in the local /etc/krb5.keytab file.
Hi,
thank you for this extensive analysis.
We have our AD admins examining the AD domain controller logs now (since we have an exact DC name, exact time and exact client FQDN above).
The 'Authentication error' error is coming from the password changing operation itself. According to the related RFC (https://datatracker.ietf.org/doc/html/rfc3244) it means 'request fails due to an error in authentication processing'. Please let me know if your AD admins can find anything odd in the logs at this time.
At this point, we’re unsure whether this is an adcli problem or an AD problem.
Does adcli update attempt to authenticate back to the same AD DC with the new password? Or does it randomly pick an AD DC to authenticate back to, with the new password?
No, adcli does not try to authenticate back with the new password. But this might be a way out of this issue. If AD returns an error when trying to update a machine account password, adcli could try to authenticate with the new password to see whether it is accepted. But there still might be a race condition. With the old error when using UDP, adcli got the error code back before the AD DC had updated the password. So even when talking to the same DC to avoid replication issues, it is possible that the new password will not work immediately but only after a delay. So checking whether the new password is accepted after an error might be a workaround, but it might not work in all cases.
bye, Sumit
Spike White
On Wed, Aug 25, 2021 at 10:32 AM Spike White spikewhitetx@gmail.com wrote:
Sumit,
Thanks. BTW, this occurs primarily on RHEL7, RHEL8, Oracle Linux 7 and 8. (We have very few RHEL6 servers left, but some of them are sssd enabled.)
We verified (via tcpdump and Wireshark) that *L7/8 use TCP exclusively for this kpasswd operation, since we're aware of the UDP problem. So this is a new password change problem, not the old UDP-related problem.
BTW, we also verified that RHEL6 uses UDP exclusively for this kpasswd operation. Regardless of setting of udp_preference_limit in /etc/krb5.conf. That setting apparently is only for the Kerberos port (TCP/UDP 88), not for the kpasswd port.
Yes, we're very anxious to hear what our AD admins will tell us from their AD DC logs.
Spike
_______________________________________________ sssd-users mailing list -- sssd-users@lists.fedorahosted.org To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.o... Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
FYI -- update on this situation.
The AD DC logs were no help. They show the exact same response sent back for a good machine account password renewal as for a failed renewal.
One of the AD administrators has identified a particular AD DC NIC teaming configuration that they state has caused problems with Kerberos in the past. It's on a small percentage of their AD DCs and they will work to correct it. They will keep us apprised of progress.
I'm skeptical that's the underlying root cause -- for two reasons:
1. If Kerberos were sensitive to this, it should affect all Kerberos operations (Kerberos auth, etc.) and not just the kpasswd operations.
2. This is not occurring on our older RHEL6 and RHEL7 builds that are AD-integrated via our older commercial AD integration product. It's occurring only on our sssd-integrated builds.
At this point, we've turned off debug level 7 (it was filling up our /var/log filesystems and we have the verbose adcli update output from at least two failed clients). We're going to take the alternate suggestion of setting ad_maximum_machine_account_password_age to 0 (which stops sssd from updating the password) and run a cron job to do 'adcli update'.
We're wrapping this adcli update with tcpdump to get the exact kpasswd request/response packets, as well as wrapping it with KRB5_TRACE.
We want to call adcli update exactly as sssd calls it. From SOURCES/sssd-2.4.0/src/providers/ad/ad_machine_pw_renewal.c, this appears to be how sssd calls external program /usr/sbin/adcli to do its adcli update:
/usr/sbin/adcli update --verbose --domain=$AD_DOMAIN --host-keytab=/etc/krb5.keytab --host-fqdn=$FQDN --computer-password-lifetime=30
because we aren't doing any Samba stuff. Is that the correct invocation? We'll set computer-password-lifetime lower, say to 7, because we want renewals to happen more often and so catch failed updates sooner.
BTW, the packet capture on a successful machine account password renewal is only 8K, so that very targeted debug will not swamp our /var/log or /tmp filesystems.
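A rough sketch of what such a cron wrapper might look like follows. It uses only the adcli options quoted above (whether that matches sssd's own invocation is exactly the open question), and the function names, output directory and file naming are placeholders. The capture filter on ports 88 and 464 is an assumption; widen it if the LDAP traffic is of interest too.

```shell
#!/bin/sh
# adcli_update_cmd DOMAIN FQDN LIFETIME_DAYS -> prints the command line,
# built from the options quoted in the message above.
adcli_update_cmd() {
    printf '%s\n' "/usr/sbin/adcli update --verbose --domain=$1 --host-keytab=/etc/krb5.keytab --host-fqdn=$2 --computer-password-lifetime=$3"
}

# traced_adcli_update DOMAIN FQDN LIFETIME_DAYS OUTDIR   (needs root)
# Runs adcli update with a packet capture and a Kerberos library trace.
traced_adcli_update() {
    stamp=$(date +%Y%m%d-%H%M%S)
    # Capture Kerberos (88) and kpasswd (464) traffic in the background.
    tcpdump -i any -w "$4/adcli-$stamp.pcap" 'port 88 or port 464' 2>/dev/null &
    tcpdump_pid=$!
    sleep 2   # give tcpdump time to start capturing
    # KRB5_TRACE makes MIT Kerberos write its trace to the named file.
    KRB5_TRACE="$4/adcli-$stamp.krb5trace" \
        $(adcli_update_cmd "$1" "$2" "$3") > "$4/adcli-$stamp.log" 2>&1
    rc=$?
    sleep 2   # let trailing packets arrive before stopping the capture
    kill "$tcpdump_pid"
    wait "$tcpdump_pid" 2>/dev/null
    return "$rc"
}
```

A cron entry could then call something like `traced_adcli_update "$AD_DOMAIN" "$FQDN" 7 /var/tmp/adcli-trace` and keep the three files only when the return code is non-zero.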
Spike
Hi Spike,
I once saw such an issue on RHEL7. It was caused by a wrong SELinux context on the /etc/krb5.keytab file. That is, SSSD updated the password in AD, attempted to update /etc/krb5.keytab, and SELinux denied this attempt. The audit log will contain a denied entry if that is the case. Maybe it will help you.
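For anyone wanting to check this on their own hosts, a small sketch follows. It assumes the targeted policy labels the keytab with the type krb5_keytab_t, so verify that against your own policy first; the function names are hypothetical.

```shell
#!/bin/sh
# keytab_context FILE -> prints the SELinux context of FILE (GNU stat).
keytab_context() {
    stat -c %C "$1"
}

# keytab_context_ok CONTEXT -> exit status 0 if the type is krb5_keytab_t.
# Assumption: krb5_keytab_t is the expected type for /etc/krb5.keytab.
keytab_context_ok() {
    case "$1" in
        *:krb5_keytab_t:*|*:krb5_keytab_t) return 0 ;;
        *) return 1 ;;
    esac
}
```

Typical use would be `keytab_context_ok "$(keytab_context /etc/krb5.keytab)" || restorecon -v /etc/krb5.keytab`, and `ausearch -m avc -ts today` should show the denial if SELinux blocked the keytab write.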
Kind regards, Grigory Trenin
Grigory,
It's quite likely that it's something client-related like that. But I know it's not exactly that; we turn off SELinux. In the verbose log of adcli update on a failed renewal, it says:
! Cannot change computer password: Authentication error
adcli: updating membership with domain amer.dell.com failed: Cannot change computer password: Authentication error
While on a good renewal, the verbose adcli output says:
* Changed computer password
* kvno incremented to 110
Sumit informs me that this output:
! Cannot change computer password: Authentication error
means that adcli update received a response back from the AD DC that it's interpreting as a failed attempt to change the computer password. But I don't know what local components get traversed between the network and adcli (is it routed over dbus or polkit, for instance, so that a 100% full /tmp or /var would be a problem?)
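Since the good and bad cases differ in a few recognisable lines, the verbose output can be sorted mechanically. The sketch below matches only the message strings quoted in this thread; the function name and the result labels are invented for illustration.

```shell
#!/bin/sh
# classify_adcli_output: read verbose `adcli update` output on stdin.
# Prints one of: renewed | change-rejected | auth-failed | unknown
classify_adcli_output() {
    awk '
        /Cannot change computer password/          { state = "change-rejected" }
        /Couldn.t authenticate as machine account/ { if (state == "") state = "auth-failed" }
        /Changed computer password/                { state = "renewed" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

A host that reports `change-rejected` once and `auth-failed` on every later run would match the pattern seen here: the password changed in AD but the keytab was never updated.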
Spike
On Thu, Oct 7, 2021 at 12:15 PM Grigory Trenin gtrenin@gmail.com wrote:
Hi Spike,
I have once seen such an issue on RHEL7. It was caused by a wrong SELinux context on /etc/krb5.keytab file. That is, SSSD updated the password in AD, attempted to update /etc/krb5.keytab, and SELinux denied this attempt. Audit log will contain a denied entry if that is the case. Maybe it will help you.
Kind regards, Grigory Trenin
чт, 7 окт. 2021 г. в 20:02, Spike White spikewhitetx@gmail.com:
FYI -- update on this situation.
The AD DC logs are no help. They show the exact same response sent back for a good machine account password renewal as for a failed renewal.
One of the AD administrators has identified a particular AD DC NIC teaming configuration that they state has caused problems with Kerberos in the past. It's on a small percentage of their AD DCs and they will work to correct it. They will keep us apprised of progress.
I'm skeptical that's the underlying root cause -- for two reasons:
1. If Kerberos were sensitive to this, it should affect all Kerberos operations (Kerberos auth, etc.) and not just the kpasswd operations.
2. This is not occurring on our older RHEL6 and RHEL7 builds that are AD-integrated via our older commercial AD integration product. It's occurring only on our sssd-integrated builds.
At this point, we've turned off debug level 7 (it was filling up our /var/log filesystems, and we have the verbose adcli update output from at least two failed clients). We're going to take the alternate suggestion of setting ad_maximum_machine_account_password_age to 0 (which stops sssd from updating the password) and run a cron job to do 'adcli update'.
We're wrapping this adcli update with tcpdump to get the exact kpasswd request/response packets, as well as wrapping it with KRB5_TRACE.
We want to call adcli update exactly as sssd calls it. From SOURCES/sssd-2.4.0/src/providers/ad/ad_machine_pw_renewal.c, this appears to be how sssd calls the external program /usr/sbin/adcli to do its adcli update:
/usr/sbin/adcli update --verbose --domain=$AD_DOMAIN --host-keytab=/etc/krb5.keytab --host-fqdn=$FQDN --computer-password-lifetime=30
That leaves out anything Samba-related, because we aren't doing any Samba stuff. Is that the correct invocation? We'll set computer-password-lifetime lower, say to 7, because we want renewals to happen more frequently so that we catch failed updates sooner.
BTW, the packet capture on a successful machine account password renewal is only 8K, so that very targeted debug will not swamp our /var/log or /tmp filesystems.
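As a rough sketch of the cron-driven wrapper described above, something like the following might work. It assumes sssd's own renewal is already disabled, the job runs as root, and tcpdump is installed. AD_DOMAIN, FQDN, TRACE_DIR, the function names and the script path in the comment are placeholders, not values from our environment.

```shell
#!/bin/sh
# Sketch only: one traced 'adcli update' run, meant to be called from cron as root.
# AD_DOMAIN, FQDN and TRACE_DIR are placeholders, not values from this thread.

# Print the adcli command line quoted above; $1 is the password lifetime in days.
adcli_update_cmd() {
    echo "/usr/sbin/adcli update --verbose --domain=${AD_DOMAIN}" \
         "--host-keytab=/etc/krb5.keytab --host-fqdn=${FQDN}" \
         "--computer-password-lifetime=${1:-30}"
}

traced_update() {
    dir="${TRACE_DIR:-/var/tmp/adcli-trace}"
    stamp="$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" || return 1

    # Capture only Kerberos (88) and kpasswd (464) traffic, in the background.
    tcpdump -i any -w "$dir/kpasswd-$stamp.pcap" 'port 88 or port 464' &
    tcpdump_pid=$!
    sleep 2   # give tcpdump a moment to start capturing

    # KRB5_TRACE makes the MIT Kerberos library log its steps to this file.
    # $cmd is left unquoted on purpose so it splits into separate arguments.
    cmd="$(adcli_update_cmd "$1")"
    KRB5_TRACE="$dir/krb5-trace-$stamp.log" $cmd > "$dir/adcli-$stamp.log" 2>&1
    rc=$?

    sleep 2   # let the last packets reach the capture file
    kill "$tcpdump_pid" 2>/dev/null
    wait "$tcpdump_pid" 2>/dev/null
    return "$rc"
}

# Run only when asked to, e.g. from cron:  /usr/local/sbin/adcli-traced-update run 7
if [ "${1:-}" = "run" ]; then
    traced_update "${2:-30}"
fi
```

Each run leaves three timestamped files (pcap, Kerberos trace, adcli output), so a failed renewal can be lined up against a good one from the same client.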
Spike
sssd-users mailing list -- sssd-users@lists.fedorahosted.org To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.o... Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
On 10/7/21 12:01, Spike White wrote:
We want to call adcli update exactly as sssd calls it. From SOURCES/sssd-2.4.0/src/providers/ad/ad_machine_pw_renewal.c, this appears to be how sssd calls external program /usr/sbin/adcli to do its adcli update:
/usr/sbin/adcli update --verbose --domain=$AD_DOMAIN --host-keytab=/etc/krb5.keytab --host-fqdn=$FQDN --computer-password-lifetime=30
because we aren't doing any Samba stuff.
Question: how would Samba stuff be relevant to updating the Kerberos ticket using adcli?
Patrick,
Oh, I know the answer to that one!
If you're also binding Samba (winbindd or equivalent) to AD, then when sssd renews the machine account password monthly, it also has to inform Samba of the new machine account password, so that Samba can stash it away in its secrets store.
I believe that Samba even provides some helper script or program that you can call, passing in the new monthly password. So adcli update could call such a Samba helper script.
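For illustration, my understanding is that adcli exposes this through its --add-samba-data and --samba-data-tool options, with Samba's net utility as the tool. Treat those option names and the /usr/bin/net default as assumptions to verify against adcli(8) on your release. The function name and the SAMBA_NET variable below are made up.

```shell
# Print the 'adcli update' command line, optionally with the Samba options.
# $1 = "samba" to add them; anything else (or nothing) leaves them out.
# The Samba option names and the net path are assumptions; check adcli(8).
adcli_update_cmd_with_samba() {
    cmd="/usr/sbin/adcli update --verbose --domain=${AD_DOMAIN}"
    cmd="$cmd --host-keytab=/etc/krb5.keytab --host-fqdn=${FQDN}"
    cmd="$cmd --computer-password-lifetime=30"
    if [ "${1:-}" = "samba" ]; then
        cmd="$cmd --add-samba-data --samba-data-tool=${SAMBA_NET:-/usr/bin/net}"
    fi
    echo "$cmd"
}
```

Without the "samba" argument it prints the same command line quoted earlier in the thread.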
Spike