Hi James,
I'll try to include questions/comments/suggestions in-line below.
We have an air-gapped network of RHEL7 hosts that use sssd to perform
PKINIT (smartcard + Kerberos) authentication against Windows Server
2016 domain controllers.
Setting this up properly entailed setting pkinit_anchors, pkinit_pool,
pkinit_cert_match, et al. in the krb5.conf file, and enabling
smartcard authentication in gdm. It also entailed adding individual
certificates to each user object’s userCertificate property, which our
Windows guys grumbled about.
And I'm guessing the AD servers have the root and issuing CA certificates imported and
trusted, right?
Since your problem is intermittent, I would guess that missing CA certificates aren't
your issue. But it can be a common one (at least during initial setup or during CA
moves/retirements).
(The way Windows performs PKINIT is to find the certificate on the
card that has a Microsoft User Principal Name X509v3 Subject
Alternative Name, extract that value, and then look for the AD user
object that has the same userPrincipalName. But the version of sssd
that shipped with RHEL7 can’t do that SAN/userPrincipalName matching.)
For the most part, this has worked, and worked well. Once again, sssd
has been an invaluable tool.
But.
For some accounts, smartcard authentication does not work, *even
though* you can use kinit to perform PKINIT against the card (e.g., if
you login via password authentication, then insert the smartcard once
you have a shell window to play with).
When you're testing with kinit, are you running something like this:
kinit -X X509_user_identity=PKCS11:module_name=/usr/lib64/opensc-pkcs11.so principal@REALM
Just want to make sure I'm thinking of the correct test here.
For the accounts where smartcard authentication works, after you enter
your username in gdm, the card blinks for a few seconds, and then you
are prompted to enter the PIN as follows:
<CN> PIN:
…where <CN> is the value of the CN= field of the certificate Subject
of the certificate on the smartcard that contains the Microsoft UPN
SAN. E.g.:
LASTNAME.FIRSTNAME.123456789 PIN:
For the accounts where smartcard authentication fails, after you enter
your username in gdm, the card blinks for a few seconds, and then you
are prompted to enter the PIN as follows:
PIN for Smartcard:
That PIN prompt is the kiss of death: even if you enter the correct
PIN, authentication will always fail.
This may be an indication that SSSD is timing out during a step, but I'm not 100%
sure.
We know that our Kerberos configuration (e.g., pkinit_cert_match)
correctly yields one (and only one) candidate certificate from the
smartcard, which is the correct certificate:
pkinit_cert_match = &&<SAN>.*@.*
And running kinit (with PKINIT) against the smartcard works just fine.
But logins fail for some users and not others. Which almost certainly
means that something is derailing sssd. But it’s not obvious what it
is. We’ve double-checked that the userCertificate objects are correct
in AD (that is, they match the smartcard).
And this makes me think that SSSD is timing out while trying to talk to the AD server for
Kerberos communications.
Even more confusingly, the accounts for which smartcard authentication
works versus doesn’t work can change over time. For example, a few
weeks ago, my own account worked for smartcard login; now it doesn’t.
But we know we made no configuration changes and applied no package
updates to the host.
I have also had the situation where I got the “PIN for Smartcard” gdm
prompt, rebooted the host, and then got the “<CN> PIN” gdm prompt.
That almost implies an sssd caching issue, or inconsistent
data/behavior between our (two) domain controllers.
Can you try setting a couple timeouts to see if this helps? I'd suggest trying the
following:
1. Add a Kerberos authentication timeout to the [domain/whatever] section of sssd.conf:
krb5_auth_timeout = 60
2. Add a p11_child timeout to the [pam] section (less likely to be your issue, judging
from the symptoms):
p11_child_timeout = 60
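To show where those two settings would sit, here is a minimal sssd.conf fragment. This is only a sketch: "example.com" is a placeholder for your actual domain section name, and the rest of your existing configuration is omitted.

```ini
# /etc/sssd/sssd.conf (fragment only; "example.com" is a placeholder)
[domain/example.com]
# Seconds before an online Kerberos authentication request is aborted
krb5_auth_timeout = 60

[pam]
# Seconds to wait for p11_child (the smartcard helper) to finish
p11_child_timeout = 60
```

SSSD only reads its configuration at startup, so restart it (systemctl restart sssd) before retesting.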
Again, these are air-gapped systems, so I can provide no logs; we are
going to have to slog through the sssd logs and figure it out on our
own.
Can you give version numbers in case there were known bugs we might be able to identify
here?
One other question related to being air-gapped: do the certificates on the cards have
OCSP/CRL URLs set? If so, SSSD may be trying to check them unless that check is disabled.
So, if your certificates set an OCSP URL, you may need to disable the OCSP check. You can
test this with something like:
3. Disable OCSP verifications in the [sssd] section of the sssd.conf file:
certificate_verification = no_ocsp
FYI, in RHEL8 we have "soft" fail options for OCSP/CRL, but those didn't
make it into RHEL7's version of SSSD.
certificate_verification = soft_ocsp,soft_crl
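If you want to check whether a card's certificate carries OCSP or CRL URLs at all, something like the following should work. This is a rough sketch that assumes the certificate has already been exported from the card to a PEM file (for example with OpenSC's pkcs11-tool); the function name cert_revocation_urls and the file name in the usage line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the OCSP and CRL URLs embedded in a PEM certificate.
# Usage: cert_revocation_urls /path/to/cert.pem
cert_revocation_urls() {
    cert=$1
    echo "OCSP responder URI(s):"
    # Prints nothing if the certificate has no OCSP responder in its AIA extension
    openssl x509 -in "$cert" -noout -ocsp_uri
    echo "CRL distribution points:"
    # Pull the CRL Distribution Points extension out of the text dump, if present
    openssl x509 -in "$cert" -noout -text \
        | grep -A 4 'CRL Distribution Points' \
        || echo "(none found)"
}
```

For example: cert_revocation_urls ./card-cert.pem. If either list is non-empty and those hosts are unreachable (or flaky) inside the air-gapped network, the no_ocsp test above becomes more interesting.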
Questions for the list:
* Does this sound familiar to anyone? Have you already been down this
path? If so, what did you discover?
Maybe. I'm hoping this is a simple timeout issue and the suggestions above work. From
most of your symptoms, I think it may be the Kerberos timeout issue. The OCSP issue is
probably not your problem, but I've heard of (not seen personally) issues with
unreliable network connectivity to OCSP servers. So if you have something in your
air-gapped network that is acting as an OCSP server, it may be something to look into.
* sssd logging can be quite voluminous (particularly at higher
debugging levels), to the point where I fear I might miss the needle
in the haystack that is indicating the problem. Can anyone provide
some tips on specific areas where I should focus?
Yes, there is a LOT of data in the sssd logs, especially when using "debug_level = 9".
I usually start with p11_child.log to make sure that SSSD properly identified the
card. IIRC, this is also where you should see an OCSP failure cause a certificate on the
card to be rejected. If it finds the certificate, you might see Kerberos timeouts in the
krb5_child.log file. After that, you can look through the sssd_pam.log file.
Another method I've used to sort through the logs for smart card related issues is to
find the timestamp of a failed attempt in /var/log/secure (if set up) or the journal,
grep for it in /var/log/sssd, and sort through the matches.
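The timestamp approach can be sketched as a small shell helper. The function name sssd_grep_ts is made up, and the sketch assumes the timestamp text appears in the SSSD logs in the same form you copied it (check one log line first, since the formats of /var/log/secure and the SSSD logs may differ slightly on your hosts).

```shell
#!/bin/sh
# Hypothetical helper: list every SSSD log line that contains a given timestamp.
# Usage: sssd_grep_ts 'TIMESTAMP' [LOGDIR]
sssd_grep_ts() {
    ts=$1
    logdir=${2:-/var/log/sssd}   # default SSSD log directory
    # -r: search all files under logdir, -n: show line numbers,
    # -F: treat the timestamp as a fixed string, not a regex
    grep -rnF -- "$ts" "$logdir"
}
```

For example, sssd_grep_ts 'Mar  3 10:15:4' (a made-up timestamp, with the seconds truncated to catch a ten-second window) prints matching lines prefixed with the log file name, so you can see at a glance whether p11_child.log, krb5_child.log, or sssd_pam.log was active at that moment.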
Thanks in advance for any tips/advice.
I hope that helps,
Scott