Adding the list since Sumit appears to be busy. The info is anonymized so it should be ok. Hopefully, the gz file makes it through.
From: Galen Johnson
Sent: Thursday, September 21, 2017 5:36 PM
To: Sumit Bose
Cc: Philip Holman
Subject: sssd email login performance
I'm finally getting a chance to follow up on the email thread (of the same title) from the sssd list. We've seen some delays (multi-second) for auth requests when users use their email address versus their ID. I've attached a tar file with several log files. Phil may need to explain the summary file if you have any questions about it. We are running CentOS 7.4 now but I'm fairly certain that it's the same binaries as RHEL 7.4. These logs were taken while on 7.3. I noticed that sssd bumped to 1.15 with 7.4.
Some outstanding questions we have are:
1. The cache appears to not be used for the email attribute. Why is this not used?
2. We're also curious why the ldap requests add 2 seconds when performing the same query from the command-line returns almost immediately.
3. Is it possible to have SSSD ignore the domain and just immediately look up the address? We see "is_email_from_domain" in the domain log (reflected in the nss log). We checked the man pages and nothing really jumped out as a config option.
It should be noted that we also moved the sssd db cache to tmpfs (per a blog from Jakub).
Thanks for any insight
Phil's analysis follows:
To wrap up, I took one more look at one of the very slow email logins to pull out a trace of what it was doing. The attached files are the log snippets with line breaks marking off the incoming requests to make it more clear what each module was servicing when. The summary.txt shows the summarized entry for the connection and also gives an abridged combined view of the logs marking where the 7 seconds appear to have gone. So this seemed enough info to share if we have the opportunity for a consult with someone.
The short version is that roughly 1 second went to the bind that tests the user, but the other 6 appear to have been the result of interacting with local caches rather than the DCs. So that makes the cache files and related configuration look suspicious. It also makes more sense that our earlier checks (against logs or live tests) of the Exnet interactions have failed to show any latency issues on those steps.
Possibly the fiddling we've already done with the cache files and cache config resolved this, but it is probably still worth passing this along to someone knowledgeable who might be able to explain what about the setup likely made everything go sideways. Otherwise, we might be facing some kind of build-up pattern where it will always look rosy after a restart and gradually degrade over time as state builds up.
It might also be a good idea to bounce and clear out sssd/pam state on the weekly restarts just to protect against any possible build-up (unless we want to intentionally avoid that for now to see if it does degrade over time).
I have a storage appliance that needs local passwd/group files loaded
onto it, which need to match the entries we get by using sssd's
ldap_id_mapping feature. So I need some way to enumerate or synthesize
passwd/group entries, for every user/group object in our domain, using
LDIF dumps from AD that includes all users/groups, along with their
respective objectSid attributes.
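As a starting point for working from the LDIF dumps, here is a minimal Python sketch that turns a binary objectSid into its string form and splits off the RID. It assumes the dump carries the attribute base64-encoded (the usual `objectSid:: <base64>` LDIF form for binary values); the function names are made up for illustration and are not part of sssd.

```python
import base64
import struct


def sid_bytes_to_str(raw: bytes) -> str:
    """Convert a binary Windows SID to the S-R-A-S1-S2-... string form."""
    if len(raw) < 8:
        raise ValueError("SID is too short")
    revision = raw[0]
    sub_count = raw[1]
    if len(raw) != 8 + 4 * sub_count:
        raise ValueError("SID length does not match sub-authority count")
    # The identifier authority is a 48-bit big-endian integer.
    authority = int.from_bytes(raw[2:8], "big")
    # Each sub-authority is a 32-bit little-endian integer.
    subs = struct.unpack("<%dI" % sub_count, raw[8:])
    return "S-%d-%d" % (revision, authority) + "".join("-%d" % s for s in subs)


def sid_from_ldif_value(b64_value: str) -> str:
    """Decode the base64 payload of an 'objectSid::' LDIF line."""
    return sid_bytes_to_str(base64.b64decode(b64_value))


def split_sid(sid: str):
    """Split a string SID into (domain SID, RID)."""
    domain_sid, _, rid = sid.rpartition("-")
    return domain_sid, int(rid)
```

The domain SID and RID pair is what any ID-mapping step (sssd's or a re-implementation) would need as input.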
We know (from experience, and from discussion on this list) that
enabling enumeration in sssd is problematic, so that's out.
I could just issue individual getpwnam()/getgrnam() calls for every
user/group object, and let sssd synthesize the entries. But this would
require careful tuning of sssd's cache configuration options to avoid
significant delays, and even then, this would pound our AD domain
controllers with thousands and thousands of lookup requests every time
we regenerate the synthesized passwd/group files (which will probably
From digging around in the sssd source code, I see that sssd has a
SSS_NSS_GETIDBYSID API call that looks to be exactly what I need. But
it's not clear to me whether that's a public or private API, and
additionally, it looks like I'd be limited to C for my implementation,
as I see no other language bindings for those functions.
Has anyone already rolled (Python, Ruby, Perl, et al.) bindings for
sssd's API calls, specifically the ID-SID mapping calls?
One potential option would be to just re-implement sssd's id mapping
code in Python. I could "cheat" in our implementation, because I know
that the only options that vary across our domains are
(ldap_idmap_range_max, ldap_idmap_range_min, ldap_idmap_range_size).
But re-implementation opens the door for a subtle error that would
cause my mapping code to return different results from sssd in some
corner cases, which I definitely don't want. So leveraging sssd's
SSS_NSS_GETIDBYSID API call would be best… if that's possible.
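To make the re-implementation risk concrete, here is a rough Python sketch of only the range arithmetic, under my understanding of the ID MAPPING section of the sssd man pages: the ID space between ldap_idmap_range_min and ldap_idmap_range_max is cut into slices of ldap_idmap_range_size, each domain gets one slice, and the RID is added to the start of that slice. The slice index is taken as an input here; sssd picks it by hashing the domain SID, and that step (plus any collision handling) is deliberately not reproduced. The default values shown are what I believe the documented defaults to be and should be checked against the installed version.

```python
def slice_count(range_min=200000, range_max=2000200000, range_size=200000):
    """Number of slices available in the configured ID range."""
    return (range_max - range_min) // range_size


def map_rid_to_id(rid, slice_index,
                  range_min=200000, range_max=2000200000, range_size=200000):
    """Map a RID to a POSIX ID inside a given slice (assumed arithmetic only).

    slice_index is assumed to be already known for the domain; choosing it
    the way sssd does is outside this sketch.
    """
    if not 0 <= slice_index < slice_count(range_min, range_max, range_size):
        raise ValueError("slice index outside the configured range")
    if not 0 <= rid < range_size:
        raise ValueError("RID does not fit in a single slice")
    return range_min + slice_index * range_size + rid
```

Even this small piece shows where corner cases hide: a RID at or above the slice size, or a slice choice that differs from sssd's, would silently produce a different ID.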
Another option would be to bypass the API and talk directly to the NSS
responder via its listening socket, which is easy enough to do in
other languages. But this would require me to speak the protocol
exactly the way sssd expects, and any API changes would break my code.
I'm attempting to enable LDAP server TLS certificate validation with
"ldap_tls_reqcert = demand". However, when I set that value to anything
other than "never", sssd does not work. By that I mean sssd will start
as normal but no ID lookups are successful and I see "Input/output
error" in the log. This occurs regardless of what CA certificate chain
I give it (via ldap_tls_cacert). I have even tried using a known
working chain that I use to access yum repos which uses TLS certificates
from the same CA as our Active Directory.
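One way to narrow this down outside of sssd is to check whether the CA file validates the directory server's certificate at all. The sketch below uses Python's standard ssl module. It assumes the server also listens for LDAPS on port 636 (sssd may be using StartTLS on 389 instead), and the host name and CA path in the usage note are placeholders.

```python
import socket
import ssl


def make_strict_context(cafile=None):
    """Build a context that requires a valid chain and a matching host name."""
    # With cafile=None the system trust store is used instead.
    return ssl.create_default_context(cafile=cafile)


def check_ldaps_certificate(host, cafile, port=636, timeout=5.0):
    """Try a TLS handshake and report whether certificate validation passed."""
    ctx = make_strict_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return f"ok: negotiated {tls.version()}"
    except ssl.SSLCertVerificationError as exc:
        # verify_message says why the chain or host name was rejected.
        return f"verification failed: {exc.verify_message}"
    except OSError as exc:
        return f"connection failed: {exc}"
```

For example, `print(check_ldaps_certificate("dc1.example.com", "/etc/pki/tls/certs/ad-chain.pem"))` with a real DC name and chain file. A verification failure here would point at the chain or the host name in the certificate rather than at sssd itself.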
HPC Systems Engineer
Information Technology Services - WSU