On Wed, Apr 15, 2015 at 10:58:12PM +0200, Jean-Baptiste Denis wrote:
> A shot in the dark but maybe worth a try - can you try disabling the
> cleanup task?
>
> ldap_purge_cache_timeout = 0
>
> in the [domain] section. The cleanup might cause some groups with no
> members to be removed, I wonder if that is your case..
Just did this, but it didn't work.
Maybe I don't understand the purpose of this test, but the result does not
surprise me because the ldap cache is empty at that time. As Thomas stated in
the initial message of this thread, our actual test case involves:
. /etc/init.d/sssd stop
. rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
. /etc/init.d/sssd start
before running anything else. So I guess the ldap backend has no need to be
cleaned up at this particular time.
I was suspecting a race condition because, like the rest of SSSD,
the cleanup task is asynchronous. I was suspecting the following might
have happened:
- initgroups starts:
    - users are written to the cache
    - groups are written to the cache but not yet linked to the user
      objects
- cleanup task starts:
    - cleanup task removes the group objects because they are
      "empty". It shouldn't happen because the cleanup task should
      only remove expired entries, but IIRC Lukas saw a similar
      race condition elsewhere.
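
To make the suspected interleaving concrete, here is a toy Python model of it. This is only a sketch under assumptions: the Cache class, its method names and the "drop groups with no members" rule are hypothetical stand-ins chosen for illustration, not SSSD code and not a claim about what the cleanup task really does.

```python
class Cache:
    """Hypothetical stand-in for the sysdb cache; not SSSD code."""

    def __init__(self):
        self.users = set()
        self.groups = set()
        self.members = {}  # group name -> set of user names

    def store_user(self, user):
        self.users.add(user)

    def store_group(self, group):
        self.groups.add(group)
        self.members.setdefault(group, set())

    def link(self, user, group):
        # Linking only succeeds if the group object still exists.
        if group in self.groups:
            self.members[group].add(user)

    def cleanup(self):
        # Assumed (buggy) behaviour: drop every group that has no members.
        for group in [g for g in self.groups if not self.members[g]]:
            self.groups.discard(group)
            del self.members[group]

    def groups_of(self, user):
        return sorted(g for g in self.groups if user in self.members[g])


def initgroups(cache, user, groups, cleanup_in_between):
    """Write the user and groups, optionally let cleanup run, then link."""
    cache.store_user(user)
    for g in groups:
        cache.store_group(g)
    if cleanup_in_between:
        cache.cleanup()  # the cleanup task fires before the links are written
    for g in groups:
        cache.link(user, g)
    return cache.groups_of(user)


print(initgroups(Cache(), "alice", ["proj1", "proj2"], cleanup_in_between=False))
print(initgroups(Cache(), "alice", ["proj1", "proj2"], cleanup_in_between=True))
```

In this model the second call returns an empty list, which is the kind of symptom described in this thread: the user exists in the cache, but the supplementary groups are gone.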
If I run the test case again without
restarting sssd and without cleaning up the cache, I get no problems for the next
jobs (maybe until the next ldap purge). I think that this is exactly how we first
encountered the problem: sometimes, some jobs were failing with a permission
denied error while accessing a directory owned by one of the user's supplementary
groups. The instrumented slurmd code showed us that initgroups was not
correctly getting the secondary groups, and the sssd backend log showed some
purge activity, if I remember correctly (this needs confirmation).
In a previous message, you said :
> I think this means the frontend (responder) either checks too soon or
> the back end wrote incomplete data.
We are not 100% sure that we've found the right place to look at, but each time
we instrumented the code to print the number of groups, we've got the correct
answer.
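
For anyone who wants to do the same check from outside slurmd, a minimal sketch in Python follows. It asks NSS (and therefore sssd, if it is configured in nsswitch.conf) for a user's group list, much like an initgroups call would. The function name groups_for is hypothetical, and "root" is used only as an example user that exists on most systems.

```python
import os
import pwd


def groups_for(username):
    """Return the group IDs NSS reports for username (primary GID included)."""
    primary_gid = pwd.getpwnam(username).pw_gid
    return os.getgrouplist(username, primary_gid)


# Example user only; replace with the account that shows the problem.
gids = groups_for("root")
print(f"root: {len(gids)} group(s): {gids}")
```

Running this in a loop right after the stop / rm / start sequence would show whether the group count is ever short of the expected one.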
Maybe you could show us where to look exactly for :
- where the backend is writing the groups data to the sysdb cache
So the operation that evaluates what groups the user is a member of is
called initgroups. IIRC you're using the rfc2307 (non-bis) schema, so
the initgroups request that you run starts at
src/providers/ldap/sdap_async_initgroups.c:385 in function
sdap_initgr_rfc2307_send() and ends at sdap_initgr_rfc2307_recv()
- where the backend is signaling to the responder that the cache has
been updated
The schema-specific request is the one I listed above; it then
returns to the generic LDAP code in ldap_common.c. The signaling
over sbus (the D-Bus protocol used over a UNIX socket) happens in
sdap_handler_done(), in particular in be_req_terminate().
- where the responder is aware that it can now check the cache to get
the answer
This is done in src/responder/common/responder_dp.c. The request is
sent with sss_dp_get_account_send().
This code is a bit complex, because concurrent requests are just added to
a queue in sss_dp_issue_request() if the corresponding request is already
found in the rctx->dp_request_table hash table. But the first request that
finishes would receive an sbus message from the provider in
sss_dp_internal_get_done(). Then it would iterate over the queue of
requests and mark them as done or failed.
The callback that should be invoked by this generic NSS code is
nss_cmd_getby_dp_callback().
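
The queueing described above can be sketched with a toy Python model. Everything here is an assumption made for illustration: RequestTable, issue() and done() are hypothetical names that only mimic the idea of a table of in-flight requests, and none of it is taken from the SSSD sources.

```python
class RequestTable:
    """Toy model of coalescing identical in-flight requests; not SSSD code."""

    def __init__(self, send_to_backend):
        self.send_to_backend = send_to_backend  # called once per distinct key
        self.pending = {}  # request key -> list of waiting callbacks

    def issue(self, key, callback):
        if key in self.pending:
            # An identical request is already in flight: just queue up.
            self.pending[key].append(callback)
            return
        self.pending[key] = [callback]
        self.send_to_backend(key)

    def done(self, key, ok):
        # The backend replied: notify every queued caller with the same result.
        for callback in self.pending.pop(key, []):
            callback(ok)


sent = []
results = []
table = RequestTable(sent.append)
table.issue(("initgroups", "alice"), lambda ok: results.append(("first", ok)))
table.issue(("initgroups", "alice"), lambda ok: results.append(("second", ok)))
table.done(("initgroups", "alice"), True)
print(sent)     # the backend was asked only once
print(results)  # both callers were notified
```

The point of the model is that every queued caller is told "done" at the same moment, so if the data in the cache is incomplete at that moment, all of them see the same incomplete answer.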
- where the responder is actually getting the data from the sysdb
cache
src/responder/nss/nsssrv_cmd.c, in particular
nss_cmd_initgroups_search() and the function check_cache().