On 04/16/2015 12:31 PM, Jean-Baptiste Denis wrote:
> No, it shouldn't be. The whole backend request should run and only then the
> backend should signal to frontend to re-check the cache. That's why I was
> suspecting the cleanup task, it's asynchronous.
I think I've got a test case that doesn't involve slurm. It is quite reproducible
on my machine. Since it looks like a race, you may need to tweak the parameters
of the python script.
The basic idea is to run a bunch of processes, each of which waits a short
amount of time before calling the initgroups libc function for a specific user.
You have to log in as root rather than use sudo, to prevent the sssd cache from
being populated before the test is started. You also *need* to clean up the sssd
state before running the test.
usage:
## log in as root
## check the number of secondary groups for a user, using id for example
# id jbdenis
uid=21489(jbdenis) gid=110(sis) groups=110(sis),3044(CIB),19(floppy),1177(dump-projets),56(netadm),3125(vpn-ssl-admin)
Here I've got 5 secondary groups (sis is my primary group)
## !! VERY IMPORTANT !! clean up sssd state
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* && /etc/init.d/sssd start
## run this program
# python initgroups.py jbdenis 110 5 24 200
wrong number of secondary groups in process 17145 : 0 instead of 5 (sleep 55ms)
wrong number of secondary groups in process 17149 : 0 instead of 5 (sleep 55ms)
2/24 failed
# first parameter is a login
# second parameter is your primary gid (could be anything)
# third parameter is your number of secondary groups
# fourth parameter is the number of processes you want to run concurrently
# the last parameter is the maximum delay in milliseconds before calling
# initgroups (the delay is randomized up to this maximum)
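
The script itself is not reproduced in this message. As a rough illustration
only, a script taking these five parameters could be sketched as below. This is
an assumption about its general shape, not the actual initgroups.py: the helper
names count_secondary, worker and main are hypothetical, and the sketch relies
only on os.initgroups and os.getgroups from the Python standard library. It has
to run as root, because initgroups changes the supplementary groups of the
calling process.

```python
import os
import random
import sys
import time
from multiprocessing import Process, Queue


def count_secondary(groups, primary_gid):
    """Count distinct supplementary groups, excluding the primary gid."""
    return len(set(groups) - {primary_gid})


def worker(login, gid, max_delay_ms, results):
    # Sleep a random amount so the processes query sssd at staggered times.
    delay_ms = random.randint(0, max_delay_ms)
    time.sleep(delay_ms / 1000.0)
    try:
        os.initgroups(login, gid)  # requires root privileges
        found = count_secondary(os.getgroups(), gid)
    except OSError:
        found = -1  # report the failure instead of leaving the parent waiting
    results.put((os.getpid(), found, delay_ms))


def main(argv):
    login = argv[1]
    gid, expected, nprocs, max_delay_ms = (int(a) for a in argv[2:6])
    results = Queue()
    procs = [
        Process(target=worker, args=(login, gid, max_delay_ms, results))
        for _ in range(nprocs)
    ]
    for p in procs:
        p.start()
    outcomes = [results.get() for _ in procs]
    for p in procs:
        p.join()
    failed = 0
    for pid, found, delay_ms in outcomes:
        if found != expected:
            failed += 1
            print("wrong number of secondary groups in process %d : "
                  "%d instead of %d (sleep %dms)"
                  % (pid, found, expected, delay_ms))
    print("%d/%d failed" % (failed, nprocs))
    return failed


if __name__ == "__main__":
    if len(sys.argv) == 6:
        sys.exit(1 if main(sys.argv) else 0)
    print("usage: initgroups.py LOGIN GID NGROUPS NPROCS MAX_DELAY_MS")
```

Each child process reports its own result through the queue, so a child that
sees an empty group list shows up as a failure without affecting the others.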
I've got good results with 24 processes and a randomized delay of up to 200ms
after startup. Those parameters probably depend on the machine you're running
the script on. You may have to run this test multiple times before
triggering the bug.
I'm unable to reproduce the bug when I use a delay of 0, and I think that's why
we could reproduce it with our initial test case.
I really hope that you can reproduce the bug on your side.
Thank you for your help,
Jean-Baptiste