On Wed, Apr 13, 2016 at 10:52:15AM +1000, jupiter wrote:
Hi Jakub,
Thanks for your response; please see the following embedded comments.
On Tue, Apr 12, 2016 at 6:24 PM, Jakub Hrozek <jhrozek(a)redhat.com> wrote:
> On Tue, Apr 12, 2016 at 11:03:47AM +1000, jupiter wrote:
> > Hi,
> >
> > We are running sssd version 1.12.4-47 on CentOS 6. It works fine in
> > general, but from time to time some nodes list all user IDs as
> > "nobody",
>
> Was this problem happening only on an NFS share..?
>
I don't think it is an NFS issue; it is an SSSD issue.
>
> > calling id username immediately returned "No such user",
>
> Hmm, I guess not, this sounds like a generic issue, if even id
> couldn't find the user.
>
The user is fine; I can run "id username" on another healthy node without
any problems.
>
> > it looks
> > the id went to cache and did not contact to the LDAP.
>
> Please note that if the user was looked up at least once before, then
> even if SSSD couldn't contact the server for one reason or another, it
> should have returned entries from the cache.
>
Once again, the user ID is fine; we can verify that from other healthy nodes.
Besides, once the node was fixed by adding debug_level = 6, everything was
back to normal.
> >
> > On one occasion, I added debug_level = 6 to sssd.conf and restarted sssd;
> > the "nobody" was gone and id username returned the correct LDAP user ID.
> > It did not make any sense to me how adding a debug_level could fix the
> > problem.
>
> I suspect it was actually the restart, because the restart might cause
> sssd to reconnect to servers and operate online.
>
But prior to that change, I restarted sssd a dozen times and nothing fixed it
until I set debug_level = 6. That fixed the issue, but it did not make
any sense to me.
>
> What you can do, if for some reason running with debugging enabled all
> the time is not practical, is use the sss_debuglevel tool to bump
> debugging on the fly.
>
> But at any rate, we need to see the sssd logs to proceed.
>
The error in the log file was: nss_getpwnam: name 'dhpec' not found in domain
'hpc.org'. It seems to me sssd simply got the information from an invalid
cache, not from LDAP.
>
> > I suspect the issue is in the sssd cache, but I have no idea why, since
> > the default cache settings are only for some seconds, yet when a node
> > is caught in that problem, it can sit for many days with uids shown as
> > nobody and id returning no such user.
> >
> > After searching the Internet, I found a suggestion that running
> > sss_cache -E to invalidate all cached entries would solve the problem.
> > I tried it; it did not work.
>
> Well, if sssd was offline at that time, then invalidating the cache
> wouldn't help, because sssd wouldn't have anywhere to fetch the data from.
>
I checked the sssd process before running sss_cache -E; sssd was always
online. My question is, how do you verify that the cache has been cleaned? Or
do you simply delete /var/lib/sss/db?

sss_cache just expires user entries. If you want to really remove the
cache (careful, though..) then yes, at the moment you need to remove the
.ldb files under /var/lib/sss/db.
The next version of sss_cache should also have an option to really
delete the cache.
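
For reference, the difference between expiring and really removing the cache
could be sketched roughly as below. This is only an illustration, not a tested
procedure: the helper name list_sssd_cache_files is made up for this example,
and the commented-out sequence assumes a CentOS 6 style "service" init script.
Removing the .ldb files also throws away anything else stored there (for
example cached credentials used for offline logins), so review the file list
first.

```shell
#!/bin/sh
# Sketch only: list_sssd_cache_files is a hypothetical helper, not part of sssd.

# Print the cache database files (*.ldb) found in a directory.
# Defaults to /var/lib/sss/db, the location mentioned in this thread.
list_sssd_cache_files() {
    dir="${1:-/var/lib/sss/db}"
    for f in "$dir"/*.ldb; do
        # Skip the unexpanded glob when no .ldb files exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Manual sequence (as root), left commented out so the sketch has no side effects:
#   sss_cache -E                   # only marks cached entries as expired
#   service sssd stop              # stop sssd before touching the databases
#   list_sssd_cache_files          # review what is about to be removed
#   rm -f /var/lib/sss/db/*.ldb    # really remove the cache
#   service sssd start
#   id username                    # triggers a fresh lookup
```

If the listing is empty after the rm step and the files reappear once sssd is
started, the cache was really recreated rather than just expired.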