Hi Jakub,

Thanks for your response. Please see my comments inline below.

On Tue, Apr 12, 2016 at 6:24 PM, Jakub Hrozek <jhrozek@redhat.com> wrote:
On Tue, Apr 12, 2016 at 11:03:47AM +1000, jupiter wrote:
> Hi,
>
> We are running sssd version 1.12.4-47 on CentOS 6. It works fine in
> general, but from time to time, some nodes listed all user ids with
> "nobody",

Was this problem happening only on an NFS share?

I don't think it is an NFS issue; it is an SSSD issue.

> calling id username immediately returned "No such user",

Hmm, I guess not, this sounds like a generic issue, if even id
couldn't find the user.

The user is fine; I can run "id username" on another healthy node without any problems.
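
To compare a healthy node with a broken one, the same lookup can be scripted. This is only a minimal sketch: lookup_user is a hypothetical helper name of mine, not part of sssd, and it assumes Python's pwd module resolves names through NSS in the same way "id username" does (so through sssd when nsswitch.conf lists sss).

```python
import pwd


def lookup_user(name):
    """Return (uid, gid) if NSS can resolve the name, otherwise None.

    lookup_user is a hypothetical helper for illustration only.
    pwd.getpwnam() raises KeyError when no entry is found, which
    corresponds to id reporting "No such user".
    """
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        return None
    return entry.pw_uid, entry.pw_gid
```

Running lookup_user("username") on both nodes should give a (uid, gid) pair on the healthy node and None on the affected one, if the problem is what it appears to be.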

> it looks like
> the id lookup went to the cache and did not contact LDAP.

Please note that if the user was looked up at least once before, then
even if SSSD couldn't contact the server for one reason or another, it
should have returned entries from the cache.

Once again, the user ID is fine; we can verify it from other healthy nodes. Besides, once the node was fixed by adding debug_level = 6, everything went back to normal.
 
>
> On one occasion, I added debug_level = 6 to sssd.conf and restarted sssd;
> the "nobody" was gone and id username returned the correct LDAP user id. It
> did not make any sense to me how adding a debug_level could fix the
> problem.

I suspect it was actually the restart, because the restart might cause
sssd to reconnect to servers and operate online.

But prior to that change, I had restarted sssd a dozen times and nothing fixed it. Only setting debug_level = 6 fixed the issue, which does not make any sense to me.

What you can do, if for some reason running with debugging enabled all
the time is not practical, is use the sss_debuglevel tool to bump
debugging on the fly.

But at any rate, we need to see the sssd logs to proceed.

The error in the log file was: nss_getpwnam: name 'dhpec' not found in domain 'hpc.org'. It seems to me that sssd simply took the information from an invalid cache rather than from LDAP.

> I suspect the issue comes from the sssd cache, but I cannot explain it,
> since all the default cache settings only last for some seconds, yet when a
> node is caught in that problem, it can sit for many days with uids shown as
> nobody and id returning no such user.
>
> After searching the Internet, I found a suggestion that running sss_cache -E
> to invalidate all cached entries would solve the problem. I tried it; it did
> not work.

Well, if sssd was offline at that time, then invalidating the cache
wouldn't help, because sssd wouldn't have a way to fetch the data from.

I checked the sssd process before running sss_cache -E, and sssd was always online. My question is: how do you verify that the cache has been cleaned? Or do you simply delete /var/lib/sss/db?

Thank you.

Kind regards,

- h