We've been experiencing an intermittent issue with SSSD v1.15.2 on our CentOS 7.4 workstations. We use SSSD to communicate with our Active Directory to pull users for auth. The majority of users have a certain group set as their primary group, and some departments have it as an additional group. Most of the time this group works fine on all workstations, but sometimes a user can no longer access the privileges granted by the group. For users who have it set as primary, the id command returns a gid without the name; for users who have it as an additional group, it doesn't appear at all. I've managed to capture output from the sssd services, and there are a few interesting lines that I thought I should share with you as I don't understand what they mean. I should add that when this error occurs, restarting sssd.service usually works; if not, sss_cache -E works; and if that doesn't work, removing the workstation from the realm, deleting the sssd db and rejoining seems to be the final trick that works.
Regarding the logs, the symptoms I noted are below:
1. getent group *mygroup* returns nothing
2. id user returns a gid without a resolved group name (if it is a primary group)
3. I had to leave the realm, delete the db and rejoin to get sssd to work properly again.
In sssd_nss.log I found this entry:
(Wed Aug 21 16:22:45 2019) [sssd[nss]] [nss_get_grent] (0x0040): Incomplete group object for group@domain.com[0]! Skipping
and in the sssd_domain.com.log:
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_nested_group_split_members] (0x4000): [CN=USER,OU=IT Privileged accounts,DC=domain,DC=com] is unknown object
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_nested_group_process_send] (0x0400): More members were missing than the deref threshold
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_nested_group_process_send] (0x2000): Looking up 11/224 members of group [CN=GROUP,OU=Security,OU=Groups,OU=Place St,OU=Offices,DC=domain,DC=com]
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_nested_group_process_send] (0x2000): Dereferencing members of group [CN=GROUP,OU=Security,OU=Groups,OU=Place St,OU=Offices,DC=domain,DC=com]
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_deref_search_send] (0x2000): Server supports ASQ
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_asq_search_send] (0x0400): Dereferencing entry [CN=GROUP,OU=Security,OU=Groups,OU=Place St,OU=Offices,DC=domain,DC=com] using ASQ
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_print_server] (0x2000): Searching XXX.XXX.XXX.XXX:389
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_get_generic_ext_send] (0x0400): WARNING: Disabling paging because scope is set to base.
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [no filter][CN=GROUP,OU=Security,OU=Groups,OU=Place St,OU=Offices,DC=domain,DC=com].
I've redacted the entries, but hopefully you can still get the gist of what's happening here. If there is anything else you need, please do not hesitate to ask! If these logs don't point to anything, could you maybe provide some advice on what to look for when parsing?
Thanks, Jamal
On Thu, Aug 22, 2019 at 11:11:18AM -0000, Jamal Mahmoud wrote:
We've been experiencing an intermittent issue with SSSD v1.15.2 on our CentOS 7.4 workstations. We use SSSD to communicate with our Active Directory to pull users for auth. The majority of users have a certain group set as their primary group, and some departments have it as an additional group. Most of the time this group works fine on all workstations, but sometimes a user can no longer access the privileges granted by the group. For users who have it set as primary, the id command returns a gid without the name; for users who have it as an additional group, it doesn't appear at all. I've managed to capture output from the sssd services, and there are a few interesting lines that I thought I should share with you as I don't understand what they mean. I should add that when this error occurs, restarting sssd.service usually works; if not, sss_cache -E works; and if that doesn't work, removing the workstation from the realm, deleting the sssd db and rejoining seems to be the final trick that works.
I'm not sure if this is a helpful response, but I would strongly encourage you to upgrade to 1.16.x. It is quite stable, and 1.15.x is not going to receive any fixes from either RH or upstream.
Regarding the logs, the symptoms I noted are below:
- getent group *mygroup* returns nothing
- id user returns a gid without a resolved group name (if it is a primary group)
- I had to leave the realm, delete the db and rejoin to get sssd to work properly again.
In sssd_nss.log I found this entry:
(Wed Aug 21 16:22:45 2019) [sssd[nss]] [nss_get_grent] (0x0040): Incomplete group object for group@domain.com[0]! Skipping
An 'incomplete' group in sssd-lingo is an optimization stub. When the flow doesn't need the full group to be resolved, SSSD does not resolve it; it only adds an entry to the cache that is internally marked as incomplete.
It would be more interesting to see sssd logs when you call "getent group $gidnumber" for the group whose number does not resolve.
and in the sssd_domain.com.log:
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_nested_group_split_members] (0x4000): [CN=USER,OU=IT Privileged accounts,DC=domain,DC=com] is unknown object
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_nested_group_process_send] (0x0400): More members were missing than the deref threshold
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_nested_group_process_send] (0x2000): Looking up 11/224 members of group [CN=GROUP,OU=Security,OU=Groups,OU=Place St,OU=Offices,DC=domain,DC=com]
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_nested_group_process_send] (0x2000): Dereferencing members of group [CN=GROUP,OU=Security,OU=Groups,OU=Place St,OU=Offices,DC=domain,DC=com]
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_deref_search_send] (0x2000): Server supports ASQ
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_asq_search_send] (0x0400): Dereferencing entry [CN=GROUP,OU=Security,OU=Groups,OU=Place St,OU=Offices,DC=domain,DC=com] using ASQ
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_print_server] (0x2000): Searching XXX.XXX.XXX.XXX:389
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_get_generic_ext_send] (0x0400): WARNING: Disabling paging because scope is set to base.
(Wed Aug 21 16:22:43 2019) [sssd[be[domain.com]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [no filter][CN=GROUP,OU=Security,OU=Groups,OU=Place St,OU=Offices,DC=domain,DC=com].
This looks fine, it's just sssd looking up all the members of the group.
Hi Jakub,
Thanks for taking the time to look at this for me. I'm going to spin up a VM and test the newer version of SSSD but because the issue is so intermittent it is difficult to know whether the update fixes the issue.
I'll update this when the error appears again and try to get log outputs for the getent group gid. Thanks again for your time.
Jamal
Hi Jakub,
I've managed to catch the error again on my own machine, so this time I've had time to capture the issue properly. I've been looking into the logs, and what seems to be happening is that we have multiple AD domains active: our local AD domain and a trusted forest are both being used as active domains in LDAP searches. I want to know if this is a known behaviour. Our local AD responds to a request from sssd_be and the correct group is filled into the nss cache; then a response arrives from the trusted domain, where the group doesn't exist, so the cache entry is overwritten with "no such group". I think the issue is intermittent because sometimes LDAP will query the remote forest and other times the local one. Please advise on whether this is plausible or not.
Thanks, Jamal
On Mon, Aug 26, 2019 at 04:25:38PM -0000, Jamal Mahmoud wrote:
Hi Jakub,
I've managed to catch the error again on my own machine, so this time I've had time to capture the issue properly. I've been looking into the logs, and what seems to be happening is that we have multiple AD domains active: our local AD domain and a trusted forest are both being used as active domains in LDAP searches. I want to know if this is a known behaviour. Our local AD responds to a request from sssd_be and the correct group is filled into the nss cache; then a response arrives from the trusted domain, where the group doesn't exist, so the cache entry is overwritten with "no such group". I think the issue is intermittent because sometimes LDAP will query the remote forest and other times the local one. Please advise on whether this is plausible or not.
It would be nice to see some log snippet to see the behaviour exactly, but in general, requests towards the trusted back ends should be sequential. The only similar pattern might be where sssd first checks with the help of the global catalog which domain the group resides in and then queries that domain's LDAP port.
Hi Jakub,
Apologies for the long delay in responding; I was dragged away onto other projects! So my previous (false) theory was a result of sssd_nss not being able to see entries that sssd_be places into the memcache. It would not be able to find its group, so it would query any other domains it could see. I've now been able to narrow the issue down to an even smaller plausible cause.
After some digging through the logs of an affected machine I've discovered a very interesting set of entries. This is quite verbose, so bear with me. The command I ran was:
$ getent group $GID
Here we see NSS requesting a GID number to be resolved/fetched:
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [nss_getby_id] (0x0400): Input ID: 1000001111
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_send] (0x0400): CR #187: New request 'Group by ID'
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_select_domains] (0x0400): CR #187: Performing a multi-domain search
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_domains] (0x0400): CR #187: Search will check the cache and check the data provider
First it checks whether an entry is present in its cache:
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_set_domain] (0x0400): CR #187: Using domain [mydomain.com]
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_send] (0x0400): CR #187: Looking up GID:1000001111@mydomain.com
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_ncache] (0x0400): CR #187: Checking negative cache for [GID:1000001111@mydomain.com]
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_ncache] (0x0400): CR #187: [GID:1000001111@mydomain.com] is not present in negative cache
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in cache
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Object [GID:1000001111@mydomain.com] was not found in cache
It's not present in the cache according to NSS so it requests the backend to search the domain provider for this entry:
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_dp] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in data provider
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request for [0x55ee5f5d1b10:2:1000001111@mydomain.com]
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_get_account_msg] (0x0400): Creating request for [mydomain.com][0x2][BE_REQ_GROUP][idnumber=1000001111:-]
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering request [0x55ee5f5d1b10:2:1000001111@mydomain.com]
The domain provider finds the entry on the domain and fills the cache:
(Thu Oct 17 17:01:42 2019) [sssd[be[mydomain.com]]] [sdap_save_group] (0x0400): Storing info for group TheGroup@mydomain.com
(Thu Oct 17 17:01:42 2019) [sssd[be[mydomain.com]]] [sysdb_store_group] (0x1000): The group record of TheGroup@mydomain.com did not change, only updated the timestamp cache
What's interesting is that the sssd_be tells us that the entry was already present, yet nss was unaware of any entries in the cache, nss didn't even say (not exact quote) "entry found, needs updating", which after some testing on a working machine, is what occurs when nss encounters an entry that is out of date.
Here we see sssd_nss responding to the backend's return of the group entry:
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_get_reply] (0x1000): Got reply from Data Provider - DP error code: 0 errno: 0 error message: Success
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in cache
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Object [GID:1000001111@mydomain.com] was not found in cache
Either the backend is entering data wrongly into the cache or nss is unable to read certain entries that are placed in the cache, I would like to think it is the former because the issue is so intermittent and for the most part, it works correctly. The only workaround we have found so far is stopping sssd.service, removing the ldbs from /var/lib/sss/db/ and restarting the service. This allows sssd to work correctly and return the groups correctly.
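For anyone wanting to spot the same pattern in their own sssd_nss.log, the walk-through above can be sketched as a small Python filter. This is only an illustration built on the "CR #187" lines quoted in this message: the function names and the SAMPLE list are made up for the example, and the matched message texts are assumed to look like the ones shown here, which may differ between SSSD versions.

```python
import re

# Matches the "CR #<n>:" tag that sssd_nss puts on cache-request log lines.
CR_TAG = re.compile(r"CR #(\d+):")


def trace_cache_request(log_lines, cr_id):
    """Return the messages logged for one cache request, in log order."""
    messages = []
    for line in log_lines:
        match = CR_TAG.search(line)
        if match and int(match.group(1)) == cr_id:
            # Keep only the text that follows the "CR #n:" tag.
            messages.append(line[match.end():].strip())
    return messages


def missed_after_provider_lookup(messages):
    """True if a cache lookup still misses after the data provider was asked."""
    asked_provider = False
    for msg in messages:
        if "in data provider" in msg:
            asked_provider = True
        elif asked_provider and "was not found in cache" in msg:
            return True
    return False


# Abbreviated copies of the log lines quoted above (illustrative input only).
SAMPLE = [
    "[cache_req_search_cache] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in cache",
    "[cache_req_search_cache] (0x0400): CR #187: Object [GID:1000001111@mydomain.com] was not found in cache",
    "[cache_req_search_dp] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in data provider",
    "[sss_dp_get_reply] (0x1000): Got reply from Data Provider - DP error code: 0 errno: 0 error message: Success",
    "[cache_req_search_cache] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in cache",
    "[cache_req_search_cache] (0x0400): CR #187: Object [GID:1000001111@mydomain.com] was not found in cache",
]

if __name__ == "__main__":
    steps = trace_cache_request(SAMPLE, 187)
    print(missed_after_provider_lookup(steps))
```

A first miss before the data provider is contacted is normal; the sketch only flags the case where the miss repeats after the provider lookup, which is the symptom described above.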
I hope this makes sense, Thanks, Jamal
Just as an update: I've managed to catch the error again on my machine by forcing the cache to update much more often. I've compared the group entry in the cache before and after the corruption occurs and found some interesting differences:
- the entryUSN is different (not sure if that matters)
- the isPosix flag is set to FALSE on the corrupted entry
- the gidNumber is set to 0
All the users and ghost members, the SID and the uniqueID are the same. I also managed to pull an "AD group type flag set" of 0x80000004, if that means anything at all to you.
Will update with more info as I see more.
Thanks, Jamal
Just bumping this thread to see if anyone has seen this? This is an issue we are experiencing on and off and is causing a lot of trouble for our services.
Thanks, Jamal
On Fri, Oct 18, 2019, at 11:41 AM, Jamal Mahmoud wrote:
Either the backend is entering data wrongly into the cache or nss is unable to read certain entries that are placed in the cache, I would like to think it is the former because the issue is so intermittent and for the most part, it works correctly. The only workaround we have found so far is stopping sssd.service, removing the ldbs from /var/lib/sss/db/ and restarting the service. This allows sssd to work correctly and return the groups correctly.
Thanks for digging thru this. I've experienced similar behavior, but never chased it down. No further workarounds I've discovered.
V/r, James Cassell
P.S., would be helpful if you include context in your replies.
Hi James,
Thanks for getting back to me, we are getting a little desperate relating to this issue.
Thanks for digging thru this. I've experienced similar behavior, but never chased it down. No further workarounds I've discovered.
I'm not really sure how this helps us. Are you saying that you are aware of the issue and are looking to get it fixed? Or are you going to actively ignore that this is a real issue? I've spent quite some time redacting the logs that Jakub requested so that he could maybe help point us in the right direction to finding the cause. I've not received any response from him since, and your latest response falls short of helpful to that cause. Is there really nothing else I can do to figure out what is going wrong here? I'm more than willing to work with the devs to get to the bottom of this, but the enthusiasm in your response does not suggest the feeling is mutual. I would have thought that this is the place to find help on bugs relating to SSSD. Please get back to me with some advice other than "including context" in a reply.
Thank you, Jamal
Hi,
Can I ask why there is a lack of willingness to help us out on this issue? I would have thought that it is in the sssd's best interest to resolve issues users are having. Is there anybody on the dev team that can provide assistance on this issue? Looking forward to any response.
Thanks, Jamal
On Thu, Nov 14, 2019 at 03:54:35PM -0000, Jamal Mahmoud wrote:
Hi,
Can I ask why there is a lack of willingness to help us out on this issue? I would have thought that it is in the sssd's best interest to resolve issues users are having. Is there anybody on the dev team that can provide assistance on this issue? Looking forward to any response.
Hi,
I'm sorry for the delay in response. I silenced this thread originally since Jakub was handling it, but unfortunately Jakub has other responsibilities nowadays and I forgot to look at this thread.
Some time ago you said:
I've managed to catch the error again on my machine by forcing the cache to update much more often. I've compared the group entry in the cache before and after the corruption occurs and found some interesting differences:
- the entryUSN is different (not sure if that matters)
- the isPosix flag is set to FALSE on the corrupted entry
- the gidNumber is set to 0
All the users and ghost members, the SID and the uniqueID are the same. I also managed to pull an "AD group type flag set" of 0x80000004, if that means anything at all to you.
The '4' at the end of the groupType attribute indicates that the given group is a group with 'Domain Local' scope, i.e. it is only valid in its parent domain.
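As a rough illustration of how a value like 0x80000004 breaks down, here is a minimal Python sketch that decodes the groupType bit field. The bit values are the ones Microsoft documents for the groupType attribute; the function name and the short labels are chosen just for this example.

```python
# Bit values of the AD groupType attribute, as documented by Microsoft.
# The short labels are informal names used only in this sketch.
GROUP_TYPE_FLAGS = {
    0x00000001: "system-created",
    0x00000002: "global scope",
    0x00000004: "domain local scope",
    0x00000008: "universal scope",
    0x00000010: "APP_BASIC",
    0x00000020: "APP_QUERY",
    0x80000000: "security group",
}


def decode_group_type(value):
    """Return the labels of all flags set in a groupType value."""
    # AD tools often print groupType as a signed 32-bit integer,
    # so fold negative values back into the unsigned range first.
    value &= 0xFFFFFFFF
    return [name for bit, name in sorted(GROUP_TYPE_FLAGS.items()) if value & bit]


if __name__ == "__main__":
    print(decode_group_type(0x80000004))
```

For the value from this thread the sketch reports a security group with domain local scope; a group converted to global scope would carry bit 0x2 instead of 0x4.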
Now it would be important to know if the group is defined in the same domain your client is joined to or if it is coming from a different domain.
In the latter case (group is coming from a different domain) the cache attributes 'isPosix: FALSE' and 'gidNumber: 0' are expected because SSSD will filter out those groups because they are not valid in the domain the client is joined to. My guess is that using the group as primary group for some users might use a code path that skips the filter so that it is added to the cache as proper group first but later on another lookup uses the filter and removes the gidNumber and sets isPosix to FALSE. To solve this you should switch the scope of the group to 'Global' or 'Universal'.
In the former case (group is from the domain the client is joined to) it would be good to know if you are using UIDs and GIDs stored in AD or if you use SSSD's id-mapping scheme (ldap_id_mapping = False / True respectively).
bye, Sumit
Thanks, Jamal
_______________________________________________
sssd-users mailing list -- sssd-users@lists.fedorahosted.org
To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.o...
Hi Sumit,
Thank you for getting back to me.
Now it would be important to know if the group is defined in the same domain your client is joined to or if it is coming from a different domain.
In the latter case (group is coming from a different domain) the cache attributes 'isPosix: FALSE' and 'gidNumber: 0' are expected because SSSD will filter out those groups because they are not valid in the domain the client is joined to. My guess is that using the group as primary group for some users might use a code path that skips the filter so that it is added to the cache as proper group first but later on another lookup uses the filter and removes the gidNumber and sets isPosix to FALSE. To solve this you should switch the scope of the group to 'Global' or 'Universal'.
In the former case (group is from the domain the client is joined to) it would be good to know if you are using UIDs and GIDs stored in AD or if you use SSSD's id-mapping scheme (ldap_id_mapping = False / True respectively).
bye, Sumit
The group we are using is defined in the same domain and for the most part it does in fact return the correct GIDs for the group. We have set ldap_id_mapping to false as we are using POSIX attributes on our AD users and groups.
I noticed that whenever the backend fills the cache with the wrong data, any updates do not actually modify the cache entry, we get this:
[sdap_save_group] (0x0400): Storing info for group group@domain.com
[sysdb_store_group] (0x1000): The group record of group@domain.com did not change, only updated the timestamp cache
But occasionally, seemingly by chance, it does modify the entry, fixing the problem and setting the group to isPosix: TRUE, and then we get this:
[sdap_save_group] (0x0400): Storing info for group group@domain.com
[sysdb_set_entry_attr] (0x0200): Entry [name=group@domain.com,cn=groups,cn=domain.com,cn=sysdb] has set [cache, ts_cache] attrs.
Not sure if any of that means anything towards the issue, just trying to give as much information as I can!
Do let me know if there is anything more you need, Thanks, Jamal
On Fri, Nov 15, 2019 at 10:58:17AM -0000, Jamal Mahmoud wrote:
Hi Sumit,
Thank you for getting back to me.
The group we are using is defined in the same domain and for the most part it does in fact return the correct GIDs for the group. We have set ldap_id_mapping to false as we are using POSIX attributes on our AD users and groups.
Ok, do you know if the LDAP attributes uidNumber and gidNumber are replicated to the Global Catalog in your environment? By default they are not.
You can check this manually as well with ldapsearch on the Global Catalog port 3268:
ldapsearch -H ldap://your-ad-dc.your.ad.domain:3268 -b 'DC=your,DC=ad,DC=domain' samAccountName=groupname
If gidNumber is missing in the Global Catalog object please try if setting
ad_enable_gc = False
in the [domain/...] section of sssd.conf makes the group lookup more reliable.
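If you want to script the check, the following Python sketch looks for gidNumber in the LDIF text that ldapsearch prints for the Global Catalog query above. It is only an outline: the function names and the sample entries in the test data are hypothetical, and it inspects just the first entry returned.

```python
def first_entry_attributes(ldif_text):
    """Collect the attribute names of the first entry in LDIF output."""
    names = set()
    in_entry = False
    for line in ldif_text.splitlines():
        if not in_entry:
            # An entry starts at its "dn:" line; skip the header before it.
            in_entry = line.lower().startswith("dn:")
            continue
        if not line.strip():
            break  # a blank line ends the entry
        if line.startswith((" ", "#")):
            continue  # folded continuation line or comment
        # "name: value" and "name:: base64" both split at the first colon.
        names.add(line.partition(":")[0].lower())
    return names


def has_posix_gid(ldif_text):
    """True if the first entry carries a gidNumber attribute."""
    return "gidnumber" in first_entry_attributes(ldif_text)


# Hypothetical Global Catalog result without POSIX attributes.
GC_RESULT = """# extended LDIF

dn: CN=groupname,OU=Groups,DC=your,DC=ad,DC=domain
cn: groupname
sAMAccountName: groupname

# search result
"""

if __name__ == "__main__":
    print(has_posix_gid(GC_RESULT))
```

Running the same query against the regular LDAP port 389 and comparing the two results shows whether the attribute is present in the domain but missing from the Global Catalog.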
bye, Sumit
Hi Sumit,
I'm just after checking and you are correct! The LDAP search through the Global Catalog does not return any POSIX attributes. We're going to apply this setting and see if the errors stop occurring. If this is the solution I owe you a drink (or 5).
Thanks, Jamal
On 2019-11-15 04:25, Jamal Mahmoud wrote:
Hi Sumit,
I'm just after checking and you are correct! The LDAP search through the Global Catalog does not return any POSIX attributes. We're going to apply this setting and see if the errors stop occurring. If this is the solution I owe you a drink (or 5).
Thanks, Jamal
Yep. The docs say that all those POSIX attributes should be marked as being part of the GC, which they aren't by default. You need to use the AD schema too to do that IIRC.
I've also encountered issues with groups going missing, and in fact I'm working on such an issue now. In our case, all the POSIX stuff is replicated to the GC. What happens is that the user's groups are fine for a long time (8-10 hours), then either a single group vanishes, OR all but their login group vanish. The only thing that brings them back immediately is stopping SSSD, removing /var/lib/sss/db/*, and restarting it. Then the groups will be back for that semi-random period.
I had another case of this issue a few weeks ago. But in this case it turned out that there was an automated process on the AD that was removing users from groups, then adding them back shortly after. It seems that SSSD would sometimes catch it at the right time and remove the user from the group, or sometimes bug out and remove all the user's groups except the user entry's gidNumber group (primary login group).
This appears to me to be some sort of bug with SSSD where once it removes a group in the cache, it doesn't restore it when the user comes back. Perhaps negative caching (intended, or not)?
- Jim
On Fri, Nov 15, 2019 at 04:57:27AM -0800, Jim Burwell wrote:
On 2019-11-15 04:25, Jamal Mahmoud wrote:
On Fri, Nov 15, 2019 at 10:58:17AM -0000, Jamal Mahmoud wrote:
Ok, do you know if the LDAP attributes uidNumber and gidNumber are replicated to the Global Catalog in your environment? By default they are not.
You can check this manually as well with ldapsearch on the Global Catalog port 3268:
ldapsearch -H ldap://your-ad-dc.your.ad.domain:3268 -b
'DC=your,DC=ad,DC=domain' samAccountName=groupname
If gidNumber is missing in the Global Catalog object please try if setting
ad_enable_gc = False
in the [domain/...] section of sssd.conf makes the group lookup more reliable.
bye, Sumit
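The Global Catalog check described above can be scripted. The following is a minimal Python sketch, not an official SSSD tool: the function names are hypothetical, the host and base DN are whatever applies to your environment, and authentication options are left out just as in the command above (in most AD environments you would have to add bind options such as `-Y GSSAPI` with a valid Kerberos ticket).

```python
import subprocess


def build_gc_search(host, base_dn, group):
    """Build an ldapsearch command against the Global Catalog port (3268).

    Only gidNumber is requested, so an entry without it comes back as a bare DN.
    Bind/authentication options are deliberately omitted (assumption: add your own).
    """
    return [
        "ldapsearch", "-LLL",
        "-H", f"ldap://{host}:3268",
        "-b", base_dn,
        f"(sAMAccountName={group})",
        "gidNumber",
    ]


def has_attribute(ldif_text, attribute):
    """Return True if any LDIF line starts with '<attribute>:' (case-insensitive)."""
    prefix = attribute.lower() + ":"
    return any(line.lower().startswith(prefix) for line in ldif_text.splitlines())


def gc_has_gid_number(host, base_dn, group):
    """Run the search and report whether the GC copy of the group carries gidNumber."""
    result = subprocess.run(
        build_gc_search(host, base_dn, group),
        capture_output=True, text=True, check=True,
    )
    return has_attribute(result.stdout, "gidNumber")
```

If `gc_has_gid_number(...)` returns False while the same search on the regular LDAP port does show gidNumber, that matches the situation in which `ad_enable_gc = False` was suggested.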
I've also encountered issues with groups going missing, and in fact I'm working on such an issue now. In our case, all the POSIX stuff is replicated to the GC. What happens is that the user's groups are fine for a long time (8-10 hours), then either a single group vanishes or all but their login group vanish. The only thing that brings them back immediately is stopping SSSD, removing /var/lib/sssd/db/*, and restarting it. Then the groups will be back for that semi-random period.
Hi,
are the groups from the domain the client is joined to or from a different domain in the forest?
I had another case of this issue a few weeks ago. But in this case it turned out that there was an automated process on the AD side that was removing users from groups, then adding them back shortly after. It seems that SSSD would sometimes catch it at the right time and remove the user from the group, or sometimes bug out and remove all of the user's groups except the user entry's gidNumber group (primary login group).
Are the groups being removed as well during this process and then added back with the same name?
Can you share your sssd.conf?
bye, Sumit
This appears to me to be some sort of bug with SSSD where once it removes a group in the cache, it doesn't restore it when the user comes back. Perhaps negative caching (intended, or not)?
- Jim
sssd-users mailing list -- sssd-users@lists.fedorahosted.org To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.o...
On 2019-11-15 06:10, Sumit Bose wrote:
Hi,
are the groups from the domain the client is joined to or from a different domain in the forest?
Same domain.
Are the groups being removed as well during this process and then added back with the same name?
I'm not positive about that. I presumed that the groups themselves were being left in place and users were just being removed from the group and then re-added by an automated process. But I'll inquire as to whether this was the case. I have doubts because the groups exist with the proper gidNumber, which I don't believe this automated process handles.
Can you share your sssd.conf?
#
# sssd.conf for SSSD versions which have the AD provider module
# (preferred method)
#
[sssd]
config_file_version = 2
domains = {{ krb5_realm }}
services = nss, pam, pac
# remove PAC if it causes slow/failed login
#services = nss, pam
# uncomment for heavy debugging
# debug_level = 0x37F0

[nss]
# in case home dir isn't defined in AD/ldap unixHomeDirectory
fallback_homedir = /h/%u

[pam]
# debugging
#pam_verbosity = 5
# custom messages
pam_account_expired_message = Account Expired
pam_account_locked_message = Account Locked

# domain section
[domain/{{ krb5_realm }}]
# uncomment for heavy debugging
# debug_level = 0x37F0
#ad_gpo_access_control = permissive
# disabled, causes issues w/ some OSes
ad_gpo_access_control = disabled

# Use AD provider
id_provider = ad
access_provider = ad
auth_provider = ad
chpass_provider = ad
ldap_id_mapping = False
cache_credentials = True
ldap_schema = ad

# Search base
ldap_search_base = dc=widgetco,dc=com
# Speed up search with narrowed user search base if required
#ldap_user_search_base = OU=Users,OU=Corp,DC=widgetco,DC=com
ldap_user_object_class = user

# Speed up search if required. See sssd-ldap man page for how to
# specify complex search bases with multiple bases, etc.
# ldap_group_search_base = ou=users,dc=widgetco,dc=com
ldap_group_object_class = group

# Use AD unix attributes instead of generating UID/GID, etc
ldap_id_mapping = False

# where AD keeps these attributes
ldap_user_name = sAMAccountName
ldap_user_home_directory = unixHomeDirectory

# expire policies
ldap_access_order = expire
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = True

# specify only if needed (DNS SRV records used otherwise)
#ad_server = dcwidgetcoprim.widgetco.com

# Let DHCP client handle this
dyndns_update = false

# This greatly improves login speed, but getent group groupname won't show
# group members (but group lookups for the OS still work). It is basically
# required for complex AD group schemas, or AD environments with LOTS of
# groups. It may be possible to turn it off if a group search base can
# be devised which speeds thing up. But if there were many many groups, we
# still need it.
ignore_group_members = true
bye, Sumit
thanks.
Hi Sumit,
After adding ad_enable_gc = False, the errors we are getting don't seem to have stopped. The latest problem (today) was a logged-in user who had only his uid and his primary GID. I'm not sure if this is a different issue, but I'm starting to get the feeling that something is misconfigured in our SSSD client setup.
That said, since I rolled this out, the machines with the new config have not hit the "non-POSIX POSIX group in the cache" problem we've been discussing, so it may be solved, or coincidentally the specific error just hasn't come up again.
As an aside, I've noticed that when the backend fetches new data for the cache, sometimes it will just update the ts_cache and sometimes it will update both the cache and the ts_cache. What determines this behaviour? I'm asking because a refresh that updates the cache actually fixes the problem, but when it only changes the ts_cache the issue remains. I've added a couple of examples to explain:
Updates both cache and ts_cache:
[sdap_save_group] (0x0400): Storing info for group group@domain.com
[sysdb_set_entry_attr] (0x0200): Entry [name=group@domain.com,cn=groups,cn=domain.com,cn=sysdb] has set [cache, ts_cache] attrs.
Updates only the ts_cache:
[sdap_save_group] (0x0400): Storing info for group group@domain.com
[sysdb_store_group] (0x1000): The group record of group@domain.com did not change, only updated the timestamp cache
Realistically, it should see that the incoming data is different to the cached data, no?
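To see how often each of the two cases above occurs per group, the backend log can be scanned for those two messages. This is only a rough sketch in Python: the regular expressions are derived from the two log excerpts quoted above and may not match other SSSD versions, the function name is hypothetical, and `sysdb_set_entry_attr` can be logged for reasons other than a group refresh, so treat the counts as a hint rather than proof.

```python
import re
from collections import Counter

# Patterns taken from the two log excerpts quoted in this message (assumption:
# the message wording is the same in the SSSD version being examined).
FULL_RE = re.compile(
    r"\[sysdb_set_entry_attr\].*Entry \[name=([^,\]]+),cn=groups,"
    r".*has set \[cache, ts_cache\] attrs"
)
TS_ONLY_RE = re.compile(
    r"\[sysdb_store_group\].*The group record of (\S+) did not change"
)


def classify_group_updates(lines):
    """Count, per group, refreshes that wrote both caches vs. only the ts_cache."""
    counts = Counter()
    for line in lines:
        match = FULL_RE.search(line)
        if match:
            counts[(match.group(1), "cache+ts_cache")] += 1
            continue
        match = TS_ONLY_RE.search(line)
        if match:
            counts[(match.group(1), "ts_cache only")] += 1
    return counts
```

For example, `classify_group_updates(open("/var/log/sssd/sssd_domain.com.log"))` (the path is an assumption based on the log file name mentioned earlier in the thread) returns a counter keyed by `(group, kind)`.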
Sorry for the heavy message, please let me know if you need any specifics and I'll be glad to provide. Really appreciate the time you're giving to help us out.
Kind Regards, Jamal
On Mon, Nov 18, 2019 at 03:12:35PM -0000, Jamal Mahmoud wrote:
Realistically it should see that the incoming data is different to the cached data no?
Hi,
we use the timestamp LDAP attribute, which stores the last modification time, to decide whether it is sufficient to just update the timestamp cache or whether the data cache should be updated as well. In AD this attribute is called 'whenChanged', and the comparison between the stored and the new value is just a string comparison; if they differ, the data cache will be updated as well.
What you describe might have two reasons. If you add a member to a group on Domain Controller (DC) A but SSSD is connected to DC B, SSSD might refresh the group after the new member is added but before the new data is replicated to DC B. This is somewhat expected; once the data is finally replicated to DC B and the cached group has expired again, the data cache will be updated.
The second reason might be that 'whenChanged' is handled even more differently than we thought. 'whenChanged' itself is not replicated from DC A to DC B, but DC B should set 'whenChanged' to the current time when the new data from DC A is added to DC B. So it is expected that 'whenChanged' for the same object is different on every DC, but it should change whenever the data is updated. That's e.g. the reason we just do a string comparison: 'whenChanged' only has to be different to update the data cache, and we do not care if the numerical values are larger or smaller.
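The decision described above can be illustrated with a few lines of Python. This is a sketch of the rule as explained here, not SSSD's actual code (SSSD is written in C); the function name and the sample timestamps are made up for illustration.

```python
def needs_data_cache_update(cached_when_changed, new_when_changed):
    """Plain string inequality, no ordering: any difference means the data
    cache is rewritten; identical strings mean only the timestamp cache is."""
    return cached_when_changed != new_when_changed


# Hypothetical values in AD's generalized-time format.
cached = "20191118151235.0Z"

# Same string (e.g. the DC has not received the replicated change yet):
# only the timestamp cache would be touched.
print(needs_data_cache_update(cached, "20191118151235.0Z"))

# Different string, even a numerically *smaller* one (e.g. SSSD switched to a
# DC that stamped the object at another time): the data cache is updated.
print(needs_data_cache_update(cached, "20191118151102.0Z"))
```

The second call shows why ordering is ignored: because every DC sets its own 'whenChanged', a "smaller" value after a DC switch still has to trigger a refresh of the data cache.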
Sorry for the heavy message, please let me know if you need any specifics and I'll be glad to provide. Really appreciate the time you're giving to help us out.
I wonder if you can try to monitor the replication in your environment by adding a new member to a group on one Domain Controller and checking the group object on another Domain Controller at regular intervals (e.g. every 2s) with ldapsearch, especially the 'member' and 'whenChanged' attributes. This should help to understand whether replication just takes long or whether 'member' and 'whenChanged' are not updated together on the other DC.
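One possible way to run the check suggested above is a small polling script around ldapsearch. The sketch below is written in Python under several assumptions: all function names are hypothetical, authentication options for ldapsearch are omitted and must be added for your environment, and very large groups (where AD returns ranged `member;range=...` attributes) are not handled.

```python
import subprocess
import time


def build_group_query(dc_host, base_dn, group):
    """ldapsearch command asking one specific DC for member and whenChanged."""
    return [
        "ldapsearch", "-LLL",
        "-H", f"ldap://{dc_host}",
        "-b", base_dn,
        f"(sAMAccountName={group})",
        "member", "whenChanged",
    ]


def attribute_values(ldif_text, attribute):
    """Collect the values of one attribute from LDIF output."""
    unfolded = ldif_text.replace("\n ", "")  # LDIF wraps long lines with a leading space
    prefix = attribute.lower() + ":"
    return [
        line[len(prefix):].lstrip(":").strip()  # base64 ("::") values stay encoded
        for line in unfolded.splitlines()
        if line.lower().startswith(prefix)
    ]


def snapshot(ldif_text):
    """Reduce LDIF output to (whenChanged, sorted member DNs)."""
    changed = attribute_values(ldif_text, "whenChanged")
    members = tuple(sorted(attribute_values(ldif_text, "member")))
    return (changed[0] if changed else None, members)


def monitor(dc_host, base_dn, group, interval=2.0, rounds=30):
    """Poll the DC and print a line whenever whenChanged or the member list changes."""
    previous = None
    for _ in range(rounds):
        result = subprocess.run(
            build_group_query(dc_host, base_dn, group),
            capture_output=True, text=True, check=True,
        )
        current = snapshot(result.stdout)
        if current != previous:
            print(time.strftime("%H:%M:%S"), current[0], len(current[1]), "members")
            previous = current
        time.sleep(interval)
```

Pointing `monitor(...)` at the DC that did not receive the change directly, and then adding a member on the other DC, should show whether the member list and 'whenChanged' change in the same poll or at different times.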
Thanks
bye, Sumit