(re-sending as I initially sent to ssd-users-owners in error)
For an AD environment using service discovery.
Periodically sssd will invalidate its cache at unexpected times. Digging around debug logs and sources leads me to understand the following:
Every 15 minutes (or as defined by ldap_connection_expire_timeout) sssd re-establishes the connection to LDAP, closing the existing connection. When sssd is configured to auto-discover (via DNS _srv_ records, where the priority is the same for each server), auto-discovery might return a different LDAP server, at which point sssd's stored uSNChanged values are invalid (as these are unique to each server), the cached values are cleared, and enumeration is run - essentially afresh - against the new LDAP server.
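For reference, this is the relevant knob in sssd.conf (the domain name below is just an example; 900 seconds is, as far as I can tell, the documented default, so this only restates the 15-minute interval):

    [domain/example.com]
    id_provider = ad
    # default is 900 seconds (15 minutes); raising it only delays the reconnect
    ldap_connection_expire_timeout = 900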
Is this outcome expected by design?
This behaviour is rather unfortunate as sssd_be will become a CPU hog while it rebuilds the cache again.
It is possible to work around the behaviour e.g.:
1) by not using service discovery, i.e.
ad_server = server1
ad_backup_server = server2
which is fairly tiresome to maintain across an estate - separate configurations for different sites etc., and faking load balancing by swapping configurations.
2) having different priorities for each AD server in a given site, losing load balancing - unless DNS gave out different priorities depending on the source of the request, but this seems messy.
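To illustrate the two shapes of SRV records (the zone and hostnames below are made up):

    ; same priority for both DCs: load is balanced, but discovery may hand back
    ; a different server after each reconnect
    _ldap._tcp.example.com. 600 IN SRV 0 100 389 dc1.example.com.
    _ldap._tcp.example.com. 600 IN SRV 0 100 389 dc2.example.com.

    ; distinct priorities (workaround 2): the choice is stable, but dc2 is only
    ; used when dc1 is unavailable, so load balancing is lost
    _ldap._tcp.example.com. 600 IN SRV 0  100 389 dc1.example.com.
    _ldap._tcp.example.com. 600 IN SRV 10 100 389 dc2.example.com.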
A better approach might be to patch sssd's auto-discovery to "stick" to the previously bound LDAP server (currently sssd simply uses the first server in the list of primary servers returned by ad_sort_servers_by_dns()). I have a proof-of-concept patch that is straightforward and fairly well contained; the behaviour is controlled by an ad_sticky option in sssd.conf.
Is there a better solution to this problem? Would a patch - as vaguely outlined above - likely gain acceptance?
On Fri, Jan 04, 2019 at 09:20:20AM +0000, R Davies wrote:
(re-sending as I initially sent to ssd-users-owners in error)
For an AD environment using service discovery.
Periodically sssd will invalidate its cache at unexpected times. Digging around debug logs and sources leads me to understand the following:
Every 15 minutes (or as defined by ldap_connection_expire_timeout) sssd re-establishes the connection to LDAP, closing the existing connection. When sssd is configured to auto-discover (via DNS _srv_ records, where the priority is the same for each server), auto-discovery might return a different LDAP server, at which point sssd's stored uSNChanged values are invalid (as these are unique to each server), the cached values are cleared, and enumeration is run - essentially afresh - against the new LDAP server.
Thank you very much for digging into the issue.
Is this outcome expected by design?
Honestly, I'm not sure and I would like some other developers to chime in with their opinion.
Historically, we've said that SSSD should stick to a 'working' server as long as it can, so on one hand I see the point in the sticky behaviour. On the other hand, I've also seen admins relying on the TTL validity of the SRV records, expecting that, if they change the SRV records, the client chooses a new server after the TTL expires.
This behaviour is rather unfortunate as sssd_be will become a CPU hog while it rebuilds the cache again.
It is possible to work around the behaviour e.g.:
- by not using service discovery, i.e.
Yes, in this case, the same server will always be selected from the list, working around the problem.
ad_server = server1
ad_backup_server = server2
which is fairly tiresome to maintain across an estate - separate configurations for different sites etc., and faking load balancing by swapping configurations.
- having different priorities for each AD server in a given site, losing
load balancing - unless DNS gave out different priorities depending on the source of the request, but this seems messy.
A better approach might be to patch sssd's auto-discovery to "stick" to the previously bound LDAP server (currently sssd simply uses the first server in the list of primary servers returned by ad_sort_servers_by_dns()). I have a proof-of-concept patch that is straightforward and fairly well contained; the behaviour is controlled by an ad_sticky option in sssd.conf.
Is there a better solution to this problem? Would a patch - as vaguely outlined above - likely gain acceptance?
If the behaviour is controllable by an option, my opinion is that it would be a good approach.
Would the stickiness also persist across SRV priority levels? What I mean is that if server1 originally had the highest priority (the lowest priority value in the SRV record), but then the SRV record expires and the server is suddenly in a lower priority tier, then IMO the server should be 'forgotten' and a new one chosen.
On Fri, 4 Jan 2019 at 10:19, Jakub Hrozek jhrozek@redhat.com wrote:
Would the stickiness also persist across SRV priority levels? What I mean is that if server1 originally had the highest priority (the lowest priority value in the SRV record), but then the SRV record expires and the server is suddenly in a lower priority tier, then IMO the server should be 'forgotten' and a new one chosen.
You're right to highlight this. Different admins may have different requirements; perhaps the configuration option "ad_sticky" could control the behaviour:
always - Always sticky: prefer the originally discovered server, unless the sticky server has been removed from the service record.
priority - Mostly sticky: prefer the originally discovered server, unless its priority in the service record has changed.
never - No stickiness (default, and the current behaviour), i.e. always potentially change LDAP server on expiry of the LDAP connection.
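For example, in sssd.conf this might look like (the option only exists in the proof-of-concept patch, not in released sssd):

    [domain/example.com]
    id_provider = ad
    # proposed option, not in released sssd; values: always | priority | never
    ad_sticky = priority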
In terms of implementation, would this be confined to the AD provider, or would IPA also benefit from it? If so, then perhaps it should live in fail_over_srv.c. I'm a bit unclear as to how this might be implemented in the fail_over_srv "plugin". The fo_discover_srv_* functions have a resolv_ctx available to them, but it would seem neater to have a dedicated fo_discover_ctx structure to store the configuration, along with the sticky LDAP server name.
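Roughly the kind of structure I had in mind - purely a sketch, the type and field names below are made up and are not existing SSSD code:

    enum fo_sticky_mode {
        FO_STICKY_NEVER = 0,   /* default: current behaviour, no stickiness */
        FO_STICKY_PRIORITY,    /* stick unless the server's SRV priority changed */
        FO_STICKY_ALWAYS       /* stick unless the server left the SRV record */
    };

    struct fo_discover_ctx {
        enum fo_sticky_mode sticky_mode;  /* parsed from e.g. ad_sticky */
        char *last_server;                /* FQDN of the previously bound LDAP server */
    };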
Thanks.
On Mon, Jan 07, 2019 at 06:01:08PM +0000, R Davies wrote:
On Fri, 4 Jan 2019 at 10:19, Jakub Hrozek jhrozek@redhat.com wrote:
Would the stickiness also persist across SRV priority levels? What I mean is that if server1 originally had the highest priority (the lowest priority value in the SRV record), but then the SRV record expires and the server is suddenly in a lower priority tier, then IMO the server should be 'forgotten' and a new one chosen.
You're right to highlight this. Different admins may have different requirements; perhaps the configuration option "ad_sticky" could control the behaviour:
always - Always sticky: prefer the originally discovered server, unless the sticky server has been removed from the service record.
priority - Mostly sticky: prefer the originally discovered server, unless its priority in the service record has changed.
never - No stickiness (default, and the current behaviour), i.e. always potentially change LDAP server on expiry of the LDAP connection.
As long as the default behaviour stays the same, I'm fine with just implementing never and always or never and priority. I think it's just important not to prevent extending the code further.
In terms of implementation, would this be confined to the AD provider, or would IPA also benefit from it? If so, then perhaps it should live in fail_over_srv.c. I'm a bit unclear as to how this might be implemented in the fail_over_srv "plugin". The fo_discover_srv_* functions have a resolv_ctx available to them, but it would seem neater to have a dedicated fo_discover_ctx structure to store the configuration, along with the sticky LDAP server name.
My initial idea was to create a wrapper around resolv_sort_srv_reply() that would take the previous server and optionally a flag parameter.
Then, if the previous server was present and the flags indicated that it should be preferred, it would just be moved to the first place in the list. The previous server would probably have to be kept somewhere in the failover code; maybe struct fo_ctx could be used?
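Something along these lines, purely as a sketch - the signature of resolv_sort_srv_reply() and the surrounding types are assumed here, so treat the details as illustrative rather than the real API:

    #include <stdbool.h>
    #include <strings.h>   /* strcasecmp() */
    #include <ares.h>      /* struct ares_srv_reply */

    /* errno_t, EOK and resolv_sort_srv_reply() would come from SSSD's util.h
     * and async_resolv.h; the signature used below is an assumption. */
    static errno_t sort_srv_reply_sticky(struct ares_srv_reply **reply,
                                         const char *previous_server,
                                         bool prefer_previous)
    {
        struct ares_srv_reply *cur;
        struct ares_srv_reply *prev = NULL;
        errno_t ret;

        /* Sort by SRV priority/weight as today. */
        ret = resolv_sort_srv_reply(reply);
        if (ret != EOK || previous_server == NULL || !prefer_previous) {
            return ret;
        }

        /* If the previously used server is still in the reply, unlink it and
         * move it to the head of the list so it is tried first again. */
        for (cur = *reply; cur != NULL; prev = cur, cur = cur->next) {
            if (strcasecmp(cur->host, previous_server) == 0) {
                if (prev != NULL) {
                    prev->next = cur->next;
                    cur->next = *reply;
                    *reply = cur;
                }
                break;
            }
        }

        return EOK;
    }

A priority-aware variant (the 'priority' mode) could additionally compare the candidate's priority against the head of the sorted list before moving it.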
Of course, fo_discover_ctx could also be used; did you think about creating it as a member of fo_ctx, maybe created during fo_set_srv_lookup_plugin()?