On Tue, Feb 06, 2018 at 09:55:00AM +0100, Pavel Březina wrote:
> On 02/05/2018 03:38 PM, Jakub Hrozek wrote:
>> I was helping analyze poor performance and server-side load spikes in an
>> environment where cluster nodes running sssd were all booted up at the
>> same time.
>> It turned out that this meant cache entries were expiring at the same
>> time and also the LDAP connection was expiring and reconnecting at the
>> same time. There are some tickets we filed (the ideas were mostly
>> William's) and I wanted to discuss them here.
>> - Extend object lifetime if the
>> object hadn't changed in a long time
>> - I think this is the most controversial and as we discussed a bit on
>> our phone call this is probably too dangerous to do by default.
>> Nonetheless, for resolving identity requests, it might be a
>> tunable that might provide a nice performance benefit.
> If configurable, it can be done. We may try to be clever here, e.g. refresh
> user if group changed.
The important thing here would be to ignore this optimization for
authentication requests and set a sane upper limit. Otherwise it might
be frustrating for admins to see that they configured e.g. an hour-long
cache, but changes they made on the server side don't propagate to the clients
for several hours.
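To make the tradeoff concrete, here is a minimal sketch of an extended-but-capped lifetime (all names are hypothetical illustrations, not sssd code or options):

```python
# Sketch of "extend lifetime if unchanged, but cap it" (hypothetical, not sssd).

def effective_cache_timeout(base_timeout, unchanged_refreshes,
                            growth=1.5, max_timeout=4 * 3600,
                            is_auth_request=False):
    """Return how long a cache entry stays valid, in seconds.

    base_timeout: the configured entry cache timeout.
    unchanged_refreshes: consecutive refreshes that returned no change.
    Authentication requests always use the configured timeout so that
    server-side changes (e.g. a disabled account) are noticed promptly.
    """
    if is_auth_request:
        return base_timeout
    extended = base_timeout * (growth ** unchanged_refreshes)
    return min(extended, max_timeout)  # sane upper limit for admins
```

The cap is what keeps the admin's expectations intact: no matter how stable an entry has been, changes still propagate within `max_timeout`.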
>> - Randomize cache lifetime by a
>> couple of percent
>> - What the title says. This might prevent hammering the servers in
>> case the cluster nodes went up at the same time and had the same
>> expire timestamps for all objects. Again, I'm not sure if this
>> makes sense by default, because it adds a bit of a fuzzy
>> behaviour, but I think it makes sense as a configurable.
> Yes. This might help a lot in this scenario.
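For illustration, a percentage-based jitter could be computed like this (a sketch with made-up parameter names, not an actual sssd option):

```python
import random

def jittered_timeout(timeout, jitter_percent=3, rng=None):
    """Spread a cache timeout by +/- jitter_percent so that entries
    written at the same moment do not all expire at the same moment."""
    rng = rng or random.Random()
    delta = timeout * jitter_percent / 100.0
    return timeout + rng.uniform(-delta, delta)
```

With a 3% jitter, cluster nodes booted together would see their hour-long timeouts land anywhere between roughly 58 and 62 minutes, which spreads the refresh load on the server.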
>> - Make sure periodical tasks use randomization
>> - The be_ptask API already supports a bit of randomization, but
>> we're not really using it. I guess the review should be
>> case-by-case, but at least for ptasks that fetch any data from the
>> back end, I would even just randomize a bit by default.
> It can be done directly in ptask, enabled by configuration option. However
> we must make sure that randomized schedules still make sense (e.g. sudo
> smart refresh interval << sudo full refresh interval).
OK, good idea, I added it to the ticket.
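As a sketch of both the randomized schedule and Pavel's sanity check (hypothetical helper names; be_ptask itself is C code):

```python
import random

def randomized_next_run(period, random_offset, rng=None):
    """Next execution delay: the base period plus up to random_offset
    seconds of fuzz, in the spirit of the randomization be_ptask allows."""
    rng = rng or random.Random()
    return period + rng.uniform(0, random_offset)

def offsets_keep_ordering(smart_period, full_period, random_offset):
    """Even in the worst case, a randomized smart refresh must still run
    well before the next full refresh (smart interval << full interval)."""
    return smart_period + random_offset < full_period
```

A check like the second function is what would catch a misconfiguration where the jitter is large enough to push, say, the sudo smart refresh past the full refresh.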
>> - Randomize ldap_connection_expire_timeout either by default or w/ a configure option
>> - Again, if all connections expire and are reconnected at the same
>> time, the servers suffer.
>> Does anyone have an opinion on the issues? I think at least the
>> connection timeout is something we should look at, because IIRC that was
>> causing the most issues on the IDM servers. The other tickets are IMO
less important and I'm not even sure if we should implement them by default.
> There are also two more things we can do:
> a) Implement "throttling" behaviour on periodical refresh of expired objects.
> We can update the objects in several batches scheduled over a large (and
> random) period of time.
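The throttling idea could look roughly like this (a sketch with hypothetical names, assuming expired entries can be enumerated up front):

```python
import random

def make_refresh_batches(expired_keys, n_batches, window, rng=None):
    """Split expired entries into n_batches groups and give each group a
    random start time within `window` seconds, turning one big refresh
    into several small ones spread over time."""
    rng = rng or random.Random()
    keys = list(expired_keys)
    rng.shuffle(keys)  # avoid systematic grouping of related entries
    batches = [keys[i::n_batches] for i in range(n_batches)]
    starts = sorted(rng.uniform(0, window) for _ in batches)
    return list(zip(starts, batches))
```

Every expired entry still gets refreshed exactly once; only the timing is smeared out, so the server sees several small queries instead of one spike.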
> b) I do not remember the exact name but we wanted to implement a 389ds
> mechanism that would push changes to clients instead of clients polling for
> them. We may want to revisit.
That's the Persistent Search control. It would be possible to use it for
sssd instances running on the IPA server itself, but not in general,
because each client then requires a thread on the server side.
A way to go would be to create a similar plugin, one that would keep
track of all registered SSSD clients and send information about changes
to all of them in batches. The client would then interpret the change
locally. This is out of scope for the moment, but we can consider it in
the future.