On 3 May 2016 at 17:04, Jakub Hrozek <jhrozek(a)redhat.com> wrote:
On Tue, May 03, 2016 at 05:30:23PM +0200, Jakub Hrozek wrote:
>
> Yes, every cache update does 4 of these. This is a known issue I'm
> working on right now:
>
> https://fedorahosted.org/sssd/ticket/2602
> In particular:
>
> https://fedorahosted.org/sssd/wiki/DesignDocs/OneFourteenPerformanceImpro...
By the way, here is a comparison from my WIP branch. Without the patches,
running 'id' for a user who is a member of several hundred large groups
takes the following:
Total run time of id was: 19415 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 7694 ms
--> Time spent writing to LDB: 2958 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4548 ms
With the patches to avoid cache writes:
Total run time of id was: 9482 ms
Number of zero-level cache transactions: 283
--> Time spent in level-0 sysdb transactions: 1074 ms
--> Time spent writing to LDB: 38 ms
Number of LDAP searches: 562
Time spent waiting for LDAP: 4792 ms
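To make the two runs easier to compare, here is a small sketch that turns the figures above into percentage reductions. The dictionary keys are labels made up for this illustration, not names SSSD uses; the numbers are copied from the two runs as posted.

```python
# Figures copied from the two 'id' runs quoted above (all times in ms).
# The key names are illustrative labels only, not SSSD identifiers.
before = {"total": 19415, "sysdb_txn": 7694, "ldb_write": 2958, "ldap_wait": 4548}
after = {"total": 9482, "sysdb_txn": 1074, "ldb_write": 38, "ldap_wait": 4792}


def reduction_pct(old, new):
    """Percentage by which `new` is lower than `old` (negative if it is higher)."""
    return round(100.0 * (old - new) / old, 1)


changes = {key: reduction_pct(before[key], after[key]) for key in before}

for key, pct in changes.items():
    print(f"{key:10s} {before[key]:6d} -> {after[key]:6d} ms  ({pct:+.1f}% reduction)")
```

Total run time roughly halves and the time spent writing to LDB almost disappears, while the LDAP wait is essentially unchanged (slightly higher in the second run), which fits the patches targeting cache writes rather than the LDAP searches.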
That's good to know, and I'm happy to test any patches - just get in
touch directly.
I wanted to report back that after a few days of running with a tmpfs
at /var/lib/sss/db the issue with child processes timing out seems to
have been mostly resolved. Our most heavily loaded machines still report
occasional ping timeouts, but this doesn't seem to impact service:
(Wed May 4 03:32:38 2016) [sssd] [mark_service_as_started] (0x0400): SSSD is initialized, terminating parent process
(Wed May 4 03:32:43 2016) [sssd] [services_startup_timeout] (0x0400): Handling timeout
(Wed May 4 17:15:48 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [pam]. Attempt [0]
(Thu May 5 04:57:28 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]
(Thu May 5 11:17:18 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]
(Thu May 5 15:44:18 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]
(Fri May 6 03:57:48 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]
(Fri May 6 05:17:28 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]
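For anyone who wants to track these over time, a rough sketch of how the ping timeouts could be tallied per service from the monitor log. The regular expression is only inferred from the log lines in this mail, not from any documented SSSD log format, and the function name is made up for the example. It matches on whitespace rather than single spaces so it also copes with the line wrapping that mail clients add.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt in this thread (an assumption, not a
# documented format). \s+ lets a match span a wrapped line.
PING_TIMEOUT_RE = re.compile(
    r"\[ping_check\]\s+\(0x[0-9a-fA-F]+\):\s+A\s+service\s+PING\s+timed\s+out\s+on\s+\[(\w+)\]"
)


def count_ping_timeouts(log_text):
    """Return a Counter mapping service name to number of ping timeouts."""
    return Counter(PING_TIMEOUT_RE.findall(log_text))


# A few entries copied from the excerpt above, wrapped as they appeared.
sample = """\
(Wed May 4 03:32:43 2016) [sssd] [services_startup_timeout] (0x0400):
Handling timeout
(Wed May 4 17:15:48 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [pam]. Attempt [0]
(Thu May 5 04:57:28 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
(Thu May 5 11:17:18 2016) [sssd] [ping_check] (0x0020): A service
PING timed out on [nss]. Attempt [0]
"""

counts = count_ping_timeouts(sample)
for service, n in sorted(counts.items()):
    print(f"{service}: {n} ping timeout(s)")
```

To run it against a real machine, read the monitor log into a string and pass it to count_ping_timeouts(); the log location depends on the installation.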
As far as the actual auth failures go, it's hard to quantify whether
we're now seeing _zero_ problems because they tend to not get
reported, but our metrics are showing substantially fewer failures.
Really appreciate your assistance here, and if you need any further
debugging information just let me know.
Cheers,
Patrick