URL: https://github.com/SSSD/sssd/pull/636 Author: pbrezina Title: #636: failover: tune up default timeouts Action: opened
PR body: """ Resolves: https://pagure.io/SSSD/sssd/issue/3217 """
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/636/head:pr636 git checkout pr636
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
jhrozek commented: """ 18 and 24 seem like /really/ long time outs. Can we use something lower? I can't imagine waiting for 24 seconds until sssd goes offline.. """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-412483702
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
pbrezina commented: """ What would you like to have? This will allow to test 3 servers before timeouting the whole operation. Can we decrease the lowest timeout from 6 to say 3? I think no. """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-412497620
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
pbrezina commented: """ Bump. The patch is trivial we just need to agree on the defaults. """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-478521963
URL: https://github.com/SSSD/sssd/pull/636 Author: pbrezina Title: #636: failover: tune up default timeouts Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/636/head:pr636 git checkout pr636
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
pbrezina commented: """ Rebased on top of current master. """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-478921342
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
jhrozek commented: """ The defaults are the problem. Mostly I don't like that the LDAP operation time out is 24 seconds, but the fail over timeout of 18 seconds is also too long.
If the server was really slow or was dropping (not rejecting) packets, then the fail over would take almost half a minute, that seems just too much.
btw why is the ldap timeout the longest? I thought that the top-level timeout was the failover one which gives you a server? Or is the LDAP timeout used for something like sdap_op which includes the server resolution with failover and then LDAP connection? """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-479855693
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
pbrezina commented: """ Yes.
1. dns_resolver_op_timeout -- timeout for single dns query 2. dns_resolver_timeout -- timeout for service resolution (it may include multiple dns queries) 3. ldap_opt_timeout -- timeout for LDAP connection in sdap_id_op code
The third one is actually used in more contexts but this one is the one that fails the failover prematurely (together with the second one). So perhaps instead of increasing the timeouts we should rework the code so we are able to failover correctly (perhaps do not use 2. and 3. at all). But this is much bigger task.
"""
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-480198226
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
sumit-bose commented: """ Hi,
my suggestion would be to lower the "lower level" timeouts and to do this only for the AD provider for a start.
The reasoning is that imo we can assume that an AD environment is "sufficiently well maintained" that there should be no need to wait more then 1s for a reply from a DNS server.
RESOLV_TIMEOUTMS which is currently hardcode to 2000 ms (2s) is actually the time SSSD will wait for a reply from a single DNS server. After that the next DNS server in the list is tried. It would make sense to make this configurable and lower it for the AD provider to 1000 (1s) or even a bit lower.
If we say we want at least have the chance to try 3 DNS servers, dns_resolver_op_timeout can be 3s. (Btw the man page entry for dns_resolver_op_timeout has to be fixed because is says "How long would SSSD talk to a single DNS server." but it is as you said above "timeout for single dns query").
I would try to avoid touching ldap_opt_timeout since, as you said, it is used in various places. Maybe a dedicated failover timeout can be added for sdap_id_op.
@jhrozek mentioned that it might be possible to keep the c-ares state around so that when there was a timeout talking to one DNS server the state with the next server which replied in time is used for upcoming request well. Currently a new DNS request will start with the same server which showed the timeout, so for sequences like looking up a SRV record and then resolve a single host from the returned list we have to wait twice for the first DNS server not replying. But since this is an improvement which is not strictly related to the default timeout values this can of course be solve in a different ticket/PR.
HTH
bye, Sumit """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-490971830
URL: https://github.com/SSSD/sssd/pull/636 Author: pbrezina Title: #636: failover: tune up default timeouts Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/636/head:pr636 git checkout pr636
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
pbrezina commented: """ Ok, how about now? It adds new option `dns_resolver_server_timeout (ms)` that replaces hardcoded `RESOLV_TIMEOUTMS`. Current defaults are:
* `dns_resolver_server_timeout`: 1000ms * `dns_resolver_op_timeout`: 2s * `dns_resolver_timeout`: 4s * `ldap_opt_timeout`: 8s
I had to increase the `ldap_opt_timeout` for the values to make any sense, otherwise it would not be able to try to resolve next hostname/discovery domain if `dns_resolver_timout` fails.
`fo_resolve_service_send() (dns_resolver_timeout)` first resolvse SRV if needed, then resolvs hostname (both `dns_resolver_op_timeout`). So I think if we could remember the responsive dns server in resolver state, we can set `dns_resolver_op_timeout = dns_resolver_timeout` and then set it to say three seconds, keeping `ldap_opt_timeout = 6`. """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-500817513
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
jhrozek commented: """ I will run some tests before pushing, though. """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-508384063
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
Label: +Changes requested
URL: https://github.com/SSSD/sssd/pull/636 Author: pbrezina Title: #636: failover: tune up default timeouts Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/636/head:pr636 git checkout pr636
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
jhrozek commented: """ I'm sorry, but now the tests don't work: ``` In file included from /var/lib/jenkins/workspace/ci/label/rhel7/src/providers/fail_over_srv.h:27:0, from /var/lib/jenkins/workspace/ci/label/rhel7/src/tests/cmocka/test_fo_srv.c:33: /var/lib/jenkins/workspace/ci/label/rhel7/src/resolv/async_resolv.h:54:5: note: previous declaration of 'resolv_init' was here int resolv_init(TALLOC_CTX *mem_ctx, struct tevent_context *ev_ctx, ``` """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-516357561
URL: https://github.com/SSSD/sssd/pull/636 Author: pbrezina Title: #636: failover: tune up default timeouts Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/636/head:pr636 git checkout pr636
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
pbrezina commented: """ It should be fixed now. """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-516762783
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
Label: -Changes requested
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
jhrozek commented: """ F-30 failed CI and there are no logs. But I don't see anything OS-specific in the patches and at the same time all my concerns were addressed.
Thank you. ACK. """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-516866347
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
Label: +Accepted
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
jhrozek commented: """ * master: * 049f3906b9ef2041b5e1df666bd570379ae60718 * e97ff0adb62c89cfc7e75858b7e592e0303720b0 * 99e2a107f01c625cb59cb88589db87294176d6c6 * 3807de1d97fc87cf7c25af264a8b1bbabdef54e2 * 7b4635c8428917ced63954f2c3c70491b45d7870 """
See the full comment at https://github.com/SSSD/sssd/pull/636#issuecomment-516998725
URL: https://github.com/SSSD/sssd/pull/636 Author: pbrezina Title: #636: failover: tune up default timeouts Action: closed
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/636/head:pr636 git checkout pr636
URL: https://github.com/SSSD/sssd/pull/636 Title: #636: failover: tune up default timeouts
Label: +Pushed
sssd-devel@lists.fedorahosted.org