On 08/09/2013 04:33 PM, Michal Židek wrote:
>On 08/09/2013 11:51 AM, Jakub Hrozek wrote:
>>On Wed, Jul 24, 2013 at 02:00:38PM +0200, Jakub Hrozek wrote:
>>>On Tue, Jul 23, 2013 at 06:57:08PM +0200, Jakub Hrozek wrote:
>>>>On Mon, Jul 22, 2013 at 06:06:39PM +0200, Michal Židek wrote:
>>>>>On 07/19/2013 12:15 PM, Jakub Hrozek wrote:
>>>>>>On Mon, Jul 08, 2013 at 09:43:42PM +0200, Michal Židek wrote:
>>>>>>>On 07/04/2013 07:45 PM, Simo Sorce wrote:
>>>>>>>>On Thu, 2013-07-04 at 16:23 +0200, Michal Židek wrote:
>>>>>>>>>https://fedorahosted.org/sssd/ticket/1966 (SSSD failover
>>>>>>>>>doesn't work if the first DNS server in resolv.conf is
>>>>>>>>>unavailable).
>>>>>>>>>
>>>>>>>>>The problem here is that if the first nameserver in resolv.conf
>>>>>>>>>is down, the resolution is too slow, so SSSD does not wait for
>>>>>>>>>the result of ares_search() and goes offline. In my case the
>>>>>>>>>resolution sometimes took more than a minute, because all search
>>>>>>>>>domains in resolv.conf were searched inside the ares_search()
>>>>>>>>>call, first using the first (not working) nameserver and then
>>>>>>>>>the working one (and before that, SSSD tried to figure out the
>>>>>>>>>domain name from my incorrectly set hostname, which added more
>>>>>>>>>unnecessary DNS lookups).
>>>>>>>>>
>>>>>>>>>To avoid this problem, the option dns_discovery_domain must be
>>>>>>>>>set properly, so that only the correct domain is searched. But
>>>>>>>>>even that is not enough, because the default timeout for a DNS
>>>>>>>>>resolver operation in SSSD is too low. This patch raises the
>>>>>>>>>default value to 15 seconds (instead of 5 seconds).
>>>>>>>>>
>>>>>>>>>Another option might be to lower the amount of time ares waits
>>>>>>>>>for a nameserver to respond (currently it is 5 seconds, which is
>>>>>>>>>why 5 seconds for the entire DNS resolution is not sufficient),
>>>>>>>>>but I do not want to do this.
>>>>>>>>
>>>>>>>>Why not ?
>>>>>>>>
>>>>>>>>5 seconds these days is quite awfully high all in all.
>>>>>>>>I think 1 second should be plenty in most cases, so maybe a
>>>>>>>>default of 2 seconds should be considered.
>>>>>>>>
>>>>>>>
>>>>>>>I also think that 5 seconds is a high value, but in general I have
>>>>>>>a bad feeling about lowering non-configurable timeouts
>>>>>>>(theoretically it could cause some DNS servers that are reachable
>>>>>>>now to become unreachable).
>>>>>>>
>>>>>>
>>>>>>Right, but in practice I can't think of a valid network
>>>>>>configuration where the name server would take so long to reply.
>>>>>>I mean, if the network was so slow that just resolving a name would
>>>>>>take 5 seconds, then chances are we wouldn't be able to download
>>>>>>the entry from LDAP at all. And if the nameserver is slow, then
>>>>>>let's rather skip it and fail over to another one.
>>>>>>
>>>>>>>Another possibility would be to make this value configurable. Then
>>>>>>>I would have no problem lowering the default value. But should
>>>>>>>this be configurable? We already have one option to configure the
>>>>>>>timeout for the entire DNS resolution; adding another timeout for
>>>>>>>subtasks of this operation does not sound nice to me. But I do not
>>>>>>>have a strong opinion on this.
>>>>>>>
>>>>>>
>>>>>>Here is what we discussed with Michal today:
>>>>>> * lower the per-nameserver timeout a little. Now it's 5 seconds,
>>>>>>   which is too long, especially considering the failover timeout
>>>>>>   is 5 seconds as well. We could change it to 2 or 3 seconds.
>>>>>> * glibc allows 3 nameservers. The failover request timeout could
>>>>>>   then default to 3*nameserver_timeout to give the resolver a
>>>>>>   chance to iterate over all the configured name servers.
>>>>>>
>>>>>>>Btw, once the correct data is returned from the DNS servers, we no
>>>>>>>longer require any DNS operation, so it only happens the first time
>>>>>>>the service is required (for example, the first 'getent' call will
>>>>>>>be slow, but after that the service is fast). This is not so bad
>>>>>>>IMO.
>>>>>>>
>>>>>>>>>These patches also change man pages, so probably master only
>>>>>>>>>(string freeze)? Even if this is a really small change.
>>>>>>
>>>>>>We accept changes again now that we've released 1.10.1.
>>>>>>
>>>>>>But this change would be OK even for a string-frozen release,
>>>>>>because it
>>>>>>only changes the number, not translatable strings.
>>>>>>
>>>>>>>>>
>>>>>>>>>I was also thinking, would it make sense to write a warning to
>>>>>>>>>the logs if the dns_discovery_domain option is not set? It seems
>>>>>>>>>to be important to set it properly for cases like this one.
>>>>>>>>
>>>>>>>>I wonder, should we use the machine name to automatically set the
>>>>>>>>discovery domain, and only if that fails fall back to the ones
>>>>>>>>defined in resolv.conf?
>>>>>>>
>>>>>>>Actually, we do this. Wireshark shows that the first SRV lookup was
>>>>>>>constructed from my hostname. But the whole process is unclear to
>>>>>>>me; for example, I do not know why we call
>>>>>>>resolv_gethostbyname_send() inside of resolv_get_domain_send(). It
>>>>>>>adds more DNS lookups for A records (which causes another slowdown
>>>>>>>when the bad nameserver is tried first). In Wireshark I also see
>>>>>>>searches for A/AAAA records for names like:
>>>>>>>mzidek.example.com
>>>>>>>mzidek.example.com.searchdom1.com
>>>>>>>mzidek.example.com.searchdom2.com
>>>>>>>mzidek.example.com.searchdom3.com
>>>>>>
>>>>>>IIRC we try to canonicalize the host name before chopping off the
>>>>>>first part and using the rest as the domain name.
>>>>>>
>>>>>>Would using gethostname() be safe in real world environments?
>>>>>>I know that host names are often set to complete garbage and can't
>>>>>>be relied on..but is it OK to just say "fix your environment" in
>>>>>>that case?
>>>>>>
>>>>>>Personally I would argue the current canonicalization doesn't add
>>>>>>much, because if the host name is set to complete garbage then the
>>>>>>canonicalization wouldn't magically fix it..
>>>>>>
>>>>>>>
>>>>>>>where mzidek.example.com is my hostname and searchdom1/2/3 are
>>>>>>>search domains specified in resolv.conf. Why do we even try to
>>>>>>>resolve these names? It seems like a waste of time to me. If
>>>>>>>dns_discovery_domain is properly set, these lookups do not appear.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Michal
>>>>>>>
>>>>>>>PS: Patches are the same as before.
>>>>>>>
>>>>>>>>
>>>>>>>>Simo.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>New patches attached. As discussed above, I changed the timeout to
>>>>>contact a DNS server from 5s to 2s, and the timeout for the entire
>>>>>DNS resolution from 5 to 6 seconds (instead of 15s as proposed
>>>>>originally), so it has enough time to contact 3 nameservers.
>>>>>
>>>>>I tested it with one and two unreachable nameservers.
>>>>>
>>>>>Thanks
>>>>>Michal
>>>>
>>>>Works for me, too, so I'll ACK the patches. But I'll leave them on
>>>>the list for a day longer in case somebody else would like to voice
>>>>their concern about changing the default timeout value. I personally
>>>>think it's OK.
>>>
>>>Pushed to master and sssd-1-10
>>
>>Hi,
>>
>>attached is a 1.9 backport of the patches. Michal, do you think we need
>>to lower the FO_DEFAULT_SVC_TIMEOUT as well? I guess so, but you know
>>the details better..
>>
>
>
>FO_DEFAULT_SVC_TIMEOUT is not used anywhere. It is actually removed in
>the two minor fixes you posted earlier.
>
>Ack to both patches.
>
>Michal
>
Just remembered. If you backport these patches, we may consider
backporting d910b4ebaa8e68bd24ad22b8f65aed7c6812aaf1 as well.
Backported patch is attached.
Thanks
Michal
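
For reference, the timeout arithmetic agreed on in this thread can be
sketched as below. This is only an illustrative model, not SSSD or
c-ares code: the function names are made up for this example, and it
assumes that every unreachable nameserver costs exactly one
per-nameserver timeout before the resolver moves on to the next one
(retries and search-domain expansion are ignored).

```python
# Illustrative model of the timeouts discussed above; not SSSD code.
# All names here are hypothetical and exist only for this sketch.

GLIBC_MAX_NAMESERVERS = 3  # glibc honours at most 3 nameserver entries


def worst_case_wait(per_server_timeout, unreachable_servers):
    """Seconds spent on dead nameservers before a live one is asked.

    Assumes each dead server costs exactly one per-server timeout.
    """
    return per_server_timeout * unreachable_servers


def fits_in_budget(per_server_timeout, total_timeout, unreachable_servers):
    """True if a working server can still be asked within the total timeout."""
    wait = worst_case_wait(per_server_timeout, unreachable_servers)
    return wait < total_timeout


def default_total_timeout(per_server_timeout,
                          nameservers=GLIBC_MAX_NAMESERVERS):
    """Total timeout that lets the resolver try every configured server."""
    return per_server_timeout * nameservers


if __name__ == "__main__":
    # Old defaults: 5s per server, 5s total, first server down.
    print(fits_in_budget(5, 5, 1))
    # New defaults: 2s per server, 6s total, one or two servers down.
    print(fits_in_budget(2, 6, 1), fits_in_budget(2, 6, 2))
    print(default_total_timeout(2))
```

Under this simplified model the old defaults leave no time at all for
the second nameserver once the first one is down, while 2s per server
with a 6s total still leaves room when two of the three are
unreachable, which matches the cases tested above.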