On Wed, 2011-09-14 at 13:33 -0400, Dmitri Pal wrote:
On 09/14/2011 01:22 PM, Stephen Gallagher wrote:
> We were incorrectly using DBUS_ERROR_TIMEOUT here. The correct
> behaviour is to check for DBUS_ERROR_NO_REPLY. This way we will
> properly handle the three-tries in the tasks_check_handler().
>
> D-BUS is rather confusing with these error codes.
> DBUS_ERROR_NO_REPLY: No reply to a message expecting one, usually means
> a timeout occurred.
>
> DBUS_ERROR_TIMEOUT: Certain timeout errors, possibly ETIMEDOUT on a
> socket.
>
> And just for added confusion, there's also:
> DBUS_ERROR_TIMED_OUT: Certain timeout errors, e.g. while starting a
> service.
>
> DBUS_ERROR_NO_REPLY is the only correct one for our usage. This explains
> the intermittent bug we were seeing where the monitor lost communication
> with its services (usually the data providers). Because of this loss of
> communication, the monitor was unable to notify the providers of changes
> to the routing table or resolv.conf, leading to being stuck offline
> until SSSD was restarted.
Is it safe to use strcmp in this case? Though the string is returned
by the library in case of a bug and bad string being returned we would
crash.
Also does the call to get the string always guarantee a not NULL
result?
The way we were using it, it would have needed to be a bug in libdbus to
get back something with NULL here, but it never hurts to be cautious.
Fixed.
Also, I discovered that the way we were counting failures was incorrect
and was resetting itself before restarting services. I've fixed that as
well in the new patch.