On Wed, 2011-09-14 at 13:33 -0400, Dmitri Pal wrote:
> On 09/14/2011 01:22 PM, Stephen Gallagher wrote:
> > We were incorrectly using DBUS_ERROR_TIMEOUT here. The correct
> > behaviour is to check for DBUS_ERROR_NO_REPLY. This way we will
> > properly handle the three-tries in the tasks_check_handler().
> >
> > D-BUS is rather confusing with these error codes.
> > DBUS_ERROR_NO_REPLY: No reply to a message expecting one, usually means
> > a timeout occurred.
> >
> > DBUS_ERROR_TIMEOUT: Certain timeout errors, possibly ETIMEDOUT on a
> > socket.
> >
> > And just for added confusion, there's also:
> > DBUS_ERROR_TIMED_OUT: Certain timeout errors, e.g. while starting a
> > service.
> >
> > DBUS_ERROR_NO_REPLY is the only correct one for our usage. This
> > explains the intermittent bug we were seeing where the monitor lost
> > communication with its services (usually the data providers). Because
> > of this loss of communication, the monitor was unable to notify the
> > providers of changes to the routing table or resolv.conf, leading to
> > being stuck offline until SSSD was restarted.
>
> Is it safe to use strcmp in this case? Though the string is returned
> by the library in case of a bug and bad string being returned we would
> crash.
> Also does the call to get the string always guarantee a not NULL
> result?
The way we were using it, it would have needed to be a bug in libdbus to
get back something with NULL here, but it never hurts to be cautious.
Fixed.
Also, I discovered that the way we were counting failures was incorrect
and was resetting itself before restarting services. I've fixed that as
well in the new patch.
The patch looks good and everything works. ACK
Jan