On Thu, 2011-09-15 at 17:16 +0200, Jan Zelený wrote:
> On Wed, 2011-09-14 at 13:33 -0400, Dmitri Pal wrote:
> > On 09/14/2011 01:22 PM, Stephen Gallagher wrote:
> > > We were incorrectly using DBUS_ERROR_TIMEOUT here. The correct
> > > behaviour is to check for DBUS_ERROR_NO_REPLY. This way we will
> > > properly handle the three-tries in the tasks_check_handler().
> > >
> > > D-BUS is rather confusing with these error codes.
> > > DBUS_ERROR_NO_REPLY: No reply to a message expecting one, usually means
> > > a timeout occurred.
> > >
> > > DBUS_ERROR_TIMEOUT: Certain timeout errors, possibly ETIMEDOUT on a
> > > socket.
> > >
> > > And just for added confusion, there's also:
> > > DBUS_ERROR_TIMED_OUT: Certain timeout errors, e.g. while starting a
> > > service.
> > >
> > > DBUS_ERROR_NO_REPLY is the only correct one for our usage. This
> > > explains the intermittent bug we were seeing where the monitor lost
> > > communication with its services (usually the data providers). Because
> > > of this loss of communication, the monitor was unable to notify the
> > > providers of changes to the routing table or resolv.conf, leading to
> > > being stuck offline until SSSD was restarted.
> >
> > Is it safe to use strcmp in this case? Though the string is returned
> > by the library in case of a bug and bad string being returned we would
> > crash.
> > Also does the call to get the string always guarantee a not NULL
> > result?
>
> The way we were using it, it would have needed to be a bug in libdbus to
> get back something with NULL here, but it never hurts to be cautious.
> Fixed.
>
> Also, I discovered that the way we were counting failures was incorrect
> and was resetting itself before restarting services. I've fixed that as
> well in the new patch.
The patch looks good and everything works. ACK
Jan
I fixed up the DEBUG levels and pushed this patch to master, sssd-1-6
and sssd-1-5.