On Tue, Jan 06, 2015 at 02:10:44PM +0100, Pavel Březina wrote:
On 12/10/2014 09:16 PM, Stephen Gallagher wrote:
>
>On Wed, 2014-12-10 at 14:59 -0500, Stephen Gallagher wrote:
>>There are actually two bugs here:
>>
>>1) When either the kill(SIGTERM) or the kill(SIGKILL) call failed
>>(for any reason), we would talloc_free(svc), which made the service
>>ineligible for restart. As a result, the service never started
>>again without an SSSD service restart.
>>
>>2) There is a fairly wide race window in which a SIGKILL timer can
>>"catch up" to the child exit handler between the moment we notice
>>the termination and the moment we actually restart the service.
>>The race exists because we re-enter the mainloop and add a restart
>>timeout, which avoids failing quickly when we keep restarting due
>>to a transitory issue. The mt_svc object, and therefore the SIGKILL
>>timer, was never freed until we reached the actual service
>>restart.
>>
>>We can minimize this race by recording the timer_event for the
>>SIGKILL timeout in the mt_svc object. This way, if the process
>>exits via SIGTERM, we will immediately remove the timer for the
>>SIGKILL.
>>
>>This patch also removes the incorrect talloc_free(svc) calls on the
>>kill() failures and replaces them with an attempt to just start up
>>the service again and hope for the best.
>>
>>Resolves:
>>https://fedorahosted.org/sssd/ticket/2525
>
>
>Just after sending this, I noticed another enhancement I could make to
>basically eliminate the potential race condition. Updated patch
>attached.
Ack. Thank you, Stephen.
* master: 152251b13a99c88054055d46600e0478c4f7bd05
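
For anyone reading this thread in the archive, here is a minimal sketch of
the pattern the commit message describes: remember the SIGKILL timer on the
service object, cancel it as soon as the child exits, and keep the service
object alive when kill() fails so that a restart is still possible. This is
not the code from the patch. Every name below (model_svc, model_timer and
the model_* functions) is hypothetical, and a plain heap allocation stands
in for the real tevent timer and talloc context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a timer event; NOT the real tevent API. */
struct model_timer {
    bool armed;
};

/* Hypothetical, heavily simplified stand-in for the monitor's mt_svc. */
struct model_svc {
    struct model_timer *kill_timer; /* pending SIGKILL escalation, or NULL */
    bool restart_pending;           /* a restart has been scheduled */
    int restarts;                   /* number of completed restarts */
};

/* Drop any pending SIGKILL timer and forget the reference to it. */
static void model_cancel_kill_timer(struct model_svc *svc)
{
    free(svc->kill_timer);
    svc->kill_timer = NULL;
}

/* After SIGTERM was sent: arm the SIGKILL timer and record it on svc. */
static int model_arm_kill_timer(struct model_svc *svc)
{
    assert(svc != NULL);
    if (svc->kill_timer != NULL) {
        return 0; /* already armed, nothing to do */
    }
    svc->kill_timer = calloc(1, sizeof(*svc->kill_timer));
    if (svc->kill_timer == NULL) {
        return -1;
    }
    svc->kill_timer->armed = true;
    return 0;
}

/* Child exit handler: cancel the timer right away, then schedule restart. */
static void model_on_child_exit(struct model_svc *svc)
{
    model_cancel_kill_timer(svc);
    svc->restart_pending = true;
}

/* kill() failed: do NOT free svc; just try to start the service again. */
static void model_on_kill_failure(struct model_svc *svc)
{
    model_cancel_kill_timer(svc);
    svc->restart_pending = true;
}

/* Timer callback: returns true only if a SIGKILL would actually be sent. */
static bool model_kill_timer_fires(struct model_svc *svc)
{
    if (svc->kill_timer == NULL || !svc->kill_timer->armed) {
        return false; /* timer was cancelled before it could fire */
    }
    model_cancel_kill_timer(svc);
    return true;
}

/* Delayed restart, reached after re-entering the main loop. */
static void model_restart(struct model_svc *svc)
{
    if (svc->restart_pending) {
        svc->restart_pending = false;
        svc->restarts++;
    }
}
```

In this model, once model_on_child_exit() has run, a late call to
model_kill_timer_fires() is a no-op, so the restarted service cannot be hit
by a stale SIGKILL. How the real monitor wires this into tevent and talloc
is best checked against the commit itself.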