Base on the second proposal:
https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld
There is some old SIGCHLD handling code in src/providers/child_common.[ch], that should probably go away if this gets accepted. There was also a naming conflict with the sss_child_ctx structure. This structure is only used internally by functions defined src/providers/child_common.c. I renamed the original structure to sss_child_ctx_old for now.
Pavel
On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote:
Base on the second proposal:
https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld
There is some old SIGCHLD handling code in src/providers/child_common.[ch], that should probably go away if this gets accepted. There was also a naming conflict with the sss_child_ctx structure. This structure is only used internally by functions defined src/providers/child_common.c. I renamed the original structure to sss_child_ctx_old for now.
Nack (but a good start)
sss_child_destructor() cannot return the result of hash_delete(). Talloc destructors may only return 0 on success or -1 on failure. However, because destructor failure results in cancelling the free(), we should just log an error here and return 0.
sss_child_register: Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want to leave child_ctx untouched until successful completion. Also, please don't use the variable name "tmp". It's not descriptive. Call it "child".
Instead of allocating *child_ctx at the beginning, allocate "child" on the mem_ctx.
child = talloc_zero(mem_ctx, struct sss_child_ctx);
Then, only at the end of execution, right before returning EOK, do *child_ctx = child;
sss_child_invoke_cb(): I'm wary of having the callback invocation free the child_ctx. It needs to be documented well (so that consumers know they must not free it once the callback has fired or they'll get a double-free).
sss_child_handler(): You need to check whether waitpid() returned an error and handle that appropriately. waitpid() will return -1 on error and set errno. If the errno value is EINTR, you should retry (because waitpid() was interrupted by a signal at the time). Other errors should generate a DEBUG message.
Please separate any changes you make to the existing sss_child_ctx into its own patch. I don't like just renaming it to _old, either. Please find a more appropriate way to replace this old object (preferably by converting it to use your new mechanism, also in a new patch).
On 10/21/2011 06:44 PM, Stephen Gallagher wrote:
On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote:
Base on the second proposal:
https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld
There is some old SIGCHLD handling code in src/providers/child_common.[ch], that should probably go away if this gets accepted. There was also a naming conflict with the sss_child_ctx structure. This structure is only used internally by functions defined src/providers/child_common.c. I renamed the original structure to sss_child_ctx_old for now.
Nack (but a good start)
sss_child_destructor() cannot return the result of hash_delete(). Talloc destructors may only return 0 on success or -1 on failure. However, because destructor failure results in cancelling the free(), we should just log an error here and return 0.
sss_child_register: Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want to leave child_ctx untouched until successful completion. Also, please don't use the variable name "tmp". It's not descriptive. Call it "child".
Instead of allocating *child_ctx at the beginning, allocate "child" on the mem_ctx.
child = talloc_zero(mem_ctx, struct sss_child_ctx);
Then, only at the end of execution, right before returning EOK, do *child_ctx = child;
sss_child_invoke_cb(): I'm wary of having the callback invocation free the child_ctx. It needs to be documented well (so that consumers know they must not free it once the callback has fired or they'll get a double-free).
sss_child_handler(): You need to check whether waitpid() returned an error and handle that appropriately. waitpid() will return -1 on error and set errno. If the errno value is EINTR, you should retry (because waitpid() was interrupted by a signal at the time). Other errors should generate a DEBUG message.
Ok, I'm done with all of this. It was pretty straight forward, but...
Please separate any changes you make to the existing sss_child_ctx into its own patch. I don't like just renaming it to _old, either. Please find a more appropriate way to replace this old object (preferably by converting it to use your new mechanism, also in a new patch).
... this, I'm trying to find a way to make the old code use the new mechanism. I think it would be better to remove the old one completely and replace it with the new one everywhere where it's used.
Pavel
On 10/25/2011 01:08 PM, Pavel Zuna wrote:
On 10/21/2011 06:44 PM, Stephen Gallagher wrote:
On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote:
Base on the second proposal:
https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld
There is some old SIGCHLD handling code in src/providers/child_common.[ch], that should probably go away if this gets accepted. There was also a naming conflict with the sss_child_ctx structure. This structure is only used internally by functions defined src/providers/child_common.c. I renamed the original structure to sss_child_ctx_old for now.
Nack (but a good start)
sss_child_destructor() cannot return the result of hash_delete(). Talloc destructors may only return 0 on success or -1 on failure. However, because destructor failure results in cancelling the free(), we should just log an error here and return 0.
sss_child_register: Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want to leave child_ctx untouched until successful completion. Also, please don't use the variable name "tmp". It's not descriptive. Call it "child".
Instead of allocating *child_ctx at the beginning, allocate "child" on the mem_ctx.
child = talloc_zero(mem_ctx, struct sss_child_ctx);
Then, only at the end of execution, right before returning EOK, do *child_ctx = child;
sss_child_invoke_cb(): I'm wary of having the callback invocation free the child_ctx. It needs to be documented well (so that consumers know they must not free it once the callback has fired or they'll get a double-free).
sss_child_handler(): You need to check whether waitpid() returned an error and handle that appropriately. waitpid() will return -1 on error and set errno. If the errno value is EINTR, you should retry (because waitpid() was interrupted by a signal at the time). Other errors should generate a DEBUG message.
Ok, I'm done with all of this. It was pretty straight forward, but...
Please separate any changes you make to the existing sss_child_ctx into its own patch. I don't like just renaming it to _old, either. Please find a more appropriate way to replace this old object (preferably by converting it to use your new mechanism, also in a new patch).
... this, I'm trying to find a way to make the old code use the new mechanism. I think it would be better to remove the old one completely and replace it with the new one everywhere where it's used.
Pavel
New patch with the above mentioned issues fixed is attached to this email.
We agreed that we're going to replace all uses of the old system by the new. Expect about 3 patches to follow on this one - one for each provider using the old system.
Pavel
On Thu, Oct 27, 2011 at 02:03:25PM +0200, Pavel Zuna wrote:
On 10/25/2011 01:08 PM, Pavel Zuna wrote:
On 10/21/2011 06:44 PM, Stephen Gallagher wrote:
On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote:
Base on the second proposal:
https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld
There is some old SIGCHLD handling code in src/providers/child_common.[ch], that should probably go away if this gets accepted. There was also a naming conflict with the sss_child_ctx structure. This structure is only used internally by functions defined src/providers/child_common.c. I renamed the original structure to sss_child_ctx_old for now.
Nack (but a good start)
sss_child_destructor() cannot return the result of hash_delete(). Talloc destructors may only return 0 on success or -1 on failure. However, because destructor failure results in cancelling the free(), we should just log an error here and return 0.
sss_child_register: Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want to leave child_ctx untouched until successful completion. Also, please don't use the variable name "tmp". It's not descriptive. Call it "child".
Instead of allocating *child_ctx at the beginning, allocate "child" on the mem_ctx.
child = talloc_zero(mem_ctx, struct sss_child_ctx);
Then, only at the end of execution, right before returning EOK, do *child_ctx = child;
In sss_child_register():
+ error = hash_enter(sigchld_ctx->children, &key, &value); + if (error != HASH_SUCCESS && error != HASH_ERROR_KEY_NOT_FOUND) {
I don't think hash_enter can return HASH_ERROR_KEY_NOT_FOUND (what would it mean?) and even if it did, why ignore it?
sss_child_invoke_cb(): I'm wary of having the callback invocation free the child_ctx. It needs to be documented well (so that consumers know they must not free it once the callback has fired or they'll get a double-free).
This still needs fixing.
sss_child_handler(): You need to check whether waitpid() returned an error and handle that appropriately. waitpid() will return -1 on error and set errno. If the errno value is EINTR, you should retry (because waitpid() was interrupted by a signal at the time). Other errors should generate a DEBUG message.
I think you misunderstood how the error handling should be done:
+ if (pid == -1) { + if (errno == ECHILD) { + DEBUG(0, ("waitpid failed with ECHILD\n")); + } else if (errno == EINVAL) { + DEBUG(0, ("waitpid failed with EINVAL\n")); + } + return; + }
Instead of having a DEBUG() call per retval, it would be better to inform user on why the child process exited, similarly to how child_sig_handler() does it now.
Ok, I'm done with all of this. It was pretty straight forward, but...
Please separate any changes you make to the existing sss_child_ctx into its own patch. I don't like just renaming it to _old, either. Please find a more appropriate way to replace this old object (preferably by converting it to use your new mechanism, also in a new patch).
... this, I'm trying to find a way to make the old code use the new mechanism. I think it would be better to remove the old one completely and replace it with the new one everywhere where it's used.
Pavel
New patch with the above mentioned issues fixed is attached to this email.
We agreed that we're going to replace all uses of the old system by the new. Expect about 3 patches to follow on this one - one for each provider using the old system.
Pavel
On 10/31/2011 04:29 PM, Jakub Hrozek wrote:
On Thu, Oct 27, 2011 at 02:03:25PM +0200, Pavel Zuna wrote:
On 10/25/2011 01:08 PM, Pavel Zuna wrote:
On 10/21/2011 06:44 PM, Stephen Gallagher wrote:
On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote:
Base on the second proposal:
https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld
There is some old SIGCHLD handling code in src/providers/child_common.[ch], that should probably go away if this gets accepted. There was also a naming conflict with the sss_child_ctx structure. This structure is only used internally by functions defined src/providers/child_common.c. I renamed the original structure to sss_child_ctx_old for now.
Nack (but a good start)
sss_child_destructor() cannot return the result of hash_delete(). Talloc destructors may only return 0 on success or -1 on failure. However, because destructor failure results in cancelling the free(), we should just log an error here and return 0.
sss_child_register: Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want to leave child_ctx untouched until successful completion. Also, please don't use the variable name "tmp". It's not descriptive. Call it "child".
Instead of allocating *child_ctx at the beginning, allocate "child" on the mem_ctx.
child = talloc_zero(mem_ctx, struct sss_child_ctx);
Then, only at the end of execution, right before returning EOK, do *child_ctx = child;
In sss_child_register():
- error = hash_enter(sigchld_ctx->children,&key,&value);
- if (error != HASH_SUCCESS&& error != HASH_ERROR_KEY_NOT_FOUND) {
I don't think hash_enter can return HASH_ERROR_KEY_NOT_FOUND (what would it mean?) and even if it did, why ignore it?
sss_child_invoke_cb(): I'm wary of having the callback invocation free the child_ctx. It needs to be documented well (so that consumers know they must not free it once the callback has fired or they'll get a double-free).
This still needs fixing.
Forgot about that, sorry. Replaced the talloc_free with removing child_ctx from the hash table only.
sss_child_handler(): You need to check whether waitpid() returned an error and handle that appropriately. waitpid() will return -1 on error and set errno. If the errno value is EINTR, you should retry (because waitpid() was interrupted by a signal at the time). Other errors should generate a DEBUG message.
I think you misunderstood how the error handling should be done:
if (pid == -1) {
if (errno == ECHILD) {
DEBUG(0, ("waitpid failed with ECHILD\n"));
} else if (errno == EINVAL) {
DEBUG(0, ("waitpid failed with EINVAL\n"));
}
return;
}
Instead of having a DEBUG() call per retval, it would be better to inform user on why the child process exited, similarly to how child_sig_handler() does it now.
I don't agree. The system is designed to inform providers that a child has exited by invoking their callbacks. It's up to the provider to decide what to do with the result and inform the user if appropriate. This would also make sense only in the case of a successful call (pid >= 0).
In case of EINTR, we need to repeat the call to waitpid, because it was interrupted by a signal. In other cases (ECHILD and EINVAL) the call is going to always fail, so there's won't be any callbacks to invoke.
Ok, I'm done with all of this. It was pretty straight forward, but...
Please separate any changes you make to the existing sss_child_ctx into its own patch. I don't like just renaming it to _old, either. Please find a more appropriate way to replace this old object (preferably by converting it to use your new mechanism, also in a new patch).
... this, I'm trying to find a way to make the old code use the new mechanism. I think it would be better to remove the old one completely and replace it with the new one everywhere where it's used.
Pavel
New patch with the above mentioned issues fixed is attached to this email.
We agreed that we're going to replace all uses of the old system by the new. Expect about 3 patches to follow on this one - one for each provider using the old system.
Pavel
Thanks for taking the time to review this. New patch attached.
Pavel
On Tue, Nov 01, 2011 at 11:34:27AM +0100, Pavel Zuna wrote:
On 10/31/2011 04:29 PM, Jakub Hrozek wrote:
On Thu, Oct 27, 2011 at 02:03:25PM +0200, Pavel Zuna wrote:
On 10/25/2011 01:08 PM, Pavel Zuna wrote:
On 10/21/2011 06:44 PM, Stephen Gallagher wrote:
On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote:
Base on the second proposal:
https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld
There is some old SIGCHLD handling code in src/providers/child_common.[ch], that should probably go away if this gets accepted. There was also a naming conflict with the sss_child_ctx structure. This structure is only used internally by functions defined src/providers/child_common.c. I renamed the original structure to sss_child_ctx_old for now.
Nack (but a good start)
sss_child_destructor() cannot return the result of hash_delete(). Talloc destructors may only return 0 on success or -1 on failure. However, because destructor failure results in cancelling the free(), we should just log an error here and return 0.
sss_child_register: Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want to leave child_ctx untouched until successful completion. Also, please don't use the variable name "tmp". It's not descriptive. Call it "child".
Instead of allocating *child_ctx at the beginning, allocate "child" on the mem_ctx.
child = talloc_zero(mem_ctx, struct sss_child_ctx);
Then, only at the end of execution, right before returning EOK, do *child_ctx = child;
In sss_child_register():
- error = hash_enter(sigchld_ctx->children,&key,&value);
- if (error != HASH_SUCCESS&& error != HASH_ERROR_KEY_NOT_FOUND) {
I don't think hash_enter can return HASH_ERROR_KEY_NOT_FOUND (what would it mean?) and even if it did, why ignore it?
sss_child_invoke_cb(): I'm wary of having the callback invocation free the child_ctx. It needs to be documented well (so that consumers know they must not free it once the callback has fired or they'll get a double-free).
This still needs fixing.
Forgot about that, sorry. Replaced the talloc_free with removing child_ctx from the hash table only.
sss_child_handler(): You need to check whether waitpid() returned an error and handle that appropriately. waitpid() will return -1 on error and set errno. If the errno value is EINTR, you should retry (because waitpid() was interrupted by a signal at the time). Other errors should generate a DEBUG message.
I think you misunderstood how the error handling should be done:
if (pid == -1) {
if (errno == ECHILD) {
DEBUG(0, ("waitpid failed with ECHILD\n"));
} else if (errno == EINVAL) {
DEBUG(0, ("waitpid failed with EINVAL\n"));
}
return;
}
Instead of having a DEBUG() call per retval, it would be better to inform user on why the child process exited, similarly to how child_sig_handler() does it now.
I don't agree. The system is designed to inform providers that a child has exited by invoking their callbacks. It's up to the provider to decide what to do with the result and inform the user if appropriate. This would also make sense only in the case of a successful call (pid >= 0).
OK, then why the separate branches for ECHILD and EINVAL? Isn't it easier to just print errno and strerror(errno) ?
In case of EINTR, we need to repeat the call to waitpid, because it was interrupted by a signal. In other cases (ECHILD and EINVAL) the call is going to always fail, so there's won't be any callbacks to invoke.
Yup.
Ok, I'm done with all of this. It was pretty straight forward, but...
Please separate any changes you make to the existing sss_child_ctx into its own patch. I don't like just renaming it to _old, either. Please find a more appropriate way to replace this old object (preferably by converting it to use your new mechanism, also in a new patch).
... this, I'm trying to find a way to make the old code use the new mechanism. I think it would be better to remove the old one completely and replace it with the new one everywhere where it's used.
Pavel
New patch with the above mentioned issues fixed is attached to this email.
We agreed that we're going to replace all uses of the old system by the new. Expect about 3 patches to follow on this one - one for each provider using the old system.
Pavel
Thanks for taking the time to review this. New patch attached.
Pavel
Two more issues (sorry for not catching all this at once):
Please don't just use DEBUG(0) for all the messages. DEBUG(0) is supposed to be used for fatal errors that cause the provider to quit completely.
See https://fedorahosted.org/pipermail/sssd-devel/2011-June/006407.html for explanation of the log levels. In general, this patch is going to be included in 1.7 and onwards -- please use the new debug levels (see src/util/util.h) defined as macros instead of integers.
The second issue is that I'm getting a segfault when logging in via Kerberos, the backtrace is pointing to tevent_common_check_signal().
On 11/01/2011 02:58 PM, Jakub Hrozek wrote:
On Tue, Nov 01, 2011 at 11:34:27AM +0100, Pavel Zuna wrote:
On 10/31/2011 04:29 PM, Jakub Hrozek wrote:
On Thu, Oct 27, 2011 at 02:03:25PM +0200, Pavel Zuna wrote:
On 10/25/2011 01:08 PM, Pavel Zuna wrote:
On 10/21/2011 06:44 PM, Stephen Gallagher wrote:
On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote: > Base on the second proposal: > > https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld > > There is some old SIGCHLD handling code in src/providers/child_common.[ch], that > should probably go away if this gets accepted. There was also a naming conflict > with the sss_child_ctx structure. This structure is only used internally by > functions defined src/providers/child_common.c. I renamed the original structure > to sss_child_ctx_old for now.
Nack (but a good start)
sss_child_destructor() cannot return the result of hash_delete(). Talloc destructors may only return 0 on success or -1 on failure. However, because destructor failure results in cancelling the free(), we should just log an error here and return 0.
sss_child_register: Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want to leave child_ctx untouched until successful completion. Also, please don't use the variable name "tmp". It's not descriptive. Call it "child".
Instead of allocating *child_ctx at the beginning, allocate "child" on the mem_ctx.
child = talloc_zero(mem_ctx, struct sss_child_ctx);
Then, only at the end of execution, right before returning EOK, do *child_ctx = child;
In sss_child_register():
- error = hash_enter(sigchld_ctx->children,&key,&value);
- if (error != HASH_SUCCESS&& error != HASH_ERROR_KEY_NOT_FOUND) {
I don't think hash_enter can return HASH_ERROR_KEY_NOT_FOUND (what would it mean?) and even if it did, why ignore it?
sss_child_invoke_cb(): I'm wary of having the callback invocation free the child_ctx. It needs to be documented well (so that consumers know they must not free it once the callback has fired or they'll get a double-free).
This still needs fixing.
Forgot about that, sorry. Replaced the talloc_free with removing child_ctx from the hash table only.
sss_child_handler(): You need to check whether waitpid() returned an error and handle that appropriately. waitpid() will return -1 on error and set errno. If the errno value is EINTR, you should retry (because waitpid() was interrupted by a signal at the time). Other errors should generate a DEBUG message.
I think you misunderstood how the error handling should be done:
if (pid == -1) {
if (errno == ECHILD) {
DEBUG(0, ("waitpid failed with ECHILD\n"));
} else if (errno == EINVAL) {
DEBUG(0, ("waitpid failed with EINVAL\n"));
}
return;
}
Instead of having a DEBUG() call per retval, it would be better to inform user on why the child process exited, similarly to how child_sig_handler() does it now.
I don't agree. The system is designed to inform providers that a child has exited by invoking their callbacks. It's up to the provider to decide what to do with the result and inform the user if appropriate. This would also make sense only in the case of a successful call (pid>= 0).
OK, then why the separate branches for ECHILD and EINVAL? Isn't it easier to just print errno and strerror(errno) ?
Yeah, it's definitely better. Improved in current patch. :)
In case of EINTR, we need to repeat the call to waitpid, because it was interrupted by a signal. In other cases (ECHILD and EINVAL) the call is going to always fail, so there's won't be any callbacks to invoke.
Yup.
Ok, I'm done with all of this. It was pretty straight forward, but...
Please separate any changes you make to the existing sss_child_ctx into its own patch. I don't like just renaming it to _old, either. Please find a more appropriate way to replace this old object (preferably by converting it to use your new mechanism, also in a new patch).
... this, I'm trying to find a way to make the old code use the new mechanism. I think it would be better to remove the old one completely and replace it with the new one everywhere where it's used.
Pavel
New patch with the above mentioned issues fixed is attached to this email.
We agreed that we're going to replace all uses of the old system by the new. Expect about 3 patches to follow on this one - one for each provider using the old system.
Pavel
Thanks for taking the time to review this. New patch attached.
Pavel
Two more issues (sorry for not catching all this at once):
Please don't just use DEBUG(0) for all the messages. DEBUG(0) is supposed to be used for fatal errors that cause the provider to quit completely.
See https://fedorahosted.org/pipermail/sssd-devel/2011-June/006407.html for explanation of the log levels. In general, this patch is going to be included in 1.7 and onwards -- please use the new debug levels (see src/util/util.h) defined as macros instead of integers.
Ok.
The second issue is that I'm getting a segfault when logging in via Kerberos, the backtrace is pointing to tevent_common_check_signal().
Finally found what the problem was. I suspect there might be a bug in tevent.
The SEGFAULT was generated from tevent_common_check_signal as you pointed out. It happened when copying a portion of siginfo that my signal handler wasn't using.
Simply adding the SA_SIGINFO flag to tevent_add_signal solved the issue.
Updated patch attached.
Pavel
On Wed, 2011-11-09 at 15:31 +0100, Pavel Zuna wrote:
Updated patch attached.
Nack.
Please make "struct sss_sigchild_ctx" an opaque structure and provide an init routine to create the hash and set up the signal handler. This way we only have to call sss_sigchild_init(ctx, &ctx->sigchld_ctx); in be_process_init() (or any other executable that needs child control).
In sss_child_handler(), don't consider it fatal if the PID isn't found in the hash. It could mean that something we're linked against did a fork and was trying to maintain itself. This is bad, but not fatal. Lower the debug level a bit (I think SSSDBG_OP_FAILURE is fine) and allow the waitpid() loop to continue, rather than exiting immediately.
sss_child_ctx should be moved to child_common.c. Its internals should not be visible elsewhere.
Please reorder sss_child_invoke_cb() to remove the entry from the hash before invoking the callback (you cannot assume that the callback won't do something funny to the hash structure or child_ctx, though it should be safe). Also, please raise the debug level for this failure to SSSDBG_OP_FAILURE.
On Tue, 2011-11-22 at 08:58 -0500, Stephen Gallagher wrote:
On Wed, 2011-11-09 at 15:31 +0100, Pavel Zuna wrote:
Updated patch attached.
Nack.
Please make "struct sss_sigchild_ctx" an opaque structure and provide an init routine to create the hash and set up the signal handler. This way we only have to call sss_sigchild_init(ctx, &ctx->sigchld_ctx); in be_process_init() (or any other executable that needs child control).
In sss_child_handler(), don't consider it fatal if the PID isn't found in the hash. It could mean that something we're linked against did a fork and was trying to maintain itself. This is bad, but not fatal. Lower the debug level a bit (I think SSSDBG_OP_FAILURE is fine) and allow the waitpid() loop to continue, rather than exiting immediately.
sss_child_ctx should be moved to child_common.c. Its internals should not be visible elsewhere.
Please reorder sss_child_invoke_cb() to remove the entry from the hash before invoking the callback (you cannot assume that the callback won't do something funny to the hash structure or child_ctx, though it should be safe). Also, please raise the debug level for this failure to SSSDBG_OP_FAILURE.
I've taken over the development of this patch. I'm sending my changes to address the above issues except for a slight modification to my comments about the sss_child_handler loop - it was already ignoring unknown errors, but it was exiting on unexpected failures. I added a comment and will allow it to try to continue looping through in the hopes that it will minimize the potential zombies.
This patch does not currently consume the functions that it creates. I am working on converting the existing child handlers to use this code, but I want to get this patch looked at and acked in the meantime.
On Tue, Dec 13, 2011 at 11:14:31AM -0500, Stephen Gallagher wrote:
On Tue, 2011-11-22 at 08:58 -0500, Stephen Gallagher wrote:
On Wed, 2011-11-09 at 15:31 +0100, Pavel Zuna wrote:
Updated patch attached.
Nack.
Please make "struct sss_sigchild_ctx" an opaque structure and provide an init routine to create the hash and set up the signal handler. This way we only have to call sss_sigchild_init(ctx, &ctx->sigchld_ctx); in be_process_init() (or any other executable that needs child control).
In sss_child_handler(), don't consider it fatal if the PID isn't found in the hash. It could mean that something we're linked against did a fork and was trying to maintain itself. This is bad, but not fatal. Lower the debug level a bit (I think SSSDBG_OP_FAILURE is fine) and allow the waitpid() loop to continue, rather than exiting immediately.
sss_child_ctx should be moved to child_common.c. Its internals should not be visible elsewhere.
Please reorder sss_child_invoke_cb() to remove the entry from the hash before invoking the callback (you cannot assume that the callback won't do something funny to the hash structure or child_ctx, though it should be safe). Also, please raise the debug level for this failure to SSSDBG_OP_FAILURE.
I've taken over the development of this patch. I'm sending my changes to address the above issues except for a slight modification to my comments about the sss_child_handler loop - it was already ignoring unknown errors, but it was exiting on unexpected failures. I added a comment and will allow it to try to continue looping through in the hopes that it will minimize the potential zombies.
This patch does not currently consume the functions that it creates. I am working on converting the existing child handlers to use this code, but I want to get this patch looked at and acked in the meantime.
Are you going to fix the two comments you had originally in the later patch? Those were don't consider it fatal if the PID isn't found in the hash in sss_child_handler() and reordering sss_child_invoke_cb()
If so, then ack
On Dec 14, 2011, at 9:09 AM, Jakub Hrozek jhrozek@redhat.com wrote:
On Tue, Dec 13, 2011 at 11:14:31AM -0500, Stephen Gallagher wrote:
On Tue, 2011-11-22 at 08:58 -0500, Stephen Gallagher wrote:
On Wed, 2011-11-09 at 15:31 +0100, Pavel Zuna wrote:
Updated patch attached.
Nack.
Please make "struct sss_sigchild_ctx" an opaque structure and provide an init routine to create the hash and set up the signal handler. This way we only have to call sss_sigchild_init(ctx, &ctx->sigchld_ctx); in be_process_init() (or any other executable that needs child control).
In sss_child_handler(), don't consider it fatal if the PID isn't found in the hash. It could mean that something we're linked against did a fork and was trying to maintain itself. This is bad, but not fatal. Lower the debug level a bit (I think SSSDBG_OP_FAILURE is fine) and allow the waitpid() loop to continue, rather than exiting immediately.
sss_child_ctx should be moved to child_common.c. Its internals should not be visible elsewhere.
Please reorder sss_child_invoke_cb() to remove the entry from the hash before invoking the callback (you cannot assume that the callback won't do something funny to the hash structure or child_ctx, though it should be safe). Also, please raise the debug level for this failure to SSSDBG_OP_FAILURE.
I've taken over the development of this patch. I'm sending my changes to address the above issues except for a slight modification to my comments about the sss_child_handler loop - it was already ignoring unknown errors, but it was exiting on unexpected failures. I added a comment and will allow it to try to continue looping through in the hopes that it will minimize the potential zombies.
This patch does not currently consume the functions that it creates. I am working on converting the existing child handlers to use this code, but I want to get this patch looked at and acked in the meantime.
Are you going to fix the two comments you had originally in the later patch? Those were don't consider it fatal if the PID isn't found in the hash in sss_child_handler() and reordering sss_child_invoke_cb()
If so, then ack
The first part of that I already described above as being an incorrect reading of the source. I missed the second part and I'll fix that shortly.
sssd-devel mailing list sssd-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/sssd-devel
On Wed, 2011-12-14 at 09:15 -0500, Stephen Gallagher wrote:
On Dec 14, 2011, at 9:09 AM, Jakub Hrozek jhrozek@redhat.com wrote:
Are you going to fix the two comments you had originally in the later patch? Those were don't consider it fatal if the PID isn't found in the hash in sss_child_handler() and reordering sss_child_invoke_cb()
If so, then ack
The first part of that I already described above as being an incorrect reading of the source. I missed the second part and I'll fix that shortly.
New patch attached fixes the missing reordering of the sssd_child_invoke_cb() and adds more explicit handling of unknown PIDs (with a new DEBUG log message).
On Wed, 2011-12-14 at 13:03 -0500, Stephen Gallagher wrote:
On Wed, 2011-12-14 at 09:15 -0500, Stephen Gallagher wrote:
On Dec 14, 2011, at 9:09 AM, Jakub Hrozek jhrozek@redhat.com wrote:
Are you going to fix the two comments you had originally in the later patch? Those were don't consider it fatal if the PID isn't found in the hash in sss_child_handler() and reordering sss_child_invoke_cb()
If so, then ack
The first part of that I already described above as being an incorrect reading of the source. I missed the second part and I'll fix that shortly.
New patch attached fixes the missing reordering of the sssd_child_invoke_cb() and adds more explicit handling of unknown PIDs (with a new DEBUG log message).
I located one very serious bug while working on some dependent patches. The waitpid() in this patch was using a bitwise AND with the WNOHANG instead of a bitwise OR (resulting in the WNOHANG being disabled and causing waitpid() to spin forever in some situations). New patch attached.
I also moved the child_common routines out of the provider directory and into util. I did this in a second patch so it's easier to follow the changes.
On Wed, 2011-12-14 at 13:03 -0500, Stephen Gallagher wrote:
On Wed, 2011-12-14 at 09:15 -0500, Stephen Gallagher wrote:
On Dec 14, 2011, at 9:09 AM, Jakub Hrozek jhrozek@redhat.com wrote:
Are you going to fix the two comments you had originally in the later patch? Those were don't consider it fatal if the PID isn't found in the hash in sss_child_handler() and reordering sss_child_invoke_cb()
If so, then ack
The first part of that I already described above as being an incorrect reading of the source. I missed the second part and I'll fix that shortly.
New patch attached fixes the missing reordering of the sssd_child_invoke_cb() and adds more explicit handling of unknown PIDs (with a new DEBUG log message).
I discovered two more issues with this patch, one serious, one cosmetic but annoying.
1) sss_child_register() wasn't assigning child->sigchld_ctx, so if the child was freed, the destructor would segfault trying to remove this child from the hash table.
2) sss_child_handler() needed to ignore a return of pid=0 (which would happen every time as the last pass through) or else it would print an error trying to find pid 0 in the hash table (which it had better not be. I don't want SSSD monitoring init!)
New patches attached.
On Wed, Dec 14, 2011 at 10:00:58PM -0500, Stephen Gallagher wrote:
On Wed, 2011-12-14 at 13:03 -0500, Stephen Gallagher wrote:
On Wed, 2011-12-14 at 09:15 -0500, Stephen Gallagher wrote:
On Dec 14, 2011, at 9:09 AM, Jakub Hrozek jhrozek@redhat.com wrote:
Are you going to fix the two comments you had originally in the later patch? Those were don't consider it fatal if the PID isn't found in the hash in sss_child_handler() and reordering sss_child_invoke_cb()
If so, then ack
The first part of that I already described above as being an incorrect reading of the source. I missed the second part and I'll fix that shortly.
New patch attached fixes the missing reordering of the sssd_child_invoke_cb() and adds more explicit handling of unknown PIDs (with a new DEBUG log message).
I discovered two more issues with this patch, one serious, one cosmetic but annoying.
- sss_child_register() wasn't assigning child->sigchld_ctx, so if the
child was freed, the destructor would segfault trying to remove this child from the hash table.
- sss_child_handler() needed to ignore a return of pid=0 (which would
happen every time as the last pass through) or else it would print an error trying to find pid 0 in the hash table (which it had better not be. I don't want SSSD monitoring init!)
init is pid=1 :-)
New patches attached.
Ack to this version.
On Mon, 2011-12-19 at 12:59 +0100, Jakub Hrozek wrote:
On Wed, Dec 14, 2011 at 10:00:58PM -0500, Stephen Gallagher wrote:
On Wed, 2011-12-14 at 13:03 -0500, Stephen Gallagher wrote:
On Wed, 2011-12-14 at 09:15 -0500, Stephen Gallagher wrote:
On Dec 14, 2011, at 9:09 AM, Jakub Hrozek jhrozek@redhat.com wrote:
Are you going to fix the two comments you had originally in the later patch? Those were don't consider it fatal if the PID isn't found in the hash in sss_child_handler() and reordering sss_child_invoke_cb()
If so, then ack
The first part of that I already described above as being an incorrect reading of the source. I missed the second part and I'll fix that shortly.
New patch attached fixes the missing reordering of the sssd_child_invoke_cb() and adds more explicit handling of unknown PIDs (with a new DEBUG log message).
I discovered two more issues with this patch, one serious, one cosmetic but annoying.
- sss_child_register() wasn't assigning child->sigchld_ctx, so if the
child was freed, the destructor would segfault trying to remove this child from the hash table.
- sss_child_handler() needed to ignore a return of pid=0 (which would
happen every time as the last pass through) or else it would print an error trying to find pid 0 in the hash table (which it had better not be. I don't want SSSD monitoring init!)
init is pid=1 :-)
New patches attached.
Ack to this version.
Pushed to master.
On Wed, Nov 09, 2011 at 03:31:42PM +0100, Pavel Zuna wrote:
On 11/01/2011 02:58 PM, Jakub Hrozek wrote:
On Tue, Nov 01, 2011 at 11:34:27AM +0100, Pavel Zuna wrote:
On 10/31/2011 04:29 PM, Jakub Hrozek wrote:
On Thu, Oct 27, 2011 at 02:03:25PM +0200, Pavel Zuna wrote:
On 10/25/2011 01:08 PM, Pavel Zuna wrote:
On 10/21/2011 06:44 PM, Stephen Gallagher wrote: >On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote: >>Base on the second proposal: >> >>https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld >> >>There is some old SIGCHLD handling code in src/providers/child_common.[ch], that >>should probably go away if this gets accepted. There was also a naming conflict >>with the sss_child_ctx structure. This structure is only used internally by >>functions defined src/providers/child_common.c. I renamed the original structure >>to sss_child_ctx_old for now. > > >Nack (but a good start) > >sss_child_destructor() cannot return the result of hash_delete(). Talloc >destructors may only return 0 on success or -1 on failure. However, >because destructor failure results in cancelling the free(), we should >just log an error here and return 0. > >sss_child_register: >Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want >to leave child_ctx untouched until successful completion. Also, please >don't use the variable name "tmp". It's not descriptive. Call it >"child". > >Instead of allocating *child_ctx at the beginning, allocate "child" on >the mem_ctx. > >child = talloc_zero(mem_ctx, struct sss_child_ctx); > >Then, only at the end of execution, right before returning EOK, do >*child_ctx = child; >
In sss_child_register():
- error = hash_enter(sigchld_ctx->children,&key,&value);
- if (error != HASH_SUCCESS&& error != HASH_ERROR_KEY_NOT_FOUND) {
I don't think hash_enter can return HASH_ERROR_KEY_NOT_FOUND (what would it mean?) and even if it did, why ignore it?
> >sss_child_invoke_cb(): >I'm wary of having the callback invocation free the child_ctx. It needs >to be documented well (so that consumers know they must not free it once >the callback has fired or they'll get a double-free).
This still needs fixing.
Forgot about that, sorry. Replaced the talloc_free with removing child_ctx from the hash table only.
> >sss_child_handler(): >You need to check whether waitpid() returned an error and handle that >appropriately. waitpid() will return -1 on error and set errno. If the >errno value is EINTR, you should retry (because waitpid() was >interrupted by a signal at the time). Other errors should generate a >DEBUG message.
I think you misunderstood how the error handling should be done:
if (pid == -1) {
if (errno == ECHILD) {
DEBUG(0, ("waitpid failed with ECHILD\n"));
} else if (errno == EINVAL) {
DEBUG(0, ("waitpid failed with EINVAL\n"));
}
return;
}
Instead of having a DEBUG() call per retval, it would be better to inform user on why the child process exited, similarly to how child_sig_handler() does it now.
I don't agree. The system is designed to inform providers that a child has exited by invoking their callbacks. It's up to the provider to decide what to do with the result and inform the user if appropriate. This would also make sense only in the case of a successful call (pid>= 0).
OK, then why the separate branches for ECHILD and EINVAL? Isn't it easier to just print errno and strerror(errno) ?
Yeah, it's definitely better. Improved in current patch. :)
In case of EINTR, we need to repeat the call to waitpid, because it was interrupted by a signal. In other cases (ECHILD and EINVAL) the call is going to always fail, so there's won't be any callbacks to invoke.
Yup.
>
Ok, I'm done with all of this. It was pretty straight forward, but...
> >Please separate any changes you make to the existing sss_child_ctx into >its own patch. I don't like just renaming it to _old, either. Please >find a more appropriate way to replace this old object (preferably by >converting it to use your new mechanism, also in a new patch). >
... this, I'm trying to find a way to make the old code use the new mechanism. I think it would be better to remove the old one completely and replace it with the new one everywhere where it's used.
Pavel
New patch with the above mentioned issues fixed is attached to this email.
We agreed that we're going to replace all uses of the old system by the new. Expect about 3 patches to follow on this one - one for each provider using the old system.
Pavel
Thanks for taking the time to review this. New patch attached.
Pavel
Two more issues (sorry for not catching all this at once):
Please don't just use DEBUG(0) for all the messages. DEBUG(0) is supposed to be used for fatal errors that cause the provider to quit completely.
See https://fedorahosted.org/pipermail/sssd-devel/2011-June/006407.html for explanation of the log levels. In general, this patch is going to be included in 1.7 and onwards -- please use the new debug levels (see src/util/util.h) defined as macros instead of integers.
Ok.
The second issue is that I'm getting a segfault when logging in via Kerberos, the backtrace is pointing to tevent_common_check_signal().
Finally found what the problem was. I suspect there might be a bug in tevent.
The SEGFAULT was generated from tevent_common_check_signal as you pointed out. It happened when copying a portion of siginfo that my signal handler wasn't using.
Simply adding the SA_SIGINFO flag to tevent_add_signal solved the issue.
Updated patch attached.
Pavel
The patch is OK now, I just needed to use a 3-way merge to get it applied - probably some patches landed in master since it was submitted for review.
Please ask Simo about the flag, this is too low-level tevent knowledge for me.
I'll ack the patch once you know what was the tevent problem.
On Thu, Oct 27, 2011 at 02:03:25PM +0200, Pavel Zuna wrote:
On 10/25/2011 01:08 PM, Pavel Zuna wrote:
On 10/21/2011 06:44 PM, Stephen Gallagher wrote:
On Fri, 2011-10-21 at 15:28 +0200, Pavel Zuna wrote:
Base on the second proposal:
https://fedorahosted.org/sssd/wiki/DesignDocs/SigChld
There is some old SIGCHLD handling code in src/providers/child_common.[ch], that should probably go away if this gets accepted. There was also a naming conflict with the sss_child_ctx structure. This structure is only used internally by functions defined src/providers/child_common.c. I renamed the original structure to sss_child_ctx_old for now.
Nack (but a good start)
sss_child_destructor() cannot return the result of hash_delete(). Talloc destructors may only return 0 on success or -1 on failure. However, because destructor failure results in cancelling the free(), we should just log an error here and return 0.
sss_child_register: Don't initialize sss_child_ctx *tmp to the child_ctx passed in. We want to leave child_ctx untouched until successful completion. Also, please don't use the variable name "tmp". It's not descriptive. Call it "child".
Instead of allocating *child_ctx at the beginning, allocate "child" on the mem_ctx.
child = talloc_zero(mem_ctx, struct sss_child_ctx);
Then, only at the end of execution, right before returning EOK, do *child_ctx = child;
sss_child_invoke_cb(): I'm wary of having the callback invocation free the child_ctx. It needs to be documented well (so that consumers know they must not free it once the callback has fired or they'll get a double-free).
sss_child_handler(): You need to check whether waitpid() returned an error and handle that appropriately. waitpid() will return -1 on error and set errno. If the errno value is EINTR, you should retry (because waitpid() was interrupted by a signal at the time). Other errors should generate a DEBUG message.
Ok, I'm done with all of this. It was pretty straight forward, but...
Please separate any changes you make to the existing sss_child_ctx into its own patch. I don't like just renaming it to _old, either. Please find a more appropriate way to replace this old object (preferably by converting it to use your new mechanism, also in a new patch).
... this, I'm trying to find a way to make the old code use the new mechanism. I think it would be better to remove the old one completely and replace it with the new one everywhere where it's used.
Pavel
New patch with the above mentioned issues fixed is attached to this email.
We agreed that we're going to replace all uses of the old system by the new. Expect about 3 patches to follow on this one - one for each provider using the old system.
Pavel
One more thing..this check is wrong:
+ sigchld_ctx = talloc_zero(ctx, struct sss_sigchild_ctx); + if (!ctx->sigchld_ctx) {
sssd-devel@lists.fedorahosted.org