https://fedorahosted.org/sssd/ticket/1090
Unfortunately there is not enough info in rhbz #754114 where the bug originated, so just band-aid the crash for now.
On Tue, 2011-12-13 at 16:01 +0100, Jakub Hrozek wrote:
https://fedorahosted.org/sssd/ticket/1090
Unfortunately there is not enough info in rhbz #754114 where the bug originated, so just band-aid the crash for now.
Nack.
Looking at this closely, I think what probably happened is that the child died while waiting for the ping_check to time out, so that when it finally returned, the memory for the mt_svc object had been freed. Because we then use talloc_get_type() to retrieve that memory, it detects that it's invalid and simply returns NULL (which we then dereference and crash on).
The real solution here would be to ensure that we cancel the DBusPendingCall with dbus_pending_call_cancel() if the mt_svc is freed.
So we need to modify the mt_svc structure to contain a DBusPendingCall * attribute that we assign to it when service_send_ping() calls sbus_conn_send() (currently, we just pass NULL for the 'pending' return to ignore it).
Then, we need to make sure that the mt_svc destructor will call dbus_pending_call_cancel() on mt_svc->pending, which will prevent us from reaching the ping_check() function.
(Obviously, we need to NULLify the mt_svc->pending attribute once we reach the ping_check() function correctly).
There's certainly no reason we can't ALSO include the band-aid of this patch, but I'd rather fix the underlying problem.
On Tue, 2011-12-13 at 10:18 -0500, Stephen Gallagher wrote:
On Tue, 2011-12-13 at 16:01 +0100, Jakub Hrozek wrote:
https://fedorahosted.org/sssd/ticket/1090
Unfortunately there is not enough info in rhbz #754114 where the bug originated, so just band-aid the crash for now.
Nack.
Looking at this closely, I think what probably happened is that the child died while waiting for the ping_check to time out, so that when it finally returned, the memory for the mt_svc object had been freed. Because we then use talloc_get_type() to retrieve that memory, it detects that it's invalid and simply returns NULL (which we then dereference and crash on).
maybe we should use talloc_get_type_abort() here ...
The real solution here would be to ensure that we cancel the DBusPendingCall with dbus_pending_call_cancel() if the mt_svc is freed.
So we need to modify the mt_svc structure to contain a DBusPendingCall * attribute that we assign to it when service_send_ping() calls sbus_conn_send() (currently, we just pass NULL for the 'pending' return to ignore it).
Ack
Then, we need to make sure that the mt_svc destructor will call dbus_pending_call_cancel() on mt_svc->pending, which will prevent us from reaching the ping_check() function.
(Obviously, we need to NULLify the mt_svc->pending attribute once we reach the ping_check() function correctly).
There's certainly no reason we can't ALSO include the band-aid of this patch, but I'd rather fix the underlying problem.
Ack x 2
Simo.
On Tue, 2011-12-13 at 10:32 -0500, Simo Sorce wrote:
Looking at this closely, I think what probably happened is that the child died while waiting for the ping_check to time out, so that when it finally returned, the memory for the mt_svc object had been freed. Because we then use talloc_get_type() to retrieve that memory, it detects that it's invalid and simply returns NULL (which we then dereference and crash on).
maybe we should use talloc_get_type_abort() here ...
Nack. I really, really don't want it to be possible for the monitor to crash.
On Tue, 2011-12-13 at 10:36 -0500, Stephen Gallagher wrote:
On Tue, 2011-12-13 at 10:32 -0500, Simo Sorce wrote:
Looking at this closely, I think what probably happened is that the child died while waiting for the ping_check to time out, so that when it finally returned, the memory for the mt_svc object had been freed. Because we then use talloc_get_type() to retrieve that memory, it detects that it's invalid and simply returns NULL (which we then dereference and crash on).
maybe we should use talloc_get_type_abort() here ...
Nack. I really, really don't want it to be possible for the monitor to crash.
In this case it makes no difference. If you reference freed memory or 'random' memory you will end up crashing as well.
The point is that we should never get to a point where talloc_get_type() fails and return NULL, unless it is legal to pass NULL in as input.
I am all for not having the monitor crash of course.
Simo.
On Tue, 2011-12-13 at 10:59 -0500, Simo Sorce wrote:
On Tue, 2011-12-13 at 10:36 -0500, Stephen Gallagher wrote:
On Tue, 2011-12-13 at 10:32 -0500, Simo Sorce wrote:
Looking at this closely, I think what probably happened is that the child died while waiting for the ping_check to time out, so that when it finally returned, the memory for the mt_svc object had been freed. Because we then use talloc_get_type() to retrieve that memory, it detects that it's invalid and simply returns NULL (which we then dereference and crash on).
maybe we should use talloc_get_type_abort() here ...
Nack. I really, really don't want it to be possible for the monitor to crash.
In this case it makes no difference. If you reference freed memory or 'random' memory you will end up crashing as well.
The point is that we should never get to a point where talloc_get_type() fails and return NULL, unless it is legal to pass NULL in as input.
I am all for not having the monitor crash of course.
Simo.
Well, in this particular case, our answer if it returns NULL will be to just exit and do nothing, which will keep the monitor running.
But as I said in my earlier comments, I think I identified the reason why it was coming up with NULL anyway.
On Tue, 2011-12-13 at 11:01 -0500, Stephen Gallagher wrote:
On Tue, 2011-12-13 at 10:59 -0500, Simo Sorce wrote:
On Tue, 2011-12-13 at 10:36 -0500, Stephen Gallagher wrote:
On Tue, 2011-12-13 at 10:32 -0500, Simo Sorce wrote:
Looking at this closely, I think what probably happened is that the child died while waiting for the ping_check to time out, so that when it finally returned, the memory for the mt_svc object had been freed. Because we then use talloc_get_type() to retrieve that memory, it detects that it's invalid and simply returns NULL (which we then dereference and crash on).
maybe we should use talloc_get_type_abort() here ...
Nack. I really, really don't want it to be possible for the monitor to crash.
In this case it makes no difference. If you reference freed memory or 'random' memory you will end up crashing as well.
The point is that we should never get to a point where talloc_get_type() fails and return NULL, unless it is legal to pass NULL in as input.
I am all for not having the monitor crash of course.
Simo.
Well, in this particular case, our answer if it returns NULL will be to just exit and do nothing, which will keep the monitor running.
Only because of chance, you may as well get a segfault as the pointer may be pointing to reclaimed memory that is not accessible by the process anymore.
But as I said in my earlier comments, I think I identified the reason why it was coming up with NULL anyway.
Yes, and I totally agree with your analysis, I just do not like to rely on something working because it recognized it was accessing freed memory and returned NULL as it is not safe.
Simo.
On Tue, Dec 13, 2011 at 10:18:10AM -0500, Stephen Gallagher wrote:
On Tue, 2011-12-13 at 16:01 +0100, Jakub Hrozek wrote:
https://fedorahosted.org/sssd/ticket/1090
Unfortunately there is not enough info in rhbz #754114 where the bug originated, so just band-aid the crash for now.
Nack.
Looking at this closely, I think what probably happened is that the child died while waiting for the ping_check to time out, so that when it finally returned, the memory for the mt_svc object had been freed. Because we then use talloc_get_type() to retrieve that memory, it detects that it's invalid and simply returns NULL (which we then dereference and crash on).
The real solution here would be to ensure that we cancel the DBusPendingCall with dbus_pending_call_cancel() if the mt_svc is freed.
So we need to modify the mt_svc structure to contain a DBusPendingCall * attribute that we assign to it when service_send_ping() calls sbus_conn_send() (currently, we just pass NULL for the 'pending' return to ignore it).
Then, we need to make sure that the mt_svc destructor will call dbus_pending_call_cancel() on mt_svc->pending, which will prevent us from reaching the ping_check() function.
(Obviously, we need to NULLify the mt_svc->pending attribute once we reach the ping_check() function correctly).
There's certainly no reason we can't ALSO include the band-aid of this patch, but I'd rather fix the underlying problem.
Thank you for the detailed explanation. A new patch is attached.
On Tue, 2011-12-13 at 19:46 +0100, Jakub Hrozek wrote:
On Tue, Dec 13, 2011 at 10:18:10AM -0500, Stephen Gallagher wrote:
On Tue, 2011-12-13 at 16:01 +0100, Jakub Hrozek wrote:
https://fedorahosted.org/sssd/ticket/1090
Unfortunately there is not enough info in rhbz #754114 where the bug originated, so just band-aid the crash for now.
Nack.
Looking at this closely, I think what probably happened is that the child died while waiting for the ping_check to time out, so that when it finally returned, the memory for the mt_svc object had been freed. Because we then use talloc_get_type() to retrieve that memory, it detects that it's invalid and simply returns NULL (which we then dereference and crash on).
The real solution here would be to ensure that we cancel the DBusPendingCall with dbus_pending_call_cancel() if the mt_svc is freed.
So we need to modify the mt_svc structure to contain a DBusPendingCall * attribute that we assign to it when service_send_ping() calls sbus_conn_send() (currently, we just pass NULL for the 'pending' return to ignore it).
Then, we need to make sure that the mt_svc destructor will call dbus_pending_call_cancel() on mt_svc->pending, which will prevent us from reaching the ping_check() function.
(Obviously, we need to NULLify the mt_svc->pending attribute once we reach the ping_check() function correctly).
There's certainly no reason we can't ALSO include the band-aid of this patch, but I'd rather fix the underlying problem.
Thank you for the detailed explanation. A new patch is attached.
Ack and pushed to master.
sssd-devel@lists.fedorahosted.org