This patch attempts to address the problem Eugene found with sdap_handle_release()
Simo.
Hi Simo,
On 04/13/2010 08:51 PM, Simo Sorce wrote:
This patch attempts to address the problem Eugene found with sdap_handle_release()
I have checked the suggested patch and found several problems:
1. providers/ipa/ipa_access.c still lacks failover retries
2. Why number of retries is 2? It is only enough for one server. If we have failover set up we need more retries (at least 1 per server + 1 for stale cached connection)
3. My patch contained an additional fix to sdap_handle_release. Before calling op callbacks LDAP callback data was destroyed:
commit: sssd-1_1_1 line 102: talloc_zfree(sh->conncb->lc_arg);
and later LDAP handle was unbound: line 116: ldap_unbind_ext(sh->ldap, NULL, NULL);
But during unbind the LDAP callbacks are called. And sssd callback (sdap_ldap_connect_callback_del) accesses zero pointer (lc_arg).
So before freeing sh->conncb->lc_arg it is necessary to call: /* remove callback first */ ldap_get_option(sh->ldap, LDAP_OPT_CONNECT_CB, sh->conncb);
Actually I wonder how does it work for you without this fix? Probably you have never tried to close a connection with still open file descriptor.
4. The suggested patch is a workaround. It relies on the agreement that all _send functions will be called from top level event context, not from subreq callback context. This is hard to guarantee.
I have attached to this e-mail a version of my patch addressing the same problem that applies directly on top of sssd 1.1.1.
I think its better to use my solution as:
1. It provides a real fix for the problem rather than workaround 2. The complexity of patches is alike (5 files, 213 versus 138 changed lines) 3. The real changes in my patch are localized to sdap.h, sdap_async_private.h, sdap_handle_create, sdap_handle_release 4. Other changes are cosmetic and related to changes in structure definition and therefore can be easily checked by peer review
Regards, Eugene
On Wed, Apr 14, 2010 at 10:48:28AM +0400, Eugene Indenbom wrote: ...
- My patch contained an additional fix to sdap_handle_release.
Before calling op callbacks LDAP callback data was destroyed:
commit: sssd-1_1_1 line 102: talloc_zfree(sh->conncb->lc_arg);
and later LDAP handle was unbound: line 116: ldap_unbind_ext(sh->ldap, NULL, NULL);
But during unbind the LDAP callbacks are called. And sssd callback (sdap_ldap_connect_callback_del) accesses zero pointer (lc_arg).
So before freeing sh->conncb->lc_arg it is necessary to call: /* remove callback first */ ldap_get_option(sh->ldap, LDAP_OPT_CONNECT_CB, sh->conncb);
Actually I wonder how does it work for you without this fix? Probably you have never tried to close a connection with still open file descriptor.
...
This is not necessary, because sdap_ldap_connect_callback_del() checks if lc_arg is NULL.
bye, Sumit
On 04/14/2010 11:13 AM, Sumit Bose wrote:
On Wed, Apr 14, 2010 at 10:48:28AM +0400, Eugene Indenbom wrote: ...
- My patch contained an additional fix to sdap_handle_release.
Before calling op callbacks LDAP callback data was destroyed:
commit: sssd-1_1_1 line 102: talloc_zfree(sh->conncb->lc_arg);
and later LDAP handle was unbound: line 116: ldap_unbind_ext(sh->ldap, NULL, NULL);
But during unbind the LDAP callbacks are called. And sssd callback (sdap_ldap_connect_callback_del) accesses zero pointer (lc_arg).
So before freeing sh->conncb->lc_arg it is necessary to call: /* remove callback first */ ldap_get_option(sh->ldap, LDAP_OPT_CONNECT_CB, sh->conncb);
Actually I wonder how does it work for you without this fix? Probably you have never tried to close a connection with still open file descriptor.
...
This is not necessary, because sdap_ldap_connect_callback_del() checks if lc_arg is NULL.
That's right, but I remember that during debugging of my patch I have indeed run across a crash related to this early free.
Probably it was not in sdap_ldap_connect_callback_del, but rather in sdap_ldap_connect_callback_add. Although I can not right now figure out how to reproduce it.
It's anyway safer to remove callback first and then free the callback data.
Regards, Eugene
On 04/14/2010 11:53 AM, Eugene Indenbom wrote:
On 04/14/2010 11:13 AM, Sumit Bose wrote:
On Wed, Apr 14, 2010 at 10:48:28AM +0400, Eugene Indenbom wrote: ...
- My patch contained an additional fix to sdap_handle_release.
Before calling op callbacks LDAP callback data was destroyed:
commit: sssd-1_1_1 line 102: talloc_zfree(sh->conncb->lc_arg);
and later LDAP handle was unbound: line 116: ldap_unbind_ext(sh->ldap, NULL, NULL);
But during unbind the LDAP callbacks are called. And sssd callback (sdap_ldap_connect_callback_del) accesses zero pointer (lc_arg).
So before freeing sh->conncb->lc_arg it is necessary to call: /* remove callback first */ ldap_get_option(sh->ldap, LDAP_OPT_CONNECT_CB, sh->conncb);
Actually I wonder how does it work for you without this fix? Probably you have never tried to close a connection with still open file descriptor.
...
This is not necessary, because sdap_ldap_connect_callback_del() checks if lc_arg is NULL.
That's right, but I remember that during debugging of my patch I have indeed run across a crash related to this early free.
Probably it was not in sdap_ldap_connect_callback_del, but rather in sdap_ldap_connect_callback_add. Although I can not right now figure out how to reproduce it.
It's anyway safer to remove callback first and then free the callback data.
I have found out how it has happened:
1. sdap_handle_release sets sdap_handle->connected = false near the end of the method, after callback are called. 2. So in recursion scenario (without Simo's patch and with partial changes from my patch #4), the connection was considered good in callbacks and reused. 3. As result new file descriptor was opened by LDAP and sdap_ldap_connect_callback_add was called. So here we have a crash.
So yet another important fix in my patch is that it sets sdap_handle->connected to false as soon as possible, thus preventing connection reuse.
Regards, Eugene
On Wed, 14 Apr 2010 10:48:28 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
- /* private data is allocated as top-level talloc context as
* it may need to survive for a while after main handledestruction,
* see sdap_handle_data_lock/sdap_handle_data_release */- sh->private = talloc_zero(NULL, struct sdap_handle_data);
- if (!sh->private) {
goto fail;- }
- sh->private->sh = sh; return sh;
Sorry Eugene, but I *really* do not like things like this. - Allocating on NULL: very high risk of memory leaks - Assigning sh to private->sh: very high risk of accessing free memory if sh is freed before private is.
If preserving the handle is important I'd rather have a list of handles, with old ones pushed down the list as soon as a connection is marked bad so that they are not reused, but also not lost (memory wise).
On 04/14/2010 04:51 PM, Simo Sorce wrote:
On Wed, 14 Apr 2010 10:48:28 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
- /* private data is allocated as top-level talloc context as
* it may need to survive for a while after main handledestruction,
* see sdap_handle_data_lock/sdap_handle_data_release */- sh->private = talloc_zero(NULL, struct sdap_handle_data);
- if (!sh->private) {
goto fail;- }
- sh->private->sh = sh; return sh;
Sorry Eugene, but I *really* do not like things like this.
- Allocating on NULL: very high risk of memory leaks
We can allocate on sdap_handle_data on parent sdap_handle context and later use talloc_steal to NULL context when we need to keep pointer to sdap_handle_data until stack unwinds.
So normally sdap_handle_data would be owned, but when it orphan it will become orphan in TALLOC_CTX sense too.
But I do not see any risk here: there is a destructor set on sdap_handle that take cares of destruction of sdap_handle_data. There is no way to get this out of control.
- Assigning sh to private->sh: very high risk of accessing free memory if sh is freed before private is.
private->sh is actually a boolean saying whether private data is orphan or not. I see no reason not to have a pointer to owner at hand, but this is not necessary.
And again there is no risk of having stale pointer here as there is a destructor set on sdap_hanlde, that takes care of clearing the pointer on time.
If preserving the handle is important I'd rather have a list of handles, with old ones pushed down the list as soon as a connection is marked bad so that they are not reused, but also not lost (memory wise).
So it looks like you do not really believe that talloc destructors are always called and it's reliable to free memory in destructors. In my humble opinion if memory deallocation is controlled from destructor there is actually no actual need for talloc ownership.
You suggestion of putting sdap_handle objects on destroy queue and garbage collecting them later does not sound as better alternative.
On 04/14/2010 05:00 PM, Simo Sorce wrote:
On Wed, 14 Apr 2010 10:48:28 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
- The suggested patch is a workaround. It relies on the agreement
that all _send functions will be called from top level event context, not from subreq callback context. This is hard to guarantee.
It is easier to analyze and agree on for now. It solves the problem, without changing the underlying logic. This is a big plus for a patch to go into a stable tree as it will not cause surprises.
That's wrong your patch is changing behaviour too: 1. It activates code paths that segfaulted before. 2. It changes time when sdap_handle is deallocated (and all its child objects like LDAP handle). This way it substantially changes memory distribution, object life time and other key parameters.
My patch behaves better in this sense. It at least does not change time when LDAP handle is deallocated. Actually my patch affects only sdap_handle destruction sequence and only in previously faulty case of deletion recursion.
Regards, Eugene
PS Well, it's your project anyway and it's up to you and the project team to decide on which way to take.
On Wed, 14 Apr 2010 17:34:39 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
So it looks like you do not really believe that talloc destructors are always called and it's reliable to free memory in destructors. In my humble opinion if memory deallocation is controlled from destructor there is actually no actual need for talloc ownership.
If you always destroy a handle in a destructor then freeing in the destructor is useless, you can simply make the destructor's parent the memory context parent. If you destroy conditionally in a destructor then you risk memory leaks, unless you have other ways to make sure the context is freed.
You suggestion of putting sdap_handle objects on destroy queue and garbage collecting them later does not sound as better alternative.
Unfortunately nothing is ideal here. But I want to play safe and have the memory hierarchy right rather then have special cases all over. Every special case is a long term burden.
Simo.
On 04/14/2010 06:02 PM, Simo Sorce wrote:
On Wed, 14 Apr 2010 17:34:39 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
So it looks like you do not really believe that talloc destructors are always called and it's reliable to free memory in destructors. In my humble opinion if memory deallocation is controlled from destructor there is actually no actual need for talloc ownership.
If you always destroy a handle in a destructor then freeing in the destructor is useless, you can simply make the destructor's parent the memory context parent. If you destroy conditionally in a destructor then you risk memory leaks, unless you have other ways to make sure the context is freed.
That's right, but garbage collecting logic is also a special case, having much bigger impact on how thing internally work and what is allowed and what is not.
I'd rather have destroy lock with 2 simple lock/unlock methods (like in my patch) than maintain garbage collector queue. And with garbage collector once again there is no guarantee that garbage collector queue and garbage collector itself would not be destroyed in event loop callback.
You suggestion of putting sdap_handle objects on destroy queue and garbage collecting them later does not sound as better alternative.
Unfortunately nothing is ideal here. But I want to play safe and have the memory hierarchy right rather then have special cases all over. Every special case is a long term burden.
Yes, there is nothing ideal. But there is one substantial difference between workaround and solution: - Workaround fixes problem by changing __other__ (client) code in the way that avoids triggering a bug in the faulty code. - Solution fixes the buggy code so it starts to work as expected without putting a burden on client code of following unnecessary restrictions.
Regards, Eugene
On Wed, 14 Apr 2010 18:21:50 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
Yes, there is nothing ideal. But there is one substantial difference between workaround and solution: - Workaround fixes problem by changing __other__ (client) code in the way that avoids triggering a bug in the faulty code. - Solution fixes the buggy code so it starts to work as expected without putting a burden on client code of following unnecessary restrictions.
I agree, I am sending a new patch on top of the last one to move all frees in the connection _recv() code.
This way we are guaranteed that it can't be called within a callback. This means no unnecessary restriction are necessary.
Simo.
On Wed, 14 Apr 2010 10:48:28 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
Hi Simo,
On 04/13/2010 08:51 PM, Simo Sorce wrote:
This patch attempts to address the problem Eugene found with sdap_handle_release()
I have checked the suggested patch and found several problems:
- providers/ipa/ipa_access.c still lacks failover retries
As I said this patch only fixes the bug without changing any behavior. I want to see the reconnection stuff on top of it later.
- Why number of retries is 2? It is only enough for one server. If
we have failover set up we need more retries (at least 1 per server + 1 for stale cached connection)
That's an arbirtary number, I came up just to avoid having it loop forever, it is by no means intended to replace your 4th patch. Just a temporary stopgap to be able to push this into stable earlier, with the understanding we will push a better patch set in master once we agree on its form.
- My patch contained an additional fix to sdap_handle_release.
Before calling op callbacks LDAP callback data was destroyed:
commit: sssd-1_1_1 line 102: talloc_zfree(sh->conncb->lc_arg);
and later LDAP handle was unbound: line 116: ldap_unbind_ext(sh->ldap, NULL, NULL);
But during unbind the LDAP callbacks are called. And sssd callback (sdap_ldap_connect_callback_del) accesses zero pointer (lc_arg).
So before freeing sh->conncb->lc_arg it is necessary to call: /* remove callback first */ ldap_get_option(sh->ldap, LDAP_OPT_CONNECT_CB, sh->conncb);
Actually I wonder how does it work for you without this fix? Probably you have never tried to close a connection with still open file descriptor.
As you found out later, my patch prevents this from happening because it marks the connection as faulty.
- The suggested patch is a workaround. It relies on the agreement
that all _send functions will be called from top level event context, not from subreq callback context. This is hard to guarantee.
It is easier to analyze and agree on for now. It solves the problem, without changing the underlying logic. This is a big plus for a patch to go into a stable tree as it will not cause surprises.
I have attached to this e-mail a version of my patch addressing the same problem that applies directly on top of sssd 1.1.1.
Se my comments on your approach, I really don't like it. We try very hard to avoid using references and allocations on NULL, as they often just end up being memory leaks and cause other issues when freeing memory hierarchies as they are not connected in.
I think its better to use my solution as:
- It provides a real fix for the problem rather than workaround
Both patches provide a real fix to the segfault you saw.
- The complexity of patches is alike (5 files, 213 versus 138
changed lines)
No it is not, your patches changes fundamentally how sdap_handle behaves, this has a higher inpact than you realize.
- The real changes in my patch are localized to sdap.h,
sdap_async_private.h, sdap_handle_create, sdap_handle_release
See comments about that vs the code.
- Other changes are cosmetic and related to changes in structure
definition and therefore can be easily checked by peer review
A seprate patch with cosmetic changes is always in order, but don't rush on it, so far I am not convinced your patch is ok, sorry.
Simo.
On Tue, 13 Apr 2010 12:51:39 -0400 Simo Sorce ssorce@redhat.com wrote:
This patch attempts to address the problem Eugene found with sdap_handle_release()
Attached a follow-up patch to address Eugene's legitimate concern about restrictions put on _send() functions. We never free gsh in a _send() either but only in the _recv() function of the call that actually establish a new connection. This guarantees that gsh will never be freed during sdap_handle_release() or any other error code path.
I can squash the 2 patches in a single one if people feels that is better. Comments welcome.
Simo.
On 04/14/2010 06:43 PM, Simo Sorce wrote:
On Tue, 13 Apr 2010 12:51:39 -0400 Simo Sorcessorce@redhat.com wrote:
This patch attempts to address the problem Eugene found with sdap_handle_release()
Attached a follow-up patch to address Eugene's legitimate concern about restrictions put on _send() functions. We never free gsh in a _send() either but only in the _recv() function of the call that actually establish a new connection. This guarantees that gsh will never be freed during sdap_handle_release() or any other error code path.
I can squash the 2 patches in a single one if people feels that is better. Comments welcome.
I would appreciate very much if during this review you compare the solution suggested by me with the solution suggested by Simo.
My strong opinion is that the solution suggested by me is more reliable, robust to future changes and has less impact on the way already released code operates.
Well, I am definitely a biased judge on this question. :-)
Regards, Eugene
On 04/13/2010 08:51 PM, Simo Sorce wrote:
This patch attempts to address the problem Eugene found with sdap_handle_release()
Simo.
I have found 2 critical problems in Simo's patch:
1. After setting sdap_handle->connected = false, LDAP handle is never destroyed. (see sdap_handle_release). So actually patch introduces huge resource leak.
2. Despite all efforts not to free sdap_handle from callbacks it is still freed when connect operation fails (together with connection tevent_req). So the following sequence still reproduces the bug: a. Using debugger break in sdap_cli_rootdse_step b. Restart LDAP server or network connection leading to it c. Resume execution d. Observe crash
Actually I still do not understand why it is feasible instead of fixing destruction of sdap_handle in callback to invent a workaround trying to avoid doing so.
The bug in destruction of sdap_handle in callback violates basic programming principles: - Every creatable object must by destructible (or in case of reference counted object it must be possible to release ownership/reference from object) - There should be no restriction on when method starting destruction is called - After object destruction is started it is object's responsibility to complete destruction as soon as possible. There should not be any tricks like keeping an object in garbage collecting queue.
Violation of the above principles calls for __path__ coverage analysis of code as we have to ensure that no __path__ of code destructs object at wrong time. Even if Simo fixes the above problems there will be no guarantee that sdap_handle destruction is never called from callback.
Regards, Eugene
PS Although my patch never sets sdap_handle->connected to false, I have attached an amendment to my patch to improve robustness of destruction code when somebody really sets sdap_handle->connected to false.
PPS Even without this amendment no resource leaks exist in my patch as all memory and callbacks are a bit later destroyed through talloc ownership.
On Thu, 15 Apr 2010 10:13:46 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
On 04/13/2010 08:51 PM, Simo Sorce wrote:
This patch attempts to address the problem Eugene found with sdap_handle_release()
Simo.
I have found 2 critical problems in Simo's patch:
- After setting sdap_handle->connected = false, LDAP handle is never
destroyed. (see sdap_handle_release). So actually patch introduces huge resource leak.
Ahh, thanks for catching this one, I forgot to remove the sh->connected check in sdap_handle_release()
- Despite all efforts not to free sdap_handle from callbacks it is
still freed when connect operation fails (together with connection tevent_req). So the following sequence still reproduces the bug: a. Using debugger break in sdap_cli_rootdse_step b. Restart LDAP server or network connection leading to it c. Resume execution d. Observe crash
This looks like a separate issue. Within the connect functions we are not using the global sdap_handle. I guess the issue here is that until we return from the connect request, the sh is a child of the request memory hierarchy, so when we call the callback within sdap_handle_release() we end up freeing the hierarchy.
Do you confirm ?
Actually I still do not understand why it is feasible instead of fixing destruction of sdap_handle in callback to invent a workaround trying to avoid doing so.
I am not sure why you call it a workaround, it is just a different solution.
The bug in destruction of sdap_handle in callback violates basic programming principles: - Every creatable object must by destructible (or in case of reference counted object it must be possible to release ownership/reference from object)
This is not really true in a program based on talloc, as you can see a talloc hierarchy as a set of references. We do indeed almost exclusively base our destruction sequences on talloc hierarchies. The only case where this is not entirely possible is indeed the interface code between the program and the ldap libraries (for obvious reasons), and I want to keep the difference restricted in an area as small as possible. One of the reasons I do not like your approach is that the talloc hierarchy become very fuzzy.
- There should be no restriction on when method startingdestruction is called
Not sure where you got this, but it is certainly not true :) You can't destroy objects in use. If you are thinking about deferred destruction that is what we want to achieve but using correct talloc hierarchies as much as possible. Unfortunately this is not the case, because of the interfcae we are forced into with the openldap libraries.
- After object destruction is started it is object'sresponsibility to complete destruction as soon as possible. There should not be any tricks like keeping an object in garbage collecting queue.
Unfortunately this is not possible as well, because pending requests all reference a common object (the one that ultimately holds the ldap handler).
Violation of the above principles calls for __path__ coverage analysis of code as we have to ensure that no __path__ of code destructs object at wrong time. Even if Simo fixes the above problems there will be no guarantee that sdap_handle destruction is never called from callback.
We just need to guarantee this, I will check again if I can steal some code from your patch though :)
Regards, Eugene
PS Although my patch never sets sdap_handle->connected to false, I have attached an amendment to my patch to improve robustness of destruction code when somebody really sets sdap_handle->connected to false.
PPS Even without this amendment no resource leaks exist in my patch as all memory and callbacks are a bit later destroyed through talloc ownership.
Given you alloc private on NULL, you can hardly claim it is released through a talloc hierarchy, but it is explicitly freed in sdap_handle_data_release().
I know the sdap_handle destructor() will end up caling it (through sdap_handle_release(), but then why all the games with the final argument ? Why not simply attach a destructor to sdap_handle_data that prevents destruction until it is safe ? Akso given we can prevent releasing sdap_handle too early through the destructor, why not simply apply all this to sdap_handle directly w/o requireing another sub-structure ?
Simo.
On Thu, 15 Apr 2010 09:14:37 -0400 Simo Sorce ssorce@redhat.com wrote:
I know the sdap_handle destructor() will end up caling it (through sdap_handle_release(), but then why all the games with the final argument ? Why not simply attach a destructor to sdap_handle_data that prevents destruction until it is safe ? Akso given we can prevent releasing sdap_handle too early through the destructor, why not simply apply all this to sdap_handle directly w/o requireing another sub-structure ?
Ah btw, while thinking about how you handle release of sdap_handle_data I think I found a case where you end up with some issues.
If you free sdap_handle while sdap_handle_data is locked sdap_handle_release will be called with final = ture, and this means sh->private is set to NULL. If later on sdap_handle_release is called again shd will be NULL (shd = sh->private) and sdap_handle_data_lock(shd); wiull derefernce a NULL pointer causing a segfault.
I haven't checked if this can actually happen, but given the ways you use locks here it looks like it might.
Simo.
On 04/15/2010 05:35 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 09:14:37 -0400 Simo Sorcessorce@redhat.com wrote:
I know the sdap_handle destructor() will end up caling it (through sdap_handle_release(), but then why all the games with the final argument ? Why not simply attach a destructor to sdap_handle_data that prevents destruction until it is safe ? Akso given we can prevent releasing sdap_handle too early through the destructor, why not simply apply all this to sdap_handle directly w/o requireing another sub-structure ?
Ah btw, while thinking about how you handle release of sdap_handle_data I think I found a case where you end up with some issues.
If you free sdap_handle while sdap_handle_data is locked sdap_handle_release will be called with final = ture, and this means sh->private is set to NULL. If later on sdap_handle_release is called again shd will be NULL (shd = sh->private) and sdap_handle_data_lock(shd); wiull derefernce a NULL pointer causing a segfault.
No, it will not. sdap_handle_data->sh is never dereferenced. :-) As I have written earlier it is simply a flag showing that sdap_handle_data still have a parent. When sdap_handle_data->sh is zero the structure deletes itself as soon as sdap_handle_data->destroy_lock is zero too (see sdap_handle_data_release).
Shall ever back pointer sdap_handle_data->sh be dereferenced, it needs to be checked for NULL before use. That's clear.
I haven't checked if this can actually happen, but given the ways you use locks here it looks like it might.
Sure this could not happen. I have __checked__ it and actually run my test scenario using step-by-step execution in debugger. I have reproduced the situation in debugger and made sure that both inner sdap_handle_destructor and outer sdap_handle_release complete without errors and that sdap_handle_data is actually freed (talloc_free is called). I have done all that.
Eugene
On Thu, 15 Apr 2010 17:58:18 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
On 04/15/2010 05:35 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 09:14:37 -0400 Simo Sorcessorce@redhat.com wrote:
I know the sdap_handle destructor() will end up caling it (through sdap_handle_release(), but then why all the games with the final argument ? Why not simply attach a destructor to sdap_handle_data that prevents destruction until it is safe ? Akso given we can prevent releasing sdap_handle too early through the destructor, why not simply apply all this to sdap_handle directly w/o requireing another sub-structure ?
Ah btw, while thinking about how you handle release of sdap_handle_data I think I found a case where you end up with some issues.
If you free sdap_handle while sdap_handle_data is locked sdap_handle_release will be called with final = ture, and this means sh->private is set to NULL. If later on sdap_handle_release is called again shd will be NULL (shd = sh->private) and sdap_handle_data_lock(shd); wiull derefernce a NULL pointer causing a segfault.
No, it will not. sdap_handle_data->sh is never dereferenced. :-) As I have written earlier it is simply a flag showing that sdap_handle_data still have a parent. When sdap_handle_data->sh is zero the structure deletes itself as soon as sdap_handle_data->destroy_lock is zero too (see sdap_handle_data_release).
Shall ever back pointer sdap_handle_data->sh be dereferenced, it needs to be checked for NULL before use. That's clear.
Sorry, I don't understand your comment, I never once mentioned sdap_handle_data->sh
I haven't checked if this can actually happen, but given the ways you use locks here it looks like it might.
Sure this could not happen. I have __checked__ it and actually run my test scenario using step-by-step execution in debugger. I have reproduced the situation in debugger and made sure that both inner sdap_handle_destructor and outer sdap_handle_release complete without errors and that sdap_handle_data is actually freed (talloc_free is called). I have done all that.
I trust you did. That's not the point.
Simo.
On 04/15/2010 07:16 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 17:58:18 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
On 04/15/2010 05:35 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 09:14:37 -0400 Simo Sorcessorce@redhat.com wrote:
I know the sdap_handle destructor() will end up caling it (through sdap_handle_release(), but then why all the games with the final argument ? Why not simply attach a destructor to sdap_handle_data that prevents destruction until it is safe ? Akso given we can prevent releasing sdap_handle too early through the destructor, why not simply apply all this to sdap_handle directly w/o requireing another sub-structure ?
Ah btw, while thinking about how you handle release of sdap_handle_data I think I found a case where you end up with some issues.
If you free sdap_handle while sdap_handle_data is locked sdap_handle_release will be called with final = ture, and this means sh->private is set to NULL. If later on sdap_handle_release is called again shd will be NULL (shd = sh->private) and sdap_handle_data_lock(shd); wiull derefernce a NULL pointer causing a segfault.
No, it will not. sdap_handle_data->sh is never dereferenced. :-) As I have written earlier it is simply a flag showing that sdap_handle_data still have a parent. When sdap_handle_data->sh is zero the structure deletes itself as soon as sdap_handle_data->destroy_lock is zero too (see sdap_handle_data_release).
Shall ever back pointer sdap_handle_data->sh be dereferenced, it needs to be checked for NULL before use. That's clear.
Sorry, I don't understand your comment, I never once mentioned sdap_handle_data->sh
Oops, I have got your concern wrong for the first time.
sh->private is set to NULL when and only when final == true, that means we are called from sdap_handle_destructor, so when immediately after that sdap_handle will be deleted by talloc.
So there is no chance of having sdap_handle where sdap_handle->private = NULL. I have set sdap_handle->private to NULL merely for housekeeping to completely break bi-link sdap_handle->sh <=> sdap_handle_data->shd.
Actually it looks like that sequence of actions during destruction is not clear. Well, it's common case for me. I take a lot of complicated staff for granted.
The sequence of actions in recursive destruction sequence: 1. sdap_process_result is called 2. sdap_handle_release(final = false) is called 3. conncb destroyed 3. op->callback is called 4. talloc_zfree(gsh) is called 5. sdap_handle_destroy is called 6. sdap_handle_release(final = true) is called 7. LDAP handle destroyed 8. sh->private = NULL, shd->sh = NULL 9. sdap_handle_release(final = true) returns 10. sdap_handle_destroy returns => 11. sdap_handle is destroyed by talloc (so there is no chance of dereferencing sh->private) 12. op->callback returns => 13. sdap_handle_data_release destroys sdap_handle_data 14. sdap_handle_release(final = false) returns 15. sdap_process_result returns
The sequence of actions in normal destruction sequence: 1. sdap_handle_destroy is called 2. sdap_handle_release(final = true) is called 3. conncb destroyed 4. LDAP handle destroyed 8. sh->private = NULL, shd->sh = NULL => 9. sdap_handle_data_release destroys sdap_handle_data NB that's why I set sh->private = NULL, here sdap_handle_data is destroyed first 10. sdap_handle_release(final = true) returns 11. sdap_handle_destroy returns => 12. sdap_handle is destroyed by talloc (so there is no chance of dereferencing sh->private)
Therefore setting both sh->private and shd->sh to NULL serves the purpose of marking relationship between them broken. So if any code needs to be executed after that (currently there is no such code) is could check whether sdap_handle_data is orphan or sdap_handle is almost destroyed so it does not even have inner sdap_handle_data object.
Regards, Eugene
On Thu, 15 Apr 2010 20:05:17 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
On 04/15/2010 07:16 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 17:58:18 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
On 04/15/2010 05:35 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 09:14:37 -0400 Simo Sorcessorce@redhat.com wrote:
I know the sdap_handle destructor() will end up caling it (through sdap_handle_release(), but then why all the games with the final argument ? Why not simply attach a destructor to sdap_handle_data that prevents destruction until it is safe ? Akso given we can prevent releasing sdap_handle too early through the destructor, why not simply apply all this to sdap_handle directly w/o requireing another sub-structure ?
Ah btw, while thinking about how you handle release of sdap_handle_data I think I found a case where you end up with some issues.
If you free sdap_handle while sdap_handle_data is locked sdap_handle_release will be called with final = ture, and this means sh->private is set to NULL. If later on sdap_handle_release is called again shd will be NULL (shd = sh->private) and sdap_handle_data_lock(shd); wiull derefernce a NULL pointer causing a segfault.
No, it will not. sdap_handle_data->sh is never dereferenced. :-) As I have written earlier it is simply a flag showing that sdap_handle_data still have a parent. When sdap_handle_data->sh is zero the structure deletes itself as soon as sdap_handle_data->destroy_lock is zero too (see sdap_handle_data_release).
Shall ever back pointer sdap_handle_data->sh be dereferenced, it needs to be checked for NULL before use. That's clear.
Sorry, I don't understand your comment, I never once mentioned sdap_handle_data->sh
Oops, I have got your concern wrong for the first time.
sh->private is set to NULL when and only when final == true, that means we are called from sdap_handle_destructor, so when immediately after that sdap_handle will be deleted by talloc.
So there is no chance of having sdap_handle where sdap_handle->private = NULL. I have set sdap_handle->private to NULL merely for housekeeping to completely break bi-link sdap_handle->sh <=> sdap_handle_data->shd.
Actually it looks like that sequence of actions during destruction is not clear. Well, it's common case for me. I take a lot of complicated staff for granted.
The sequence of actions in recursive destruction sequence:
- sdap_process_result is called
- sdap_handle_release(final = false) is called
- conncb destroyed
- op->callback is called
- talloc_zfree(gsh) is called
- sdap_handle_destroy is called
- sdap_handle_release(final = true) is called
- LDAP handle destroyed
- sh->private = NULL, shd->sh = NULL
- sdap_handle_release(final = true) returns
- sdap_handle_destroy returns
=> 11. sdap_handle is destroyed by talloc (so there is no chance of dereferencing sh->private) 12. op->callback returns => 13. sdap_handle_data_release destroys sdap_handle_data 14. sdap_handle_release(final = false) returns 15. sdap_process_result returns
The sequence of actions in normal destruction sequence:
- sdap_handle_destroy is called
- sdap_handle_release(final = true) is called
- conncb destroyed
- LDAP handle destroyed
- sh->private = NULL, shd->sh = NULL
=> 9. sdap_handle_data_release destroys sdap_handle_data NB that's why I set sh->private = NULL, here sdap_handle_data is destroyed first 10. sdap_handle_release(final = true) returns 11. sdap_handle_destroy returns => 12. sdap_handle is destroyed by talloc (so there is no chance of dereferencing sh->private)
Therefore setting both sh->private and shd->sh to NULL serves the purpose of marking relationship between them broken. So if any code needs to be executed after that (currently there is no such code) is could check whether sdap_handle_data is orphan or sdap_handle is almost destroyed so it does not even have inner sdap_handle_data object.
Thanks for posting the sequence, I managed to come up with it before though, and there is a case you are not considering.
You said you want to be resistant to random callback freeing the sdap_handle, so this is the sequence that still gives me trouble:
1. sdap_process_result is called 2. sdap_handle_release(final = false) is called 3. conncb destroyed 3. op->callback is called 4. talloc_zfree(gsh) is called 5. sdap_handle_destroy is called 6. sdap_handle_release(final = true) is called 7. netx op->callback is called 8. talloc_zfree(gsh) is called (2nd time .. not good)
I don't see how you prevent this in your code, given ctx->gsh is zeroed only when talloc_zfree() returns.
This is why I very much would prefer to avoid recursive loops within sdap_handle_release rather than just try to use locks to deal with them, and hence why the rule to never free ctx->gsh.
One way to do that is to prevent freeing ctx->gsh by returning -1 in the destructor if we are recursing and checking the talloc_free() return to decide whether to zero ctx->gsh or not depending on whether the free was successful.
Another way, maybe, is to NULL ctx->gsh before actually freeing it, but this again is a special rule on how to free ctx->gsh, so we can more easily just ban freeing ctx->gsh as I was trying to do in my patches.
...
ok, I guess another way is to make sure we loop over shd->ops only in the outermost sdap_handle_release() ... I guess we can do that using a flag to be set the first time we enter sdap_handle_release ...
Comments anyone ?
Simo.
On 04/15/2010 10:22 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 20:05:17 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
On 04/15/2010 07:16 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 17:58:18 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
On 04/15/2010 05:35 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 09:14:37 -0400 Simo Sorcessorce@redhat.com wrote:
I know the sdap_handle destructor() will end up caling it (through sdap_handle_release(), but then why all the games with the final argument ? Why not simply attach a destructor to sdap_handle_data that prevents destruction until it is safe ? Akso given we can prevent releasing sdap_handle too early through the destructor, why not simply apply all this to sdap_handle directly w/o requireing another sub-structure ?
Ah btw, while thinking about how you handle release of sdap_handle_data I think I found a case where you end up with some issues.
If you free sdap_handle while sdap_handle_data is locked sdap_handle_release will be called with final = ture, and this means sh->private is set to NULL. If later on sdap_handle_release is called again shd will be NULL (shd = sh->private) and sdap_handle_data_lock(shd); wiull derefernce a NULL pointer causing a segfault.
No, it will not. sdap_handle_data->sh is never dereferenced. :-) As I have written earlier it is simply a flag showing that sdap_handle_data still have a parent. When sdap_handle_data->sh is zero the structure deletes itself as soon as sdap_handle_data->destroy_lock is zero too (see sdap_handle_data_release).
Shall ever back pointer sdap_handle_data->sh be dereferenced, it needs to be checked for NULL before use. That's clear.
Sorry, I don't understand your comment, I never once mentioned sdap_handle_data->sh
Oops, I have got your concern wrong for the first time.
sh->private is set to NULL when and only when final == true, that means we are called from sdap_handle_destructor, so when immediately after that sdap_handle will be deleted by talloc.
So there is no chance of having sdap_handle where sdap_handle->private = NULL. I have set sdap_handle->private to NULL merely for housekeeping to completely break bi-link sdap_handle->sh <=> sdap_handle_data->shd.
Actually it looks like that sequence of actions during destruction is not clear. Well, it's common case for me. I take a lot of complicated staff for granted.
The sequence of actions in recursive destruction sequence:
- sdap_process_result is called
- sdap_handle_release(final = false) is called
- conncb destroyed
- op->callback is called
- talloc_zfree(gsh) is called
- sdap_handle_destroy is called
- sdap_handle_release(final = true) is called
- LDAP handle destroyed
- sh->private = NULL, shd->sh = NULL
- sdap_handle_release(final = true) returns
- sdap_handle_destroy returns
=> 11. sdap_handle is destroyed by talloc (so there is no chance of dereferencing sh->private) 12. op->callback returns => 13. sdap_handle_data_release destroys sdap_handle_data 14. sdap_handle_release(final = false) returns 15. sdap_process_result returns
The sequence of actions in normal destruction sequence:
- sdap_handle_destroy is called
- sdap_handle_release(final = true) is called
- conncb destroyed
- LDAP handle destroyed
- sh->private = NULL, shd->sh = NULL
=> 9. sdap_handle_data_release destroys sdap_handle_data NB that's why I set sh->private = NULL, here sdap_handle_data is destroyed first 10. sdap_handle_release(final = true) returns 11. sdap_handle_destroy returns => 12. sdap_handle is destroyed by talloc (so there is no chance of dereferencing sh->private)
Therefore setting both sh->private and shd->sh to NULL serves the purpose of marking relationship between them broken. So if any code needs to be executed after that (currently there is no such code) is could check whether sdap_handle_data is orphan or sdap_handle is almost destroyed so it does not even have inner sdap_handle_data object.
Thanks for posting the sequence, I managed to come up with it before though, and there is a case you are not considering.
You said you want to be resistant to random callback freeing the sdap_handle, so this is the sequence that still gives me trouble:
- sdap_process_result is called
- sdap_handle_release(final = false) is called
- conncb destroyed
- op->callback is called
- talloc_zfree(gsh) is called
- sdap_handle_destroy is called
- sdap_handle_release(final = true) is called
- netx op->callback is called
- talloc_zfree(gsh) is called (2nd time .. not good)
I don't see how you prevent this in your code, given ctx->gsh is zeroed only when talloc_zfree() returns.
This problem was one of major issues when I have started to work on failover reconnect problem.
The complete set of my patches addressed the issue by implementing reference counting of connection usage (see https://fedorahosted.org/pipermail/sssd-devel/2010-April/003113.html, patch #4). When reference counting is in place, sdap_handle is not destroyed until all operations complete. So inner sdap_handle_release(final = true) calls no callbacks.
So after I suggested patch #3 as a separate fix for "early free of sdap_handle" problem, I have looked into this issue again and found the deficiency in talloc_zfree you have mentioned above. The I have developed a simple patch to fix it, but was holding it back for a while waiting for this discussion to calm down.
Now I am sending this patch in separate e-mail for approval.
This is why I very much would prefer to avoid recursive loops within sdap_handle_release rather than just try to use locks to deal with them, and hence why the rule to never free ctx->gsh.
I agree that it would be good to prevent or at least limit recursion of callbacks, but unfortunately as callbacks are called from destructor it is not possible to resolve within this patch.
IMHO it is very bad policy to call callbacks from destructor. My patch #4 resolves the issue by introducing reference counting of sdap_handle.
One way to do that is to prevent freeing ctx->gsh by returning -1 in the destructor if we are recursing and checking the talloc_free() return to decide whether to zero ctx->gsh or not depending on whether the free was successful.
Another way, maybe, is to NULL ctx->gsh before actually freeing it, but this again is a special rule on how to free ctx->gsh, so we can more easily just ban freeing ctx->gsh as I was trying to do in my patches.
...
ok, I guess another way is to make sure we loop over shd->ops only in the outermost sdap_handle_release() ... I guess we can do that using a flag to be set the first time we enter sdap_handle_release ...
Comments anyone ?
It is not possible to call callback in outer sdap_handle_release(final = false), as all callbacks must complete before sdap_handle is destroyed (that is sdap_handle_release(final = true) completes).
On the other hand it is an absolute must to prevent double destruction of sdap_handle. This could be achieved by fixing talloc_zfree. Therefore in worst case the recursion is limited by to levels: - sdap_handle_release(final = false) - sdap_handle_release(final = true)
The third level of sdap_handle_release(final = true) would mean double destruction, that is obviously a disaster.
Eugene
On Fri, 16 Apr 2010 10:02:08 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
I agree that it would be good to prevent or at least limit recursion of callbacks, but unfortunately as callbacks are called from destructor it is not possible to resolve within this patch.
Well, I'll think about that.
IMHO it is very bad policy to call callbacks from destructor. My patch #4 resolves the issue by introducing reference counting of sdap_handle.
calling callbacks from a destructor is perfectly fine, it is the method used to clean-up and reply to callers if something went wrong, I don't see anything bad with that.
What is important is that callbacks do not attempt to recurse (starting new requests) when there are errors. This is a general implicit rule of the tevent model anyway, not something we are making up here for this case.
It is not possible to call callback in outer sdap_handle_release(final = false), as all callbacks must complete before sdap_handle is destroyed (that is sdap_handle_release(final = true) completes).
On the other hand it is an absolute must to prevent double destruction of sdap_handle. This could be achieved by fixing talloc_zfree. Therefore in worst case the recursion is limited by to levels: - sdap_handle_release(final = false) - sdap_handle_release(final = true)
The third level of sdap_handle_release(final = true) would mean double destruction, that is obviously a disaster.
Ok, let me try to summarized what is the problem and what we need to do in abstract here, because we are going down a rat hole of exceptions and I think we are loosing the point.
There are only 2 ways for sdap_handle_release() to be called. 1. from sdap_process_result() in case the ldap handler returns a fatal error. 2. from sdap_handle_destructor when someone tries to free the handle.
In any case what we want is to signal all pending requests that they need to terminate either because we want to close the connection immediately or because there was an error.
In any case, ideally we never want sdap_handle_release to recurse, because we are already cleaning up and recursing causes issues.
The only difference between the 2 patch to get to sdap_handle_release is that if the destructor calls it we want the handle to be freed at the end of the outermost loop. While if we call it only from sdap_process_result we do not want to free it because other functions may still reference ctx->gsh, so we just want to release everythin but the memory context itself and only to avoid segfault if callers try to dereference it. But no caller will ever use the handle anymore because sh->connected is false.
Is there anything I am missing here ?
Simo.
On 04/16/2010 06:17 PM, Simo Sorce wrote:
On Fri, 16 Apr 2010 10:02:08 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
I agree that it would be good to prevent or at least limit recursion of callbacks, but unfortunately as callbacks are called from destructor it is not possible to resolve within this patch.
Well, I'll think about that.
IMHO it is very bad policy to call callbacks from destructor. My patch #4 resolves the issue by introducing reference counting of sdap_handle.
calling callbacks from a destructor is perfectly fine, it is the method used to clean-up and reply to callers if something went wrong, I don't see anything bad with that.
Well, it is possible not to call callbacks from destructor. The technique is implemented in my patch #4: 1. Each active operation holds a reference on sdap_handle (or on sdap_id_conn_data to be precise) 2. When connection breaks sdap_handle_release(final = false) starts calling callbacks 3. Within each callback a reference on sdap_handle is dropped 4. Within last callback the last reference on sdap_handle is dropped 5. sdap_handle is destroyed 6. sdap_handle_release(final = true) is called, but there is no ops to notify
This way it is possible to avoid calling callbacks from destructor.
What is important is that callbacks do not attempt to recurse (starting new requests) when there are errors. This is a general implicit rule of the tevent model anyway, not something we are making up here for this case.
It is not possible to call callback in outer sdap_handle_release(final = false), as all callbacks must complete before sdap_handle is destroyed (that is sdap_handle_release(final = true) completes).
On the other hand it is an absolute must to prevent double destruction of sdap_handle. This could be achieved by fixing talloc_zfree. Therefore in worst case the recursion is limited by to levels: - sdap_handle_release(final = false) - sdap_handle_release(final = true)
The third level of sdap_handle_release(final = true) would mean double destruction, that is obviously a disaster.
Ok, let me try to summarized what is the problem and what we need to do in abstract here, because we are going down a rat hole of exceptions and I think we are loosing the point.
There are only 2 ways for sdap_handle_release() to be called.
- from sdap_process_result() in case the ldap handler returns a fatal
error. 2. from sdap_handle_destructor when someone tries to free the handle.
In any case what we want is to signal all pending requests that they need to terminate either because we want to close the connection immediately or because there was an error.
In any case, ideally we never want sdap_handle_release to recurse, because we are already cleaning up and recursing causes issues.
The only difference between the 2 patch to get to sdap_handle_release is that if the destructor calls it we want the handle to be freed at the end of the outermost loop. While if we call it only from sdap_process_result we do not want to free it because other functions may still reference ctx->gsh, so we just want to release everythin but the memory context itself and only to avoid segfault if callers try to dereference it. But no caller will ever use the handle anymore because sh->connected is false.
Is there anything I am missing here ?
Looks correct. Being a bit paranoid, I would like to state is in my own words.
There is 3 possible scenarios: 1. Only talloc_zfree(gsh) is called: - sdap_handle_release(final = true) is called - operation callbacks are called - LDAP handle is deallocated - sdap_handle_data is deallocated - sdap_handle is deallocated
2. Only sdap_process_result is called: - sdap_handle_release(final = true) is called - operation callbacks are called - LDAP handle is deallocated - sdap_handle and sdap_handle_data live waiting for talloc_zfree to free them
This scenario takes place when there is no active op, but cached connection breaks.
3. Recursion: - sdap_handle_release(final = false) is called - first operation callback is called - ... - talloc_zfree(gsh) is called - sdap_handle_release(final = true) is called - further operation callbacks are called - LDAP handle is deallocated - sdap_handle is deallocated - ... - the control returns to sdap_handle_release(final = false) - sdap_handle_data is freed
So the difference between paths: 1. If no talloc_zfree is called, sdap_handle and sdap_handle_data survive, but otherwise connection is closed. 2. talloc_zfree destroys everything except sdap_handle_data if there is an outer sdap_handle_release(final = false) call. Then outer call to sdap_handle_release releases sdap_handle_data.
Looks equivalent to what you have written, a bit more verbose may be.
Eugene
On Fri, 16 Apr 2010 19:00:57 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
Looks equivalent to what you have written, a bit more verbose may be.
Ok in this case I think your patch is overkill. All we really need is to use the destructor to do what it is meant for, and just prevent freeing sdap_handle if we are recursing, and at the same time use a lock so that only one sdap_handle_release (the ""outermost"" does the job).
Attached find a patch.
Note that the patch still includes my old changes to avoid freeing ctx->gsh, they are *not* necessary, but I think they are good hygiene anyway so I left them in.
The only changes really necessary are the ones to add the 2 booleans to sdap_handle in sdap.h and the changes to sdap_handle_destructor() and sdap_handle_release() in sdap_async.c
These changes are much less invasive than the patch you propose and it looks to me they provide for the same outcome.
Of course comments are welcome.
If someone want to see this core change split from the other cosmetic changes to move freeing ctx->gsh and friends, let me know.
Simo.
On 04/16/2010 10:22 PM, Simo Sorce wrote:
On Fri, 16 Apr 2010 19:00:57 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
Looks equivalent to what you have written, a bit more verbose may be.
Ok in this case I think your patch is overkill. All we really need is to use the destructor to do what it is meant for, and just prevent freeing sdap_handle if we are recursing, and at the same time use a lock so that only one sdap_handle_release (the ""outermost"" does the job).
Attached find a patch.
Very clever solution. Actually I am pretty much surprised by talloc library. Allowing an object to cancel its own destruction is funny and really unexpected feature.
Note that the patch still includes my old changes to avoid freeing ctx->gsh, they are *not* necessary, but I think they are good hygiene anyway so I left them in.
I think that these changes need to go to separate patch. Look for further discussion below.
The only changes really necessary are the ones to add the 2 booleans to sdap_handle in sdap.h and the changes to sdap_handle_destructor() and sdap_handle_release() in sdap_async.c
These changes are much less invasive than the patch you propose and it looks to me they provide for the same outcome.
Of course comments are welcome.
I think that several more or less cosmetic additions could be of some value, so I have created my own version of the patch. I know that this looks more or less as not-invented-here syndrome, so you could discard the patch altogether without looking into it. I have attached it merely to illustrate the following points that seem valuable to me:
1. Better names for sh->locked and sh->release respectively could be sh->being_disconnected and sh->orphan. Some additional comments would also be of some help to future maintainers.
2. A new method sdap_handle_disconnect should be used to forcibly disconnect sdap_handle. Setting sh->connected = false (as in full version of patch) is no good as it leaves the LDAP connection open for indeterminate period of time and violates object encapsulation.
3. sdap_handle_release should unregister sh->connb from LDAP prior to deleting it. Although NULL pointer check is done LDAP callback, it's good policy to remove callbacks when no longer needed: ldap_get_option(sh->ldap, LDAP_OPT_CONNECT_CB, sh->conncb);
4. It's bad policy to destroy memory allocated and owned by other code (in sdap_handle_release): if (op == sh->ops) talloc_free(op); So I suggest to use early call of sdap_op destructor instead of deallocation (see my patch)
//-------------------------------------
So now we are ready to continue with fixing failover reconnect and GSSAPI authentication in LDAP and IPA providers. From my point of view at least the following problems needs to be addressed by final solution:
1. When two (or more) BE requests are executed in parallel and there is no cached connection, only one LDAP connection should be established. In current implementation 2 connections will be established and the first one killed failing the operation that connected first.
2. When OFFLINE state is detected during request execution (there were cached connection, but all failover servers failed to connect during request execution), the backend must return DP_ERR_OFFLINE. It currently returns DP_ERR_FATAL with EIO error. Next request completes with DP_ERR_OFFLINE. So there is a big inconsistency in behaviour.
3. It is essential to close LDAP connection before GSSAPI ticket is expired as closing connection with already expired ticket still writes a message in message log.
4. The about-to-expire connection should be closed gracefully: all requests already in progress and using the connection should be completed, new requests should establish and use new connection.
5. ipa_access backend should also use failover retries.
6. I think it is essential to reduce amount of copy-paste code handling LDAP connect/reconnect code. My strong opinion is that a special mechanism for handling LDAP connect/retry logic is required.
Eugene
On Mon, 19 Apr 2010 15:19:22 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
On 04/16/2010 10:22 PM, Simo Sorce wrote:
On Fri, 16 Apr 2010 19:00:57 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
Looks equivalent to what you have written, a bit more verbose may be.
Ok in this case I think your patch is overkill. All we really need is to use the destructor to do what it is meant for, and just prevent freeing sdap_handle if we are recursing, and at the same time use a lock so that only one sdap_handle_release (the ""outermost"" does the job).
Attached find a patch.
Very clever solution. Actually I am pretty much surprised by talloc library. Allowing an object to cancel its own destruction is funny and really unexpected feature.
If you think hard about it, it actually makes a lot of sense to avoid and prevents recursion, which can never be a good thing when freeing resources.
Note that the patch still includes my old changes to avoid freeing ctx->gsh, they are *not* necessary, but I think they are good hygiene anyway so I left them in.
I think that these changes need to go to separate patch. Look for further discussion below.
Ok I split the patch in 2 parts.
The only changes really necessary are the ones to add the 2 booleans to sdap_handle in sdap.h and the changes to sdap_handle_destructor() and sdap_handle_release() in sdap_async.c
These changes are much less invasive than the patch you propose and it looks to me they provide for the same outcome.
Of course comments are welcome.
I think that several more or less cosmetic additions could be of some value, so I have created my own version of the patch. I know that this looks more or less as not-invented-here syndrome, so you could discard the patch altogether without looking into it. I have attached it merely to illustrate the following points that seem valuable to me:
Ok I took some advice, but my patch remains mostly unchanged.
- Better names for sh->locked and sh->release respectively could be
sh->being_disconnected and sh->orphan. Some additional comments would also be of some help to future maintainers.
I agree the names where not the best I could think, but I prefer to give names that represent the function of the variable rather than what is supposedly going to happen when the variable is true.
I renamed lock -> destructor_lock, so it is clear what it is locking.
I also renamed release -> release_memory, orphan doesn't fit as memory is not really orphan of anything, it is still child of whatever parent until freed, but we want to signal that memory needs be released once we are done.
- A new method sdap_handle_disconnect should be used to forcibly
disconnect sdap_handle. Setting sh->connected = false (as in full version of patch) is no good as it leaves the LDAP connection open for indeterminate period of time and violates object encapsulation.
I' leave this for a following patch if necessary.
- sdap_handle_release should unregister sh->connb from LDAP prior to
deleting it. Although NULL pointer check is done LDAP callback, it's good policy to remove callbacks when no longer needed: ldap_get_option(sh->ldap, LDAP_OPT_CONNECT_CB, sh->conncb);
This is now taken care of automatically with latest Sumit patches, thanks for pointing it out.
- It's bad policy to destroy memory allocated and owned by other
code (in sdap_handle_release): if (op == sh->ops) talloc_free(op); So I suggest to use early call of sdap_op destructor instead of deallocation (see my patch)
In this case although the memory is child of the original requests, the semantic owner is the sdap_handle(), so it is eantirely appropriate to free the operations if sdap_handle is being freed, for the simple reason that the "technical" owner has no access to those objects and they would be just hanging memory to their requests.
I propose the attached patches for inclusion in master. I'd like to divert any modification that is not an intrinsic bug in the patches to later patches and a separate thread so we can move on :)
Simo.
On 04/29/2010 09:47 PM, Simo Sorce wrote:
On Mon, 19 Apr 2010 15:19:22 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
On 04/16/2010 10:22 PM, Simo Sorce wrote:
On Fri, 16 Apr 2010 19:00:57 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
Looks equivalent to what you have written, a bit more verbose may be.
Ok in this case I think your patch is overkill. All we really need is to use the destructor to do what it is meant for, and just prevent freeing sdap_handle if we are recursing, and at the same time use a lock so that only one sdap_handle_release (the ""outermost"" does the job).
Attached find a patch.
Very clever solution. Actually I am pretty much surprised by talloc library. Allowing an object to cancel its own destruction is funny and really unexpected feature.
If you think hard about it, it actually makes a lot of sense to avoid and prevents recursion, which can never be a good thing when freeing resources.
Sure it's convenient. I just say it is unusual, at least for seasoned C++ programmer.
Note that the patch still includes my old changes to avoid freeing ctx->gsh, they are *not* necessary, but I think they are good hygiene anyway so I left them in.
I think that these changes need to go to separate patch. Look for further discussion below.
Ok I split the patch in 2 parts.
0002-Avoid-freeing-sdap_handle-too-early.patch is mostly OK. I have only one concern with it, see below.
0001-Better-handle-sdap_handle-memory-from-callers.patch, from my point of view, should be dropped or integrated into later patch addressing failover reconnect problem as: - The patch itself does not solve failover reconnect problems (it just adds a bit of robustness that is good, but is not sufficient) - 0002-Avoid-freeing-sdap_handle-too-early.patch fixes LDAP backend crash on its own and therefore failover reconnect problem could wait for compete solution.
<...>
- It's bad policy to destroy memory allocated and owned by other
code (in sdap_handle_release): if (op == sh->ops) talloc_free(op); So I suggest to use early call of sdap_op destructor instead of deallocation (see my patch)
In this case although the memory is child of the original requests, the semantic owner is the sdap_handle(), so it is eantirely appropriate to free the operations if sdap_handle is being freed, for the simple reason that the "technical" owner has no access to those objects and they would be just hanging memory to their requests.
I still disagree with this approach. The memory must be deleted only by owner. If it is not deleted after callback completes, sdap_handle_release must assume that it is still needed for some purpose. Destroying memory that is not owned is a sure way to problems.
Nevertheless in current sdap_handle usage, sdap_op is __always__ deleted by owner. So it would make no major harm if you stay with your point of view. It actually does not matter in current state of development.
Eugene
On Thu, Apr 29, 2010 at 01:47:42PM -0400, Simo Sorce wrote:
On Mon, 19 Apr 2010 15:19:22 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
On 04/16/2010 10:22 PM, Simo Sorce wrote:
On Fri, 16 Apr 2010 19:00:57 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
Looks equivalent to what you have written, a bit more verbose may be.
Ok in this case I think your patch is overkill. All we really need is to use the destructor to do what it is meant for, and just prevent freeing sdap_handle if we are recursing, and at the same time use a lock so that only one sdap_handle_release (the ""outermost"" does the job).
Attached find a patch.
Very clever solution. Actually I am pretty much surprised by talloc library. Allowing an object to cancel its own destruction is funny and really unexpected feature.
If you think hard about it, it actually makes a lot of sense to avoid and prevents recursion, which can never be a good thing when freeing resources.
Note that the patch still includes my old changes to avoid freeing ctx->gsh, they are *not* necessary, but I think they are good hygiene anyway so I left them in.
I think that these changes need to go to separate patch. Look for further discussion below.
Ok I split the patch in 2 parts.
The only changes really necessary are the ones to add the 2 booleans to sdap_handle in sdap.h and the changes to sdap_handle_destructor() and sdap_handle_release() in sdap_async.c
These changes are much less invasive than the patch you propose and it looks to me they provide for the same outcome.
Of course comments are welcome.
I think that several more or less cosmetic additions could be of some value, so I have created my own version of the patch. I know that this looks more or less as not-invented-here syndrome, so you could discard the patch altogether without looking into it. I have attached it merely to illustrate the following points that seem valuable to me:
Ok I took some advice, but my patch remains mostly unchanged.
- Better names for sh->locked and sh->release respectively could be
sh->being_disconnected and sh->orphan. Some additional comments would also be of some help to future maintainers.
I agree the names where not the best I could think, but I prefer to give names that represent the function of the variable rather than what is supposedly going to happen when the variable is true.
I renamed lock -> destructor_lock, so it is clear what it is locking.
I also renamed release -> release_memory, orphan doesn't fit as memory is not really orphan of anything, it is still child of whatever parent until freed, but we want to signal that memory needs be released once we are done.
- A new method sdap_handle_disconnect should be used to forcibly
disconnect sdap_handle. Setting sh->connected = false (as in full version of patch) is no good as it leaves the LDAP connection open for indeterminate period of time and violates object encapsulation.
I' leave this for a following patch if necessary.
- sdap_handle_release should unregister sh->connb from LDAP prior to
deleting it. Although NULL pointer check is done LDAP callback, it's good policy to remove callbacks when no longer needed: ldap_get_option(sh->ldap, LDAP_OPT_CONNECT_CB, sh->conncb);
This is now taken care of automatically with latest Sumit patches, thanks for pointing it out.
- It's bad policy to destroy memory allocated and owned by other
code (in sdap_handle_release): if (op == sh->ops) talloc_free(op); So I suggest to use early call of sdap_op destructor instead of deallocation (see my patch)
In this case although the memory is child of the original requests, the semantic owner is the sdap_handle(), so it is eantirely appropriate to free the operations if sdap_handle is being freed, for the simple reason that the "technical" owner has no access to those objects and they would be just hanging memory to their requests.
I propose the attached patches for inclusion in master. I'd like to divert any modification that is not an intrinsic bug in the patches to later patches and a separate thread so we can move on :)
Simo.
-- Simo Sorce * Red Hat, Inc * New York
ACK to both.
I have seen Eugene comments on the patches and do not want to argure that these patches make the code only cleaner and more robust. But I think these improvements are still worth to be added to the current code base and some of them might get lost if we postponed them to a later patch.
bye, Sumit
On 05/03/2010 05:21 AM, Sumit Bose wrote:
ACK to both.
Pushed to master and sssd-1-2
On Fri, 16 Apr 2010 10:02:08 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
You said you want to be resistant to random callback freeing the sdap_handle, so this is the sequence that still gives me trouble:
- sdap_process_result is called
- sdap_handle_release(final = false) is called
- conncb destroyed
- op->callback is called
- talloc_zfree(gsh) is called
- sdap_handle_destroy is called
- sdap_handle_release(final = true) is called
- netx op->callback is called
- talloc_zfree(gsh) is called (2nd time .. not good)
I don't see how you prevent this in your code, given ctx->gsh is zeroed only when talloc_zfree() returns.
This problem was one of major issues when I have started to work on failover reconnect problem.
The complete set of my patches addressed the issue by implementing reference counting of connection usage (see https://fedorahosted.org/pipermail/sssd-devel/2010-April/003113.html, patch #4). When reference counting is in place, sdap_handle is not destroyed until all operations complete. So inner sdap_handle_release(final = true) calls no callbacks.
So after I suggested patch #3 as a separate fix for "early free of sdap_handle" problem, I have looked into this issue again and found the deficiency in talloc_zfree you have mentioned above. The I have developed a simple patch to fix it, but was holding it back for a while waiting for this discussion to calm down.
Now I am sending this patch in separate e-mail for approval.
Ah btw, I re-checked this with Sumit and it turns out that calling talloc_free(X) within the destructor of X is not really a problem and will not cause a recursion. Talloc will just return an not run the destructor again.
So the patch to talloc_zfree() is strictly not necessary, although it probably does no harm either so I'm ok with it.
Simo.
On 04/15/2010 05:14 PM, Simo Sorce wrote:
On Thu, 15 Apr 2010 10:13:46 +0400 Eugene Indenbomeindenbom@gmail.com wrote:
On 04/13/2010 08:51 PM, Simo Sorce wrote:
This patch attempts to address the problem Eugene found with sdap_handle_release()
Simo.
I have found 2 critical problems in Simo's patch:
- After setting sdap_handle->connected = false, LDAP handle is never
destroyed. (see sdap_handle_release). So actually patch introduces huge resource leak.
Ahh, thanks for catching this one, I forgot to remove the sh->connected check in sdap_handle_release()
- Despite all efforts not to free sdap_handle from callbacks it is
still freed when connect operation fails (together with connection tevent_req). So the following sequence still reproduces the bug: a. Using debugger break in sdap_cli_rootdse_step b. Restart LDAP server or network connection leading to it c. Resume execution d. Observe crash
This looks like a separate issue. Within the connect functions we are not using the global sdap_handle. I guess the issue here is that until we return from the connect request, the sh is a child of the request memory hierarchy, so when we call the callback within sdap_handle_release() we end up freeing the hierarchy.
Do you confirm ?
Yes.
Actually I still do not understand why it is feasible instead of fixing destruction of sdap_handle in callback to invent a workaround trying to avoid doing so.
I am not sure why you call it a workaround, it is just a different solution.
The bug in destruction of sdap_handle in callback violates basic programming principles: - Every creatable object must by destructible (or in case of reference counted object it must be possible to release ownership/reference from object)
This is not really true in a program based on talloc, as you can see a talloc hierarchy as a set of references. We do indeed almost exclusively base our destruction sequences on talloc hierarchies. The only case where this is not entirely possible is indeed the interface code between the program and the ldap libraries (for obvious reasons), and I want to keep the difference restricted in an area as small as possible. One of the reasons I do not like your approach is that the talloc hierarchy become very fuzzy.
I agree to this principle, but some exceptions must be allowed at least for cases like this when we need to delay destruction of structure holding head of callback list.
NB In my patch only sdap_handle_data memory block is allocated on NULL context. All others are allocated on talloc hierarchy. It is absolutely required as this memory block is used by callback iteration. Therefore it is owned not only by sdap_handle itself, but by each stack frame iterating over callbacks (sdap_handle->ops member). This stack frame ownership is taken into account by destroy_lock, simple reference counting variable. There is no room to screw the things up here.
The whole technique is stock and widely used design pattern. I really do not understand why you are so reluctant to accept it. It's much simpler and cheaper in future maintenance than crasy rules on when it is allowed and when it is not allowed to destroy sdap_handle.
- There should be no restriction on when method startingdestruction is called
Not sure where you got this, but it is certainly not true :) You can't destroy objects in use. If you are thinking about deferred destruction that is what we want to achieve but using correct talloc hierarchies as much as possible. Unfortunately this is not the case, because of the interfcae we are forced into with the openldap libraries.
If object is busy (like in our case) it simply starts destruction and completes it as soon as all dependent operations complete and it is not busy any longer.
- After object destruction is started it is object'sresponsibility to complete destruction as soon as possible. There should not be any tricks like keeping an object in garbage collecting queue.
Unfortunately this is not possible as well, because pending requests all reference a common object (the one that ultimately holds the ldap handler).
Yes, it is possible, and more over implemented in my patch. The only memory block left over and destroyed later after talloc_zfree is sdap_handle_data. All handles, file descriptors and in particular LDAP handle are destroyed or closed.
Violation of the above principles calls for __path__ coverage analysis of code as we have to ensure that no __path__ of code destructs object at wrong time. Even if Simo fixes the above problems there will be no guarantee that sdap_handle destruction is never called from callback.
We just need to guarantee this, I will check again if I can steal some code from your patch though :)
It's hopeless. There always will be a code path we overlook.
Regards, Eugene
PS Although my patch never sets sdap_handle->connected to false, I have attached an amendment to my patch to improve robustness of destruction code when somebody really sets sdap_handle->connected to false.
PPS Even without this amendment no resource leaks exist in my patch as all memory and callbacks are a bit later destroyed through talloc ownership.
Given you alloc private on NULL, you can hardly claim it is released through a talloc hierarchy, but it is explicitly freed in sdap_handle_data_release().
Yes, I claim that my patch follow the talloc hierarchy. The only exception is sdap_handle_data memory block, which is freed when stack unwinds to top-level callback iterator. The exception is well defined and managed by 2 simple methods (sdap_handle_data_lock/sdap_handle_data_release).
I know the sdap_handle destructor() will end up caling it (through sdap_handle_release(), but then why all the games with the final argument ? Why not simply attach a destructor to sdap_handle_data that prevents destruction until it is safe ?
Because, sdap_handle_destructor and sdap_handle_release return __before__ sdap_handle_data is freed. All other objects are freed (sdap_handle itself, LDAP handle, callback data) and all aborted operations are notified. Only this sdap_handle_data memory block survives the destructor, because it is needed to signal end of callback iteration to outer sdap_handle_release.
The final argument signals that sdap_handle_data needs to be detached from parent sdap_handle and go into orphan state. In orphan state is self-destroys as soon as stack unwinds to top-level sdap_handle_release.
Akso given we can prevent releasing sdap_handle too early through the destructor, why not simply apply all this to sdap_handle directly w/o requireing another sub-structure ?
Because: 1. sdap_handle is not an opaque data structure. It's pointer is known to outside code which can accidentally call talloc_free. It's easier to allow calling talloc_free than to find all places where it is called. It also reduces amount of changed legacy code.
2. I want to have regular talloc parent for all other sub-objects of sdap_handle. This way we exclude from hierarchy only one memory block. This makes the solution more simple and robust.
Eugene
On Thu, 15 Apr 2010 17:43:46 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
Because:
- sdap_handle is not an opaque data structure. It's pointer is known
to outside code which can accidentally call talloc_free. It's easier to allow calling talloc_free than to find all places where it is called. It also reduces amount of changed legacy code.
- I want to have regular talloc parent for all other sub-objects of
sdap_handle. This way we exclude from hierarchy only one memory block. This makes the solution more simple and robust.
Ok, you almost convinced me.
I will take another look at your patch and see if we can use it as a base point.
Simo.
sssd-devel@lists.fedorahosted.org