This is very ugly hotfix for ticket: https://fedorahosted.org/sssd/ticket/2018
So far we were not able to find out why the slot or name_ptr values were corrupted, but this should at least prevent segfaults.
Comments are highly appreciated.
Thanks Michal
On Mon, 2013-08-05 at 21:13 +0200, Michal Židek wrote:
This is very ugly hotfix for ticket: https://fedorahosted.org/sssd/ticket/2018
So far we were not able to find out why the slot or name_ptr values were corrupted, but this should at least prevent segfaults.
NACK, (but close)
Please do not add all those comments about removing checks, those checks are good to stay forever, the cache may be corrupted by a bad disk sector or whatever, so they should never be removed.
Also please do not just return ENOENT, and pretend nothing happened. If the cache is corrupted we want to take immediate corrective action. I suggest we return a new SSSD error and in the topmost nsscache caller we invalidate the current cache and reinit.
Optional (may be should be a separate ticket) We currently reinit by closing the file and creating a new one. We should not do that, we should rather reset the header and then just zero all the data and reinitialize all the various areas. The reason is that if we have some bad bug and we keep creating new files we might run out of space because clients may have old files open and mmapped, and unless they perform a new operations, they may not release them. Note we should *not* use ftruncate() here, just write the appropriate initialization values everywhere needed.
Simo.
On Mon, Aug 05, 2013 at 03:30:43PM -0400, Simo Sorce wrote:
Optional (may be should be a separate ticket) We currently reinit by closing the file and creating a new one. We should not do that, we should rather reset the header and then just zero all the data and reinitialize all the various areas. The reason is that if we have some bad bug and we keep creating new files we might run out of space because clients may have old files open and mmapped, and unless they perform a new operations, they may not release them. Note we should *not* use ftruncate() here, just write the appropriate initialization values everywhere needed.
Thanks for the review Simo.
I would prefer this to be at least a separate commit, but preferably even a separate ticket simply for the reason that any change outside a simple patching might introduce regressions. Especially when it comes to the memcache, which is mapped to the process's memory, I think we should be double paranoid.
On 08/05/2013 09:30 PM, Simo Sorce wrote:
On Mon, 2013-08-05 at 21:13 +0200, Michal Židek wrote:
This is very ugly hotfix for ticket: https://fedorahosted.org/sssd/ticket/2018
So far we were not able to find out why the slot or name_ptr values were corrupted, but this should at least prevent segfaults.
NACK, (but close)
Please do not add all those comments about removing checks, those checks are good to stay forever, the cache may be corrupted by a bad disk sector or whatever, so they should never be removed.
Ok. I left just one FIXME (actually two, but related) label in the case where this patch relies on the same offset values of strs in both group and passwd data structure. I do not expect these structures to be changed, but it is something we want to have at least noticed in the code.
Also please do not just return ENOENT, and pretend nothing happened. If the cache is corrupted we want to take immediate corrective action. I suggest we return a new SSSD error and in the topmost nsscache caller we invalidate the current cache and reinit.
Actually I started doing it like this, but handling the problem in the topmost caller is not very good idea here. When I was writing it I found several topmost callers and even forcing the error to bubble up to these callers required some api changes. In addition we would have to make sure we handle the corrupted memory cache case in the future, when we use the affected functions in other topmost callers (it is not so easy to track). So I decided to handle it on the place where we detect the error.
Optional (may be should be a separate ticket) We currently reinit by closing the file and creating a new one. We should not do that, we should rather reset the header and then just zero all the data and reinitialize all the various areas. The reason is that if we have some bad bug and we keep creating new files we might run out of space because clients may have old files open and mmapped, and unless they perform a new operations, they may not release them. Note we should *not* use ftruncate() here, just write the appropriate initialization values everywhere needed.
I agree, but sss_mmap_cache_reinit is completely unsuitable for this case and a huge overkill, even if we might change it in the future. I created a new function sss_mmap_cache_reset which is very lightweight. I will not create it as a separate ticket since I do not want to use the reinit function even temporary for this case.
New patch is attached.
Thanks Michal
On Wed, Aug 07, 2013 at 06:38:47PM +0200, Michal Židek wrote:
On 08/05/2013 09:30 PM, Simo Sorce wrote:
On Mon, 2013-08-05 at 21:13 +0200, Michal Židek wrote:
This is very ugly hotfix for ticket: https://fedorahosted.org/sssd/ticket/2018
So far we were not able to find out why the slot or name_ptr values were corrupted, but this should at least prevent segfaults.
NACK, (but close)
Please do not add all those comments about removing checks, those checks are good to stay forever, the cache may be corrupted by a bad disk sector or whatever, so they should never be removed.
Ok. I left just one FIXME (actually two, but related) label in the case where this patch relies on the same offset values of strs in both group and passwd data structure. I do not expect these structures to be changed, but it is something we want to have at least noticed in the code.
Also please do not just return ENOENT, and pretend nothing happened. If the cache is corrupted we want to take immediate corrective action. I suggest we return a new SSSD error and in the topmost nsscache caller we invalidate the current cache and reinit.
Actually I started doing it like this, but handling the problem in the topmost caller is not very good idea here. When I was writing it I found several topmost callers and even forcing the error to bubble up to these callers required some api changes. In addition we would have to make sure we handle the corrupted memory cache case in the future, when we use the affected functions in other topmost callers (it is not so easy to track). So I decided to handle it on the place where we detect the error.
Optional (may be should be a separate ticket) We currently reinit by closing the file and creating a new one. We should not do that, we should rather reset the header and then just zero all the data and reinitialize all the various areas. The reason is that if we have some bad bug and we keep creating new files we might run out of space because clients may have old files open and mmapped, and unless they perform a new operations, they may not release them. Note we should *not* use ftruncate() here, just write the appropriate initialization values everywhere needed.
I agree, but sss_mmap_cache_reinit is completely unsuitable for this case and a huge overkill, even if we might change it in the future. I created a new function sss_mmap_cache_reset which is very lightweight. I will not create it as a separate ticket since I do not want to use the reinit function even temporary for this case.
New patch is attached.
Thanks Michal
I like the patch. I asked Simo to take a look, too, since he wrote the memcache support:
18:54 < jhrozek> anyway, the subject is "[PATCH] mmap_cache: Check if slot and name_ptr are not invalid." 18:58 < simo> ok it does look ok 18:59 < simo> what he describes as "I won't do this or that" is actually what I wanted him to do anyway :-D 18:59 < simo> mzidek: good job 18:59 < simo> (if it passes some testing :-) 19:00 < jhrozek> simo: I've been running with the patch applied today on my laptop 19:00 < simo> jhrozek: it would be nice to have a test binary that can randomly 'reset' the cache and another binary that loops over the cache 19:00 < simo> just to make sure these resets will not cause issues in sss_nss 19:00 < simo> they shouldn't but ... 19:00 < simo> jhrozek: yes but it is all error paths 19:00 < simo> it is very unlikely you excercised that code at all 19:01 < jhrozek> yes, but at least we know the code is sane 19:01 < simo> nope we don't 19:01 < simo> we only know it doesn;t break the build 19:01 < jhrozek> ok, I will do more testing with sss_cache thrown in to invalidate the cache etc 19:01 < simo> but if it is never called we do not know that it works 19:02 < jhrozek> we know the usual paths work and master won't be horribly broken with the patch 19:02 < simo> sss_cache will not do that kind of invalidation 19:02 < simo> jhrozek: oh I am not saying we should hold the patch
So Ack from me.
Really good work from both you and Lukas. The test Simo proposed is tracked by https://fedorahosted.org/sssd/ticket/2045
On Sun, Aug 11, 2013 at 07:35:48PM +0200, Jakub Hrozek wrote:
On Wed, Aug 07, 2013 at 06:38:47PM +0200, Michal Židek wrote:
On 08/05/2013 09:30 PM, Simo Sorce wrote:
On Mon, 2013-08-05 at 21:13 +0200, Michal Židek wrote:
This is very ugly hotfix for ticket: https://fedorahosted.org/sssd/ticket/2018
So far we were not able to find out why the slot or name_ptr values were corrupted, but this should at least prevent segfaults.
NACK, (but close)
Please do not add all those comments about removing checks, those checks are good to stay forever, the cache may be corrupted by a bad disk sector or whatever, so they should never be removed.
Ok. I left just one FIXME (actually two, but related) label in the case where this patch relies on the same offset values of strs in both group and passwd data structure. I do not expect these structures to be changed, but it is something we want to have at least noticed in the code.
Also please do not just return ENOENT, and pretend nothing happened. If the cache is corrupted we want to take immediate corrective action. I suggest we return a new SSSD error and in the topmost nsscache caller we invalidate the current cache and reinit.
Actually I started doing it like this, but handling the problem in the topmost caller is not very good idea here. When I was writing it I found several topmost callers and even forcing the error to bubble up to these callers required some api changes. In addition we would have to make sure we handle the corrupted memory cache case in the future, when we use the affected functions in other topmost callers (it is not so easy to track). So I decided to handle it on the place where we detect the error.
Optional (may be should be a separate ticket) We currently reinit by closing the file and creating a new one. We should not do that, we should rather reset the header and then just zero all the data and reinitialize all the various areas. The reason is that if we have some bad bug and we keep creating new files we might run out of space because clients may have old files open and mmapped, and unless they perform a new operations, they may not release them. Note we should *not* use ftruncate() here, just write the appropriate initialization values everywhere needed.
I agree, but sss_mmap_cache_reinit is completely unsuitable for this case and a huge overkill, even if we might change it in the future. I created a new function sss_mmap_cache_reset which is very lightweight. I will not create it as a separate ticket since I do not want to use the reinit function even temporary for this case.
New patch is attached.
Thanks Michal
I like the patch. I asked Simo to take a look, too, since he wrote the memcache support:
18:54 < jhrozek> anyway, the subject is "[PATCH] mmap_cache: Check if slot and name_ptr are not invalid." 18:58 < simo> ok it does look ok 18:59 < simo> what he describes as "I won't do this or that" is actually what I wanted him to do anyway :-D 18:59 < simo> mzidek: good job 18:59 < simo> (if it passes some testing :-) 19:00 < jhrozek> simo: I've been running with the patch applied today on my laptop 19:00 < simo> jhrozek: it would be nice to have a test binary that can randomly 'reset' the cache and another binary that loops over the cache 19:00 < simo> just to make sure these resets will not cause issues in sss_nss 19:00 < simo> they shouldn't but ... 19:00 < simo> jhrozek: yes but it is all error paths 19:00 < simo> it is very unlikely you excercised that code at all 19:01 < jhrozek> yes, but at least we know the code is sane 19:01 < simo> nope we don't 19:01 < simo> we only know it doesn;t break the build 19:01 < jhrozek> ok, I will do more testing with sss_cache thrown in to invalidate the cache etc 19:01 < simo> but if it is never called we do not know that it works 19:02 < jhrozek> we know the usual paths work and master won't be horribly broken with the patch 19:02 < simo> sss_cache will not do that kind of invalidation 19:02 < simo> jhrozek: oh I am not saying we should hold the patch
So Ack from me.
Really good work from both you and Lukas. The test Simo proposed is tracked by https://fedorahosted.org/sssd/ticket/2045
Pushed to master, sssd-1-10 and sssd-1-9
sssd-devel@lists.fedorahosted.org