On Tue, Sep 17, 2013 at 07:09:42PM +0200, Lukas Slebodnik wrote:
On (17/09/13 18:33), Michal Židek wrote:
>On 09/17/2013 06:15 PM, Lukas Slebodnik wrote:
>>On (17/09/13 17:15), Michal Židek wrote:
>>>On 09/17/2013 04:41 PM, Lukas Slebodnik wrote:
>>>>On (16/09/13 18:16), Michal Židek wrote:
>>>>>On 09/14/2013 12:35 AM, Lukas Slebodnik wrote:
>>>>>>On (13/09/13 19:17), Michal Židek wrote:
>>>>>>>On 09/13/2013 05:58 PM, Michal Židek wrote:
>>>>>>>>Hello,
>>>>>>>>
>>>>>>>>This patch should add another line of defence against
memory cache
>>>>>>>>problems caused by accessing slot outside of bounds.
>>>>>>>>
>>>>>>>>Thanks
>>>>>>>>Michal
>>>>>>>>
>>>>>>>
>>>>>>>After discussion with Lukas I am attaching alternative
version
>>>>>>>without the call to save the corrupted cache.
>>>>>>>
>>>>>>>Thanks
>>>>>>>Michal
>>>>>>>
>>>>>>
>>>>>>I tested patch with latest sssd-1-9 on RHEL6.
>>>>>>
>>>>>>Message was written to the sssd_nss.log:
>>>>>>(Sat Sep 14 00:19:24 2013) [sssd[nss]]
[sss_mc_get_next_slot_with_hash]
>>>>>>(0x0010): Corrupted fastcache. Slot number too big.
>>>>>>
>>>>>>But it seems that sssd_nss crashed.
>>>>>>
>>>>>>Program received signal SIGILL, Illegal instruction.
>>>>>>0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
>>>>>>#0 0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
>>>>>>No symbol table info available.
>>>>>>#1 0x0000000000000008 in ?? ()
>>>>>>No symbol table info available.
>>>>>>#2 0x0000000001086cf0 in ?? ()
>>>>>>No symbol table info available.
>>>>>>#3 0x00000000010842a0 in ?? ()
>>>>>>No symbol table info available.
>>>>>>#4 0x00000000004288de in sss_mc_find_free_slots (_mcc=0x1086cf0,
rec_len=252, key=<value optimized out>, _rec=0x7fff1b6bb008) at
src/responder/nss/nsssrv_mmap_cache.c:495
>>>>>> tot_slots = <value optimized out>
>>>>>> i = <value optimized out>
>>>>>> rec = <value optimized out>
>>>>>> cur = 0
>>>>>> t = <value optimized out>
>>>>>> used = true
>>>>>>#5 sss_mc_get_record (_mcc=0x1086cf0, rec_len=252, key=<value
optimized out>, _rec=0x7fff1b6bb008) at src/responder/nss/nsssrv_mmap_cache.c:637
>>>>>> mcc = 0x7fa632dfa038
>>>>>> old_rec = <value optimized out>
>>>>>> rec = <value optimized out>
>>>>>> old_slots = <value optimized out>
>>>>>> num_slots = 8
>>>>>> base_slot = <value optimized out>
>>>>>> ret = 853516376
>>>>>> i = <value optimized out>
>>>>>> __FUNCTION__ = "sss_mc_get_record"
>>>>>>#6 0x0000000000428eee in sss_mmap_cache_pw_store
(_mcc=0x10842a0, name=0x7fff1b6bb140, pw=0x7fff1b6bb150, uid=126516, gid=16000,
gecos=0x7fff1b6bb180, homedir=0x7fff1b6bb170, shell=0x7fff1b6bb160) at
src/responder/nss/nsssrv_mmap_cache.c:731
>>>>>> mcc = 0x1086cf0
>>>>>> rec = <value optimized out>
>>>>>> data = <value optimized out>
>>>>>> uidkey = {str = 0x7fff1b6bb010 "126516", len =
7}
>>>>>> uidstr = "126516\000\000\021)\213"
>>>>>> data_len = 204
>>>>>> rec_len = 252
>>>>>> pos = <value optimized out>
>>>>>> ret = <value optimized out>
>>>>>>
>>>>>>LS
>>>>>>
>>>>The crash was caused, because mmap_cache was invalidated
>>>>and memset was called for invalidated record.
>>>>
>>>>memset(rec->data, MC_INVALID_VAL8, <snip> *
MC_SIZE_TO_SLOTS(rec->len))
>>>> ^^^^^^^^^^
>>>> rec->len is
0xffffffff
>>>>>
>>>>>Good catch.
>>>>>
>>>>>In function sss_mc_get_next_slot_with_hash I returned MC_INVALID_VAL
>>>>>in case of failure. This function is called in function
>>>>>sss_mc_rm_rec_from_chain, which returns no error code. So the upper
>>>>>layers worked with the wrong MC_INVALID_VAL, as if it was valid
value
>>>>>and the MC_PTR_TO_SLOT resulted in some big slot number (which
>>>>>probably caused the crash). In this patch I populate the information
>>>>>about failure to to sss_mc_invalidate_rec and terminate it if
>>>>>necessary.
>>>>>
>>>>>New patch is attached. Please run the testing script with your
>>>>>configuration that failed to see if it is fixed.
>>>>>
>>>>>Thanks
>>>>>Michal
>>>>>
>>>>I ran a stress test for 20+ hours and sssd_nss didn't crash,
>>>>only mmap_cache was reseted or reinvalidated.
>>>>
>>>>Here is sssd_nss.log:
>>>>=============================================================
>>>>(Mon Sep 16 18:49:56 2013) [sssd[nss]] [monitor_common_rotate_logs]
(0x0010): Debug level changed to 0x0070
>>>>(Mon Sep 16 18:49:57 2013) [sssd[nss]] [monitor_common_rotate_logs]
(0x0010): Debug level changed to 0x0030
>>>>(Mon Sep 16 18:53:31 2013) [sssd[nss]] [setpwent_result_timeout]
(0x0020): setpwent result object has expired. Cleaning up.
>>>>(Mon Sep 16 19:37:17 2013) [sssd[nss]] [sss_mc_find_record] (0x0010):
Corrupted fastcache. name_ptr value is 16.
>>>>--- 1st reset.
>>>>(Mon Sep 16 21:32:40 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
>>>>--- 2nd reset.
>>>>(Mon Sep 16 22:04:04 2013) [sssd[nss]] [sss_mc_find_record] (0x0010):
Corrupted fastcache. name_ptr value is 909128497.
>>>>--- 3rd reset.
>>>>(Mon Sep 16 23:15:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
Fatal internal mmap cache error, invalidating cache!
>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
store user testbigldap129485(default) in mmap cache!
>>>>--- 1st invalidation.
>>>>(Tue Sep 17 06:06:51 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
>>>>--- 4th reset.
>>>>(Tue Sep 17 06:25:04 2013) [sssd[nss]] [sss_mc_add_rec_to_chain]
(0x0010): Corrupted fastcache. Slot number too big.
>>>>--- 5th reset.
>>>>(Tue Sep 17 07:48:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
>>>>--- 6th reset.
>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
Fatal internal mmap cache error, invalidating cache!
>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
store user testbigldap119808(default) in mmap cache!
>>>>--- 2nd invalidation.
>>>>(Tue Sep 17 12:56:02 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
>>>>--- 7th reset.
>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
Fatal internal mmap cache error, invalidating cache!
>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
store user testbigldap126987(default) in mmap cache!
>>>>--- 3rd invalidation.
>>>>(Tue Sep 17 14:30:56 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
>>>>--- 8th reset.
>>>>(Tue Sep 17 14:54:21 2013) [sssd[nss]] [sss_mc_add_rec_to_chain]
(0x0010): Corrupted fastcache. Slot number too big.
>>>>--- 9th reset.
>>>>=============================================================
>>>>
>>>>In older log sile, I also saw reseting caches in another functions.
>>>>BTW: I tested my patch with "two next pointers" and it did not
crash and
>>>>log file is empty :-)
>>>>
>>>>just few small nitpicks.
>>>>
>>>>>From ba054788692f713eed0738f5b3f5e6bb48fe4c4b Mon Sep 17 00:00:00
2001
>>>>>From: Michal Zidek <mzidek(a)redhat.com>
>>>>>Date: Fri, 13 Sep 2013 17:41:28 +0200
>>>>>Subject: [PATCH] Check slot validity before MC_SLOT_TO_PTR.
>>>>>
>>>>>resolves:
>>>>>https://fedorahosted.org/sssd/ticket/2049
>>>>>---
>>>>>src/responder/nss/nsssrv_mmap_cache.c | 80
+++++++++++++++++++++++++++++++----
>>>>>src/sss_client/nss_mc_common.c | 4 ++
>>>>>2 files changed, 75 insertions(+), 9 deletions(-)
>>>>>
>>>>>diff --git a/src/responder/nss/nsssrv_mmap_cache.c
b/src/responder/nss/nsssrv_mmap_cache.c
>>>>>index 4e45405..09a4c2b 100644
>>>>>--- a/src/responder/nss/nsssrv_mmap_cache.c
>>>>>+++ b/src/responder/nss/nsssrv_mmap_cache.c
>>>>>@@ -191,6 +191,13 @@ static void sss_mc_add_rec_to_chain(struct
sss_mc_ctx *mcc,
>>>>> }
>>>>>
>>>>> do {
>>>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
>>>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
>>>>>+ ("Corrupted fastcache. Slot number too
big.\n"));
>>>>>+ sss_mmap_cache_reset(mcc);
>>>>>+ return;
>>>>>+ }
>>>>>+
>>>>> cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct
sss_mc_rec);
>>>>> if (cur == rec) {
>>>>> /* rec already stored in hash chain */
>>>>>@@ -205,16 +212,24 @@ static void sss_mc_add_rec_to_chain(struct
sss_mc_ctx *mcc,
>>>>> cur->next = MC_PTR_TO_SLOT(mcc->data_table, rec);
>>>>>}
>>>>>
>>>>>-static inline uint32_t
>>>>>+static inline errno_t
>>>>>sss_mc_get_next_slot_with_hash(struct sss_mc_ctx *mcc,
>>>>> struct sss_mc_rec *start_rec,
>>>>>- uint32_t hash)
>>>>>+ uint32_t hash,
>>>>>+ uint32_t *_slot)
>>>>>{
>>>>> struct sss_mc_rec *rec;
>>>>> uint32_t slot;
>>>>>
>>>>> slot = start_rec->next;
>>>>> while (slot != MC_INVALID_VAL) {
>>>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
>>>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
>>>>>+ ("Corrupted fastcache. Slot number too
big.\n"));
>>>>>+ sss_mmap_cache_reset(mcc);
>>>>>+ return EINVAL;
>>>>>+ }
>>>>>+
>>>>> rec = MC_SLOT_TO_PTR(mcc->data_table, slot, struct
sss_mc_rec);
>>>>> if (rec->hash1 == hash || rec->hash2 == hash) {
>>>>> break;
>>>>>@@ -223,23 +238,26 @@ sss_mc_get_next_slot_with_hash(struct
sss_mc_ctx *mcc,
>>>>> slot = rec->next;
>>>>> }
>>>>>
>>>>>- return slot;
>>>>>+ *_slot = slot;
>>>>>+
>>>>>+ return EOK;
>>>>>}
>>>>>
>>>>>-static void sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
>>>>>+static errno_t sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
>>>>> struct sss_mc_rec *rec,
>>>>> uint32_t hash)
>>>>>{
>>>>> struct sss_mc_rec *prev = NULL;
>>>>> struct sss_mc_rec *cur = NULL;
>>>>> uint32_t slot;
>>>>>+ uint32_t ret;
>>>>>
>>>>> if (hash > MC_HT_ELEMS(mcc->ht_size)) {
>>>>> /* It can happen if rec->hash1 and rec->hash2 was the
same.
>>>>> * or it is invalid hash. It is better to return
>>>>> * than trying to access out of bounds memory
>>>>> */
>>>>>- return;
>>>>>+ return EOK;
>>>> ^^^
>>>> EOK should be returned on if hash is equal to MC_INVALID_VAL
>>>
>>>Ok. I also changed the comments (split the error and normal case).
>>>
>>>>
>>>>> }
>>>>>
>>>>> slot = mcc->hash_table[hash];
>>>>>@@ -247,18 +265,37 @@ static void sss_mc_rm_rec_from_chain(struct
sss_mc_ctx *mcc,
>>>>> /* record has already been removed. It may happen if
rec->hash1 and
>>>>> * rec->has2 are the same. (It is not very likely).
>>>>> */
>>>>>- return;
>>>>>+ return EOK;
>>>>>+ }
>>>>>+
>>>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
>>>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
>>>>>+ ("Corrupted fastcache. Slot number too
big.\n"));
>>>>>+ sss_mmap_cache_reset(mcc);
>>>>>+ return EINVAL;
>>>>> }
>>>>>+
>>>>> cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct
sss_mc_rec);
>>>>> if (cur == rec) {
>>>>> /* rec->next can refer to record without matching
hashes.
>>>>> * We need to skip this(those) records, because
>>>>> * mcc->hash_table[hash] have to refer to valid start of
the chain.
>>>>> */
>>>>>- mcc->hash_table[hash] =
sss_mc_get_next_slot_with_hash(mcc, rec, hash);
>>>>>+ ret = sss_mc_get_next_slot_with_hash(mcc, rec, hash,
>>>>>+
&mcc->hash_table[hash]);
>>>>>+ if (ret != EOK) {
>>>>>+ return ret;
>>>>>+ }
>>>>This condition does not make sense.
>>>>You can use:
>>>> either return ret;
>>>> or return sss_mc_get_next_slot_with_hash
>>>
>>>Well, I wanted the function to run to the end if nothing went wrong,
>>>but yes, in this case we return at several places in the function so
>>>it does not make much sense.
>>>
>>>>
>>>>LS
>>>
>>>Thanks for the review Lukas. New patch is attached.
>>>
>>>Michal
>>
>>I tested patch for sssd-1-9 branch on RHEL6
>>and it works fine. I think it is good alternative for sssd-1-9 branch.
>>
>>ACK
>>
>>LS
>
>I would personally prefer to have this in master as well, even if the
>currently known issues will be fixed with your patch ([PATCH]
>mmap_cache: Use two chains for hash collision). I think it hurts
>nothing to be more defensive.
Your patch is not about defensive coding. The patch is about fixing corrupted
cache in back-compatible way. In my opinion, mmap_cache source code looks much
more complicated and readability is decreased.
LS
I don't think the readability suffers that much and maybe we can improve
it by wrapping the repeated checks in a macro?
I'm also for more defensiveness, we don't know what other strange bugs
might linger in the code.