Re: [SSSD] [PATCH] Check slot validity before MC_SLOT_TO_PTR.

Tuesday, 17 September 2013

On Tue, Sep 17, 2013 at 07:09:42PM +0200, Lukas Slebodnik wrote:
...
 On (17/09/13 18:33), Michal Židek wrote:
 >On 09/17/2013 06:15 PM, Lukas Slebodnik wrote:
 >>On (17/09/13 17:15), Michal Židek wrote:
 >>>On 09/17/2013 04:41 PM, Lukas Slebodnik wrote:
 >>>>On (16/09/13 18:16), Michal Židek wrote:
 >>>>>On 09/14/2013 12:35 AM, Lukas Slebodnik wrote:
 >>>>>>On (13/09/13 19:17), Michal Židek wrote:
 >>>>>>>On 09/13/2013 05:58 PM, Michal Židek wrote:
 >>>>>>>>Hello,
 >>>>>>>>
 >>>>>>>>This patch should add another line of defence against
memory cache
 >>>>>>>>problems caused by accessing slot outside of bounds.
 >>>>>>>>
 >>>>>>>>Thanks
 >>>>>>>>Michal
 >>>>>>>>
 >>>>>>>
 >>>>>>>After discussion with Lukas I am attaching alternative
version
 >>>>>>>without the call to save the corrupted cache.
 >>>>>>>
 >>>>>>>Thanks
 >>>>>>>Michal
 >>>>>>>
 >>>>>>
 >>>>>>I tested patch with latest sssd-1-9 on RHEL6.
 >>>>>>
 >>>>>>Message was written to the sssd_nss.log:
 >>>>>>(Sat Sep 14 00:19:24 2013) [sssd[nss]]
[sss_mc_get_next_slot_with_hash]
 >>>>>>(0x0010): Corrupted fastcache. Slot number too big.
 >>>>>>
 >>>>>>But it seems that sssd_nss crashed.
 >>>>>>
 >>>>>>Program received signal SIGILL, Illegal instruction.
 >>>>>>0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
 >>>>>>#0  0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
 >>>>>>No symbol table info available.
 >>>>>>#1  0x0000000000000008 in ?? ()
 >>>>>>No symbol table info available.
 >>>>>>#2  0x0000000001086cf0 in ?? ()
 >>>>>>No symbol table info available.
 >>>>>>#3  0x00000000010842a0 in ?? ()
 >>>>>>No symbol table info available.
 >>>>>>#4  0x00000000004288de in sss_mc_find_free_slots (_mcc=0x1086cf0,
rec_len=252, key=<value optimized out>, _rec=0x7fff1b6bb008) at
src/responder/nss/nsssrv_mmap_cache.c:495
 >>>>>>         tot_slots = <value optimized out>
 >>>>>>         i = <value optimized out>
 >>>>>>         rec = <value optimized out>
 >>>>>>         cur = 0
 >>>>>>         t = <value optimized out>
 >>>>>>         used = true
 >>>>>>#5  sss_mc_get_record (_mcc=0x1086cf0, rec_len=252, key=<value
optimized out>, _rec=0x7fff1b6bb008) at src/responder/nss/nsssrv_mmap_cache.c:637
 >>>>>>         mcc = 0x7fa632dfa038
 >>>>>>         old_rec = <value optimized out>
 >>>>>>         rec = <value optimized out>
 >>>>>>         old_slots = <value optimized out>
 >>>>>>         num_slots = 8
 >>>>>>         base_slot = <value optimized out>
 >>>>>>         ret = 853516376
 >>>>>>         i = <value optimized out>
 >>>>>>         __FUNCTION__ = "sss_mc_get_record"
 >>>>>>#6  0x0000000000428eee in sss_mmap_cache_pw_store
(_mcc=0x10842a0, name=0x7fff1b6bb140, pw=0x7fff1b6bb150, uid=126516, gid=16000,
gecos=0x7fff1b6bb180, homedir=0x7fff1b6bb170, shell=0x7fff1b6bb160) at
src/responder/nss/nsssrv_mmap_cache.c:731
 >>>>>>         mcc = 0x1086cf0
 >>>>>>         rec = <value optimized out>
 >>>>>>         data = <value optimized out>
 >>>>>>         uidkey = {str = 0x7fff1b6bb010 "126516", len =
7}
 >>>>>>         uidstr = "126516\000\000\021)\213"
 >>>>>>         data_len = 204
 >>>>>>         rec_len = 252
 >>>>>>         pos = <value optimized out>
 >>>>>>         ret = <value optimized out>
 >>>>>>
 >>>>>>LS
 >>>>>>
 >>>>The crash was caused, because mmap_cache was invalidated
 >>>>and memset was called for invalidated record.
 >>>>
 >>>>memset(rec->data, MC_INVALID_VAL8, <snip> *
MC_SIZE_TO_SLOTS(rec->len))
 >>>>                                                             ^^^^^^^^^^
 >>>>                                                     rec->len is
0xffffffff
 >>>>>
 >>>>>Good catch.
 >>>>>
 >>>>>In function sss_mc_get_next_slot_with_hash I returned MC_INVALID_VAL
 >>>>>in case of failure. This function is called in function
 >>>>>sss_mc_rm_rec_from_chain, which returns no error code. So the upper
 >>>>>layers worked with the wrong MC_INVALID_VAL, as if it was valid
value
 >>>>>and the MC_PTR_TO_SLOT resulted in some big slot number (which
 >>>>>probably caused the crash). In this patch I populate the information
 >>>>>about failure to to sss_mc_invalidate_rec and terminate it if
 >>>>>necessary.
 >>>>>
 >>>>>New patch is attached. Please run the testing script with your
 >>>>>configuration that failed to see if it is fixed.
 >>>>>
 >>>>>Thanks
 >>>>>Michal
 >>>>>
 >>>>I ran a stress test for 20+ hours and sssd_nss didn't crash,
 >>>>only mmap_cache was reseted or reinvalidated.
 >>>>
 >>>>Here is sssd_nss.log:
 >>>>=============================================================
 >>>>(Mon Sep 16 18:49:56 2013) [sssd[nss]] [monitor_common_rotate_logs]
(0x0010): Debug level changed to 0x0070
 >>>>(Mon Sep 16 18:49:57 2013) [sssd[nss]] [monitor_common_rotate_logs]
(0x0010): Debug level changed to 0x0030
 >>>>(Mon Sep 16 18:53:31 2013) [sssd[nss]] [setpwent_result_timeout]
(0x0020): setpwent result object has expired. Cleaning up.
 >>>>(Mon Sep 16 19:37:17 2013) [sssd[nss]] [sss_mc_find_record] (0x0010):
Corrupted fastcache. name_ptr value is 16.
 >>>>--- 1st reset.
 >>>>(Mon Sep 16 21:32:40 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
 >>>>--- 2nd reset.
 >>>>(Mon Sep 16 22:04:04 2013) [sssd[nss]] [sss_mc_find_record] (0x0010):
Corrupted fastcache. name_ptr value is 909128497.
 >>>>--- 3rd reset.
 >>>>(Mon Sep 16 23:15:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
 >>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
 >>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
Fatal internal mmap cache error, invalidating cache!
 >>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
store user testbigldap129485(default) in mmap cache!
 >>>>--- 1st invalidation.
 >>>>(Tue Sep 17 06:06:51 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
 >>>>--- 4th reset.
 >>>>(Tue Sep 17 06:25:04 2013) [sssd[nss]] [sss_mc_add_rec_to_chain]
(0x0010): Corrupted fastcache. Slot number too big.
 >>>>--- 5th reset.
 >>>>(Tue Sep 17 07:48:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
 >>>>--- 6th reset.
 >>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
 >>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
Fatal internal mmap cache error, invalidating cache!
 >>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
store user testbigldap119808(default) in mmap cache!
 >>>>--- 2nd invalidation.
 >>>>(Tue Sep 17 12:56:02 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
 >>>>--- 7th reset.
 >>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
 >>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
Fatal internal mmap cache error, invalidating cache!
 >>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
store user testbigldap126987(default) in mmap cache!
 >>>>--- 3rd invalidation.
 >>>>(Tue Sep 17 14:30:56 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
 >>>>--- 8th reset.
 >>>>(Tue Sep 17 14:54:21 2013) [sssd[nss]] [sss_mc_add_rec_to_chain]
(0x0010): Corrupted fastcache. Slot number too big.
 >>>>--- 9th reset.
 >>>>=============================================================
 >>>>
 >>>>In older log sile, I also saw reseting caches in another functions.
 >>>>BTW: I tested my patch with "two next pointers" and it did not
crash and
 >>>>log file is empty :-)
 >>>>
 >>>>just few small nitpicks.
 >>>>
 >>>>>From ba054788692f713eed0738f5b3f5e6bb48fe4c4b Mon Sep 17 00:00:00
2001
 >>>>>From: Michal Zidek <mzidek(a)redhat.com&gt;
 >>>>>Date: Fri, 13 Sep 2013 17:41:28 +0200
 >>>>>Subject: [PATCH] Check slot validity before MC_SLOT_TO_PTR.
 >>>>>
 >>>>>resolves:
 >>>>>https://fedorahosted.org/sssd/ticket/2049
 >>>>>---
 >>>>>src/responder/nss/nsssrv_mmap_cache.c | 80
+++++++++++++++++++++++++++++++----
 >>>>>src/sss_client/nss_mc_common.c        |  4 ++
 >>>>>2 files changed, 75 insertions(+), 9 deletions(-)
 >>>>>
 >>>>>diff --git a/src/responder/nss/nsssrv_mmap_cache.c
b/src/responder/nss/nsssrv_mmap_cache.c
 >>>>>index 4e45405..09a4c2b 100644
 >>>>>--- a/src/responder/nss/nsssrv_mmap_cache.c
 >>>>>+++ b/src/responder/nss/nsssrv_mmap_cache.c
 >>>>>@@ -191,6 +191,13 @@ static void sss_mc_add_rec_to_chain(struct
sss_mc_ctx *mcc,
 >>>>>     }
 >>>>>
 >>>>>     do {
 >>>>>+        if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
 >>>>>+            DEBUG(SSSDBG_FATAL_FAILURE,
 >>>>>+                  ("Corrupted fastcache. Slot number too
big.\n"));
 >>>>>+            sss_mmap_cache_reset(mcc);
 >>>>>+            return;
 >>>>>+        }
 >>>>>+
 >>>>>         cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct
sss_mc_rec);
 >>>>>         if (cur == rec) {
 >>>>>             /* rec already stored in hash chain */
 >>>>>@@ -205,16 +212,24 @@ static void sss_mc_add_rec_to_chain(struct
sss_mc_ctx *mcc,
 >>>>>     cur->next = MC_PTR_TO_SLOT(mcc->data_table, rec);
 >>>>>}
 >>>>>
 >>>>>-static inline uint32_t
 >>>>>+static inline errno_t
 >>>>>sss_mc_get_next_slot_with_hash(struct sss_mc_ctx *mcc,
 >>>>>                                struct sss_mc_rec *start_rec,
 >>>>>-                               uint32_t hash)
 >>>>>+                               uint32_t hash,
 >>>>>+                               uint32_t *_slot)
 >>>>>{
 >>>>>     struct sss_mc_rec *rec;
 >>>>>     uint32_t slot;
 >>>>>
 >>>>>     slot = start_rec->next;
 >>>>>     while (slot != MC_INVALID_VAL) {
 >>>>>+        if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
 >>>>>+            DEBUG(SSSDBG_FATAL_FAILURE,
 >>>>>+                  ("Corrupted fastcache. Slot number too
big.\n"));
 >>>>>+            sss_mmap_cache_reset(mcc);
 >>>>>+            return EINVAL;
 >>>>>+        }
 >>>>>+
 >>>>>         rec = MC_SLOT_TO_PTR(mcc->data_table, slot, struct
sss_mc_rec);
 >>>>>         if (rec->hash1 == hash || rec->hash2 == hash) {
 >>>>>             break;
 >>>>>@@ -223,23 +238,26 @@ sss_mc_get_next_slot_with_hash(struct
sss_mc_ctx *mcc,
 >>>>>         slot = rec->next;
 >>>>>     }
 >>>>>
 >>>>>-    return slot;
 >>>>>+    *_slot = slot;
 >>>>>+
 >>>>>+    return EOK;
 >>>>>}
 >>>>>
 >>>>>-static void sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
 >>>>>+static errno_t sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
 >>>>>                                      struct sss_mc_rec *rec,
 >>>>>                                      uint32_t hash)
 >>>>>{
 >>>>>     struct sss_mc_rec *prev = NULL;
 >>>>>     struct sss_mc_rec *cur = NULL;
 >>>>>     uint32_t slot;
 >>>>>+    uint32_t ret;
 >>>>>
 >>>>>     if (hash > MC_HT_ELEMS(mcc->ht_size)) {
 >>>>>         /* It can happen if rec->hash1 and rec->hash2 was the
same.
 >>>>>          * or it is invalid hash. It is better to return
 >>>>>          * than trying to access out of bounds memory
 >>>>>          */
 >>>>>-        return;
 >>>>>+        return EOK;
 >>>>                  ^^^
 >>>>           EOK should be returned on if hash is equal to MC_INVALID_VAL
 >>>
 >>>Ok. I also changed the comments (split the error and normal case).
 >>>
 >>>>
 >>>>>     }
 >>>>>
 >>>>>     slot = mcc->hash_table[hash];
 >>>>>@@ -247,18 +265,37 @@ static void sss_mc_rm_rec_from_chain(struct
sss_mc_ctx *mcc,
 >>>>>         /* record has already been removed. It may happen if
rec->hash1 and
 >>>>>          * rec->has2 are the same. (It is not very likely).
 >>>>>          */
 >>>>>-        return;
 >>>>>+        return EOK;
 >>>>>+    }
 >>>>>+
 >>>>>+    if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
 >>>>>+        DEBUG(SSSDBG_FATAL_FAILURE,
 >>>>>+              ("Corrupted fastcache. Slot number too
big.\n"));
 >>>>>+        sss_mmap_cache_reset(mcc);
 >>>>>+        return EINVAL;
 >>>>>     }
 >>>>>+
 >>>>>     cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct
sss_mc_rec);
 >>>>>     if (cur == rec) {
 >>>>>         /* rec->next can refer to record without matching
hashes.
 >>>>>          * We need to skip this(those) records, because
 >>>>>          * mcc->hash_table[hash] have to refer to valid start of
the chain.
 >>>>>          */
 >>>>>-        mcc->hash_table[hash] =
sss_mc_get_next_slot_with_hash(mcc, rec, hash);
 >>>>>+        ret = sss_mc_get_next_slot_with_hash(mcc, rec, hash,
 >>>>>+                                            
&mcc->hash_table[hash]);
 >>>>>+        if (ret != EOK) {
 >>>>>+            return ret;
 >>>>>+        }
 >>>>This condition does not make sense.
 >>>>You can use:
 >>>>       either return ret;
 >>>>       or return sss_mc_get_next_slot_with_hash
 >>>
 >>>Well, I wanted the function to run to the end if nothing went wrong,
 >>>but yes, in this case we return at several places in the function so
 >>>it does not make much sense.
 >>>
 >>>>
 >>>>LS
 >>>
 >>>Thanks for the review Lukas. New patch is attached.
 >>>
 >>>Michal
 >>
 >>I tested patch for sssd-1-9 branch on RHEL6
 >>and it works fine. I think it is good alternative for sssd-1-9 branch.
 >>
 >>ACK
 >>
 >>LS
 >
 >I would personally prefer to have this in master as well, even if the
 >currently known issues will be fixed with your patch ([PATCH]
 >mmap_cache: Use two chains for hash collision). I think it hurts
 >nothing to be more defensive.
 Your patch is not about defensive coding. The patch is about fixing corrupted
 cache in back-compatible way. In my opinion, mmap_cache source code looks much
 more complicated and readability is decreased.

 LS 
I don't think the readability suffers that much and maybe we can improve
it by wrapping the repeated checks in a macro?

I'm also for more defensiveness, we don't know what other strange bugs
might linger in the code.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: [SSSD] [PATCH] Check slot validity before MC_SLOT_TO_PTR.