Hi,
I'm afraid I got a little stuck looking into upstream ticket
https://pagure.io/SSSD/sssd/issue/3465
The reporter is seeing sssd memory usage increasing on RHEL-6 and
RHEL-7. There is a valgrind log from RHEL-6 attached to the ticket which
does show some leaks; the three biggest ones are:
==14913== 4,715,355 (74,383 direct, 4,640,972 indirect) bytes in 599 blocks are definitely lost in loss record 1,113 of 1,115
==14913== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
==14913== by 0x8D76DCA: _talloc_array (talloc.c:668)
==14913== by 0x56D0A48: ldb_val_dup (ldb_msg.c:106)
==14913== by 0x52668B4: sysdb_attrs_add_val_int (sysdb.c:555)
==14913== by 0x1264C964: sdap_parse_entry (sdap.c:576)
==14913== by 0x1261C702: sdap_get_and_parse_generic_parse_entry (sdap_async.c:1749)
==14913== by 0x1261FEB3: sdap_get_generic_op_finished (sdap_async.c:1487)
==14913== by 0x1262155E: sdap_process_result (sdap_async.c:352)
==14913== by 0x8B6DEA5: epoll_event_loop_once (tevent_epoll.c:728)
==14913== by 0x8B6C2D5: std_event_loop_once (tevent_standard.c:114)
==14913== by 0x8B67C3C: _tevent_loop_once (tevent.c:533)
==14913== by 0x8B67CBA: tevent_common_loop_wait (tevent.c:637)
==14913==
==14913== 7,998,231 bytes in 64,275 blocks are indirectly lost in loss record 1,114 of 1,115
==14913== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
==14913== by 0x8D76DCA: _talloc_array (talloc.c:668)
==14913== by 0x56D0A48: ldb_val_dup (ldb_msg.c:106)
==14913== by 0x52668B4: sysdb_attrs_add_val_int (sysdb.c:555)
==14913== by 0x1264C964: sdap_parse_entry (sdap.c:576)
==14913== by 0x1261C702: sdap_get_and_parse_generic_parse_entry (sdap_async.c:1749)
==14913== by 0x1261FEB3: sdap_get_generic_op_finished (sdap_async.c:1487)
==14913== by 0x1262155E: sdap_process_result (sdap_async.c:352)
==14913== by 0x8B6DEA5: epoll_event_loop_once (tevent_epoll.c:728)
==14913== by 0x8B6C2D5: std_event_loop_once (tevent_standard.c:114)
==14913== by 0x8B67C3C: _tevent_loop_once (tevent.c:533)
==14913== by 0x8B67CBA: tevent_common_loop_wait (tevent.c:637)
==14913==
==14913== 33,554,430 bytes in 2 blocks are still reachable in loss record 1,115 of 1,115
==14913== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
==14913== by 0x9FA4CCD: _plug_decode (plugin_common.c:666)
==14913== by 0x1453036A: gssapi_decode (gssapi.c:497)
==14913== by 0x9F9B783: sasl_decode (common.c:621)
==14913== by 0x69AC69C: sb_sasl_cyrus_decode (cyrus.c:188)
==14913== by 0x69AEF2F: sb_sasl_generic_read (sasl.c:711)
==14913== by 0x6790AEB: sb_debug_read (sockbuf.c:829)
==14913== by 0x67906AE: ber_int_sb_read (sockbuf.c:423)
==14913== by 0x678D789: ber_get_next (io.c:532)
==14913== by 0x69A6D4D: wait4msg (result.c:491)
==14913== by 0x12620EB4: sdap_process_result (sdap_async.c:165)
==14913== by 0x8B6DEA5: epoll_event_loop_once (tevent_epoll.c:728)
The biggest one (the 33,554,430 bytes reported as still reachable) almost
looks like a leak in libldap, because this line:
==14913== by 0x12620EB4: sdap_process_result (sdap_async.c:165)
is a call to ldap_result().
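Just to illustrate what I mean, here is a rough sketch of the contract
around ldap_result() (not the actual sdap code; drain_one_result is a
made-up name): every LDAPMessage that ldap_result() hands out stays
allocated inside libldap until it is released with ldap_msgfree(), and
memory that libldap or the SASL layer still holds a pointer to shows up
in valgrind as still reachable:

    #include <sys/time.h>
    #include <ldap.h>

    /* Sketch only: poll for one pending result and release it again.
     * Whatever ldap_result() returns must be freed with ldap_msgfree()
     * once the caller is done with it, otherwise it stays allocated. */
    static int drain_one_result(LDAP *ld)
    {
        LDAPMessage *msg = NULL;
        struct timeval no_wait = { 0, 0 };
        int ret;

        ret = ldap_result(ld, LDAP_RES_ANY, 0, &no_wait, &msg);
        if (ret <= 0) {
            return ret;            /* -1 error, 0 nothing pending */
        }

        /* ... hand msg over to the parsing code ... */

        ldap_msgfree(msg);         /* skipping this would leak the message */
        return ret;
    }
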
But even the other leaks make little sense to me, like this one:
==14913== 7,998,231 bytes in 64,275 blocks are indirectly lost in loss record 1,114 of 1,115
==14913== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
==14913== by 0x8D76DCA: _talloc_array (talloc.c:668)
==14913== by 0x56D0A48: ldb_val_dup (ldb_msg.c:106)
==14913== by 0x52668B4: sysdb_attrs_add_val_int (sysdb.c:555)
==14913== by 0x1264C964: sdap_parse_entry (sdap.c:576)
==14913== by 0x1261C702: sdap_get_and_parse_generic_parse_entry (sdap_async.c:1749)
==14913== by 0x1261FEB3: sdap_get_generic_op_finished (sdap_async.c:1487)
==14913== by 0x1262155E: sdap_process_result (sdap_async.c:352)
==14913== by 0x8B6DEA5: epoll_event_loop_once (tevent_epoll.c:728)
==14913== by 0x8B6C2D5: std_event_loop_once (tevent_standard.c:114)
==14913== by 0x8B67C3C: _tevent_loop_once (tevent.c:533)
==14913== by 0x8B67CBA: tevent_common_loop_wait (tevent.c:637)
In sdap_parse_entry, we allocate the sysdb_attrs on the state, and all the
internal ldb structures hang off the sysdb_attrs. The context is then
stolen onto the caller-provided context, which is typically the state of a
tevent request that is passed upwards. So the only idea I have is that
somewhere we steal the request to NULL, but I don't know where or how to
look for this.
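One way I can think of to at least check the "stolen to NULL" theory
(just a sketch with made-up names, assuming we can temporarily patch
some debug code into the provider): talloc can track the NULL context,
and talloc_report_full(NULL, ...) then lists everything parented to it,
so anything that was stolen onto NULL should show up there and keep
growing:

    #include <stdio.h>
    #include <talloc.h>

    /* Dump everything owned by the talloc NULL context to a log file.
     * Needs talloc_enable_null_tracking() to have been called before
     * the allocations we are interested in. */
    static void dump_null_context(const char *path)
    {
        FILE *f = fopen(path, "a");
        if (f == NULL) {
            return;
        }
        talloc_report_full(NULL, f);
        fclose(f);
    }

    int main(void)
    {
        talloc_enable_null_tracking();

        /* simulate the suspected bug */
        TALLOC_CTX *req = talloc_new(NULL);          /* stands in for a tevent req state */
        void *attrs = talloc_strdup(req, "attrs");   /* normally freed together with req */
        talloc_steal(NULL, attrs);                   /* reparent onto NULL */
        talloc_free(req);                            /* attrs is now orphaned, never freed */

        dump_null_context("/tmp/talloc_null.log");
        return 0;
    }

If nothing ever shows up in that report while the RSS keeps growing, we
would at least know the memory is still owned by some request that is
never freed rather than by the NULL context.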
Also, it seems odd that with so many people running sssd, this is the
only user (so far?) who has reported this issue.
According to the ps output in the ticket the increase is not that big,
122520 -> 195176 in RSS after 48h. I guess this might be the reason why
many people do not notice the increase.
I did a small test where I loaded the members of a larger group in a
plain LDAP setup and saw an increase from 16100 to 42224 which stayed
overnight. I will try to get more data from my setup.
bye,
Sumit
>
> Does anyone have an idea how to continue?