Hi,
I'm afraid I got a little stuck looking into upstream ticket
https://pagure.io/SSSD/sssd/issue/3465
The reporter is seeing sssd memory usage increasing on RHEL-6 and
RHEL-7. There is a valgrind log from RHEL-6 attached to the ticket which
does show some leaks; the three biggest ones are:
==14913== 4,715,355 (74,383 direct, 4,640,972 indirect) bytes in 599 blocks are definitely lost in loss record 1,113 of 1,115
==14913== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
==14913== by 0x8D76DCA: _talloc_array (talloc.c:668)
==14913== by 0x56D0A48: ldb_val_dup (ldb_msg.c:106)
==14913== by 0x52668B4: sysdb_attrs_add_val_int (sysdb.c:555)
==14913== by 0x1264C964: sdap_parse_entry (sdap.c:576)
==14913== by 0x1261C702: sdap_get_and_parse_generic_parse_entry (sdap_async.c:1749)
==14913== by 0x1261FEB3: sdap_get_generic_op_finished (sdap_async.c:1487)
==14913== by 0x1262155E: sdap_process_result (sdap_async.c:352)
==14913== by 0x8B6DEA5: epoll_event_loop_once (tevent_epoll.c:728)
==14913== by 0x8B6C2D5: std_event_loop_once (tevent_standard.c:114)
==14913== by 0x8B67C3C: _tevent_loop_once (tevent.c:533)
==14913== by 0x8B67CBA: tevent_common_loop_wait (tevent.c:637)
==14913==
==14913== 7,998,231 bytes in 64,275 blocks are indirectly lost in loss record 1,114 of 1,115
==14913== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
==14913== by 0x8D76DCA: _talloc_array (talloc.c:668)
==14913== by 0x56D0A48: ldb_val_dup (ldb_msg.c:106)
==14913== by 0x52668B4: sysdb_attrs_add_val_int (sysdb.c:555)
==14913== by 0x1264C964: sdap_parse_entry (sdap.c:576)
==14913== by 0x1261C702: sdap_get_and_parse_generic_parse_entry (sdap_async.c:1749)
==14913== by 0x1261FEB3: sdap_get_generic_op_finished (sdap_async.c:1487)
==14913== by 0x1262155E: sdap_process_result (sdap_async.c:352)
==14913== by 0x8B6DEA5: epoll_event_loop_once (tevent_epoll.c:728)
==14913== by 0x8B6C2D5: std_event_loop_once (tevent_standard.c:114)
==14913== by 0x8B67C3C: _tevent_loop_once (tevent.c:533)
==14913== by 0x8B67CBA: tevent_common_loop_wait (tevent.c:637)
==14913==
==14913== 33,554,430 bytes in 2 blocks are still reachable in loss record 1,115 of 1,115
==14913== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
==14913== by 0x9FA4CCD: _plug_decode (plugin_common.c:666)
==14913== by 0x1453036A: gssapi_decode (gssapi.c:497)
==14913== by 0x9F9B783: sasl_decode (common.c:621)
==14913== by 0x69AC69C: sb_sasl_cyrus_decode (cyrus.c:188)
==14913== by 0x69AEF2F: sb_sasl_generic_read (sasl.c:711)
==14913== by 0x6790AEB: sb_debug_read (sockbuf.c:829)
==14913== by 0x67906AE: ber_int_sb_read (sockbuf.c:423)
==14913== by 0x678D789: ber_get_next (io.c:532)
==14913== by 0x69A6D4D: wait4msg (result.c:491)
==14913== by 0x12620EB4: sdap_process_result (sdap_async.c:165)
==14913== by 0x8B6DEA5: epoll_event_loop_once (tevent_epoll.c:728)
The biggest one (the 33,554,430 bytes reported as still reachable) almost
looks like a leak in libldap, because this line:
==14913== by 0x12620EB4: sdap_process_result (sdap_async.c:165)
is a call to ldap_result().
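Just to illustrate what I mean, here is a rough sketch of the contract
around ldap_result() (not the actual sdap code; drain_one_result is a
made-up name): every LDAPMessage that ldap_result() hands out stays
allocated inside libldap until it is released with ldap_msgfree(), and
memory that libldap or the SASL layer still holds a pointer to shows up
in valgrind as still reachable:

    #include <sys/time.h>
    #include <ldap.h>

    /* Sketch only: poll for one pending result and release it again.
     * Whatever ldap_result() returns must be freed with ldap_msgfree()
     * once the caller is done with it, otherwise it stays allocated. */
    static int drain_one_result(LDAP *ld)
    {
        LDAPMessage *msg = NULL;
        struct timeval no_wait = { 0, 0 };
        int ret;

        ret = ldap_result(ld, LDAP_RES_ANY, 0, &no_wait, &msg);
        if (ret <= 0) {
            return ret;            /* -1 error, 0 nothing pending */
        }

        /* ... hand msg over to the parsing code ... */

        ldap_msgfree(msg);         /* skipping this would leak the message */
        return ret;
    }
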
But even the other leaks make little sense to me, like this one:
==14913== 7,998,231 bytes in 64,275 blocks are indirectly lost in loss record 1,114 of 1,115
==14913== at 0x4C28A2E: malloc (vg_replace_malloc.c:270)
==14913== by 0x8D76DCA: _talloc_array (talloc.c:668)
==14913== by 0x56D0A48: ldb_val_dup (ldb_msg.c:106)
==14913== by 0x52668B4: sysdb_attrs_add_val_int (sysdb.c:555)
==14913== by 0x1264C964: sdap_parse_entry (sdap.c:576)
==14913== by 0x1261C702: sdap_get_and_parse_generic_parse_entry (sdap_async.c:1749)
==14913== by 0x1261FEB3: sdap_get_generic_op_finished (sdap_async.c:1487)
==14913== by 0x1262155E: sdap_process_result (sdap_async.c:352)
==14913== by 0x8B6DEA5: epoll_event_loop_once (tevent_epoll.c:728)
==14913== by 0x8B6C2D5: std_event_loop_once (tevent_standard.c:114)
==14913== by 0x8B67C3C: _tevent_loop_once (tevent.c:533)
==14913== by 0x8B67CBA: tevent_common_loop_wait (tevent.c:637)
In sdap_parse_entry, we allocate the sysdb_attrs on the state, and all the
internal ldb structures hang off the sysdb_attrs. The context is then
stolen onto the caller-provided context, which is typically the state of a
tevent request that is passed upwards. So the only idea I have is that
somewhere we steal the request to NULL, but I don't know where or how to
look for this.
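One way I can think of to at least check the "stolen to NULL" theory
(just a sketch with made-up names, assuming we can temporarily patch
some debug code into the provider): talloc can track the NULL context,
and talloc_report_full(NULL, ...) then lists everything parented to it,
so anything that was stolen onto NULL should show up there and keep
growing:

    #include <stdio.h>
    #include <talloc.h>

    /* Dump everything owned by the talloc NULL context to a log file.
     * Needs talloc_enable_null_tracking() to have been called before
     * the allocations we are interested in. */
    static void dump_null_context(const char *path)
    {
        FILE *f = fopen(path, "a");
        if (f == NULL) {
            return;
        }
        talloc_report_full(NULL, f);
        fclose(f);
    }

    int main(void)
    {
        talloc_enable_null_tracking();

        /* simulate the suspected bug */
        TALLOC_CTX *req = talloc_new(NULL);          /* stands in for a tevent req state */
        void *attrs = talloc_strdup(req, "attrs");   /* normally freed together with req */
        talloc_steal(NULL, attrs);                   /* reparent onto NULL */
        talloc_free(req);                            /* attrs is now orphaned, never freed */

        dump_null_context("/tmp/talloc_null.log");
        return 0;
    }

If nothing ever shows up in that report while the RSS keeps growing, we
would at least know the memory is still owned by some request that is
never freed rather than by the NULL context.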
Also, it seems odd that with so many people running sssd, this is the
only user (so far?) who has reported this issue.
According to the ps output in the ticket the increase is not that big,
122520 -> 195176 in RSS after 48h. I guess this might be the reason why
many people do not notice the increase.
I did a small test where I loaded the members of a larger group in a
plain LDAP setup and saw an increase from 16100 to 42224 which stayed
overnight. I will try to get more data from my setup.
bye,
Sumit
>
> Does anyone have an idea how to continue?