[389-users] Lots of abandoned connections from sssd

Mon Nov 10 22:55:34 UTC 2014

On 11/10/2014 03:32 PM, Orion Poplawski wrote:
> On 11/06/2014 03:14 AM, Rich Megginson wrote:
>> On 11/06/2014 04:16 AM, Orion Poplawski wrote:
>>> Just recently we're seeing some very strange behavior on our system.
>>> Periodically we will see a sssd process start to have an ever greater number
>>> of connections to our ldap server until the server runs out of file
>>> descriptors.  This seems to be happening with a particular user, who is
>>> having trouble logging in at times, particularly with email (dovecot).  We
>>> see entries like the following on our server:
>>>
>>> [05/Nov/2014:17:14:51 -0700] conn=1786153 op=0 EXT
>>> oid="1.3.6.1.4.1.1466.20037" name="startTLS"
>>> [05/Nov/2014:17:14:51 -0700] conn=1786153 op=0 RESULT err=0 tag=120
>>> nentries=0 etime=0
>>> [05/Nov/2014:17:14:51 -0700] conn=1786153 SSL 128-bit AES
>>> [05/Nov/2014:17:14:51 -0700] conn=1786153 op=1 BIND
>>> dn="uid=user,ou=People,dc=domain,dc=com" method=128 version=3
>>> [05/Nov/2014:17:14:56 -0700] conn=1786153 op=2 ABANDON targetop=NOTFOUND
>>> msgid=2
>>> [05/Nov/2014:17:14:56 -0700] conn=1786153 op=3 UNBIND
>>> [05/Nov/2014:17:14:56 -0700] conn=1786153 op=3 fd=1022 closed - U1
>>>
>>> I don't yet have debug info from the sssd process.  Any ideas from the above?
>>>
>>> Restarting the sssd process seems to clear things up for a while.
>>>
>>> - Orion
>>>
>> Try to reproduce the problem while using gdb to capture stack traces every few
>> seconds as in http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-hangs
>> Ideally, we can get some stack traces of the server during the time between
>> the BIND and the ABANDON.
> If I catch the problem early enough I can still get a stack trace.  A series
> of them are in http://www.cora.nwra.com/~orion/ns-slapd-trace.tar.gz.
> Anything useful there?
>
These traces show the server is almost entirely idle - not working on 
client operations, not deadlocked or otherwise waiting on locks, not 
waiting on I/O to complete.  So either we're still not catching dirsrv 
in the act, or dirsrv is not the problem.
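
To see how widespread the pattern in the quoted access log is (a BIND that
never gets a RESULT and is followed by an ABANDON on the same connection),
something like the following could be run over the access log.  This is only
a rough sketch: it assumes each log entry is on a single line (the excerpts
above were wrapped by the mail client), that the timestamps use English month
abbreviations, and the function name find_abandoned_binds is made up for
illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# [05/Nov/2014:17:14:51 -0700] conn=1786153 op=1 BIND dn="..." method=128 version=3
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] conn=(?P<conn>\d+) (?:op=(?P<op>-?\d+) )?(?P<rest>.*)$"
)
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def find_abandoned_binds(lines):
    """Return (conn, bind_dn, seconds) for each BIND that got no RESULT
    before an ABANDON appeared on the same connection."""
    pending = {}  # conn -> (op, timestamp, dn) of a BIND still awaiting RESULT
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an access log entry in the assumed format
        ts = datetime.strptime(m.group("ts"), TS_FORMAT)
        conn = int(m.group("conn"))
        op = m.group("op")
        rest = m.group("rest")
        if rest.startswith("BIND "):
            dn = re.search(r'dn="([^"]*)"', rest)
            pending[conn] = (op, ts, dn.group(1) if dn else "")
        elif rest.startswith("RESULT"):
            # The BIND completed normally, so it is no longer of interest.
            if conn in pending and pending[conn][0] == op:
                del pending[conn]
        elif rest.startswith("ABANDON") and conn in pending:
            _, bind_ts, dn = pending.pop(conn)
            found.append((conn, dn, (ts - bind_ts).total_seconds()))
    return found
```

Calling find_abandoned_binds(open(path)) on an access log file would list the
affected connections and how long each BIND was outstanding, which may help
pick the moment to capture stack traces.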