[389-users] Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort
Rich Megginson
rmeggins at redhat.com
Sat Nov 19 20:57:09 UTC 2011
On 11/18/2011 11:36 PM, Reddy, Gandikota Rushendra (ES&N) wrote:
> Hi Rich,
>
> Thanks for the reply.
>
> I have attached the sample client code.
>
> It uses sync search calls.
>
> It calls
>
> ldap_simple_bind_s
> ldap_search_ext_s (3 times)
> ldap_simple_bind_s
> ldap_search_ext_s (3 times)
> return
>
> It doesn't call unbind.
>
> As you said, decrementing the reference count should be the last thing before control goes back to
> starting of the while loop in connection_threadmain. But does this change has any ill effects as signal_listner()
> and connection_make_readable() are called before decrementing reference count?? (That is waking up the listener thread and
> setting gettingber to zero before decrementing reference count.) I request your inputs on this.
connection_make_readable() - as long as all code that accesses
gettingber first acquires the conn->c_mutex lock, it should be fine
signal_listner() - probably better to set the need_wakeup flag inside
the mutex, then call signal_listner() after doing the connection_release
and the PR_Unlock()
> If I understood correctly, the changes you suggested will look similar to(please correct me if not):
>
>
>
> 2187 // if (!thread_turbo_flag&& !more_data) {
> 2188 // connection_release_nolock (conn);<------ don't touch c_refcnt
> 2189 // }
> 2190 PR_Unlock( conn->c_mutex );
> 2191 }
> 2192 /* Since we didn't do so earlier, we need to make a replication connection readable again here */
> 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data)
> 2194 connection_make_readable(conn);<---- set gettingber
> 2195 pb = NULL;
>
> +------------> if(doshutdown) {
> | PR_Lock( conn->c_mutex );
> |
> | c_gettingber = 0;
> |
> | if (conn->c_threadnumber == config_get_maxthreadsperconn())
> | need_wakeup = 1;
> | else
> | need_wakeup = 0;
> |
> | conn->c_threadnumber--;
> |
> | if (need_wakeup)
> | signal_listner();
> | connection_release_nolock (conn);
> |
> | PR_Unlock( conn->c_mutex );
> | return;
> +------------> }
>
>
> 2196
> 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */
> 2198 PR_Lock( conn->c_mutex );
> 2199 /* if the threadnumber of now below the maximum, wakeup
> 2200 * the listener thread so that we start polling on this
> 2201 * connection again
> 2202 */
> 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */
> 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn())
> 2205 need_wakeup = 1;
> 2206 else
> 2207 need_wakeup = 0;
> 2208 conn->c_threadnumber--;
> 2209 PR_Unlock( conn->c_mutex );
> 2210
> 2211 if (need_wakeup)
> 2212 signal_listner();
> 2213 }
>
>
>
> +--------> if ( !( pb->pb_op->o_flags& OP_FLAG_PS )) {
> | if (!thread_turbo_flag&& !more_data) {
> | PR_Lock( conn->c_mutex );
> | connection_release_nolock (conn);<---- decrement ref cont
> | PR_Unlock( conn->c_mutex );
> | }
> +--------> }
>
>
> 2190
> 2214
> 2215
> 2216 } /* while (1) */
> 2217 }
>
>
>
>
> Thanks& regards,
> Rushendra
>
>
>
>
>
>
>
>
>
>
>
>
>
> ________________________________________
> From: Rich Megginson [rmeggins at redhat.com]
> Sent: Friday, November 18, 2011 11:34 PM
> To: Reddy, Gandikota Rushendra (ES&N)
> Cc: 389-users at lists.fedoraproject.org
> Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort
>
> On 11/18/2011 01:06 PM, Reddy, Gandikota Rushendra (ES&N) wrote:
>> Hi Rich,
>>
>> Thanks for your reply.
>> This issue can be reproduced by writing a client application that doesn't call UNBIND before exiting.
>> Typically the application does 1)BIND 2) Search 3) Search and exit without calling UNBIND.
> So a BIND, followed by two Searches, followed by exit?
> Something like
> ldap_sasl_bind_s()
> ldap_search_s()
> ldap_search_s()
> exit()
> ?
> or are you using async search operations?
>
> Do you have such a client?
>> Multiple instances of this application need to be run simultaneously for 12 to 15 hours to reproduce the issue.
>>
>>
>> Thanks for the correction. The probable fix I had mentioned supposed to be as below:
>>
>>
>> is_conn_done = 0;<-------------
>> 2097 switch (ret) {
>> 2098 case CONN_DONE:
>> 2099 /* This means that the connection was closed, so clear turbo mode */
>> is_conn_done = 1;<-----------------
>> thread_turbo_flag = 0;<-------------
>> break;<-----------
>>
>> 2101 case CONN_TIMEDOUT:
>> 2102 thread_turbo_flag = 0;
>> 2103 is_timedout = 1;
>>
>>
>> 2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data )
>> 2194 connection_make_readable(conn);
>> 2195 pb = NULL;
>>
>>
>> Here CONN_DONE and CONN_TIMEOUT conditions are handled differently. So if it is CONN_DONE, we don't call connection_make_readable() again.
>> The rationale behind this is in all occurences of CONN_DONE conditions we call either disconnect_server() or disconnection_server_nomutex().
>> These functions set c_gettingber to 0. So we can skip calling connection_make_readable() (to set c_gettingber to 0) if it is CONN_DONE.
>>
>>
>>
>> In addition to above changes, I believe following code segment also needs to be secured similarly.
>>
>>
>> connection.c : connection_threadmain()
>>
>>
>> 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */
>> 2198 PR_Lock( conn->c_mutex );
>> 2199 /* if the threadnumber of now below the maximum, wakeup
>> 2200 * the listener thread so that we start polling on this
>> 2201 * connection again
>> 2202 */
>> 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */
>> 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn())
>> 2205 need_wakeup = 1;
>> 2206 else
>> 2207 need_wakeup = 0;
>> 2208 conn->c_threadnumber--;
>> 2209 PR_Unlock( conn->c_mutex );
>> 2210
>> 2211 if (need_wakeup)
>> 2212 signal_listner();
>> 2213 }
>>
>> Once the connection object is reallocated to a newer connection, the thread belongs to older connection should not touch
>> any variables of the connection structure including c_threadnumber. In the above piece of code c_threadnumber is decremented.
>> This could result in unexpected behavior. From the data collected from instrumented image,
>> I could see c_threadnumber again becoming zero though a thread is executing an operation belongs to newer connection (c_threadnumber supposed to be 1).
>>
>> In addition to this, if c_threadnumber is 5(default value), it calls signal_listener(). ( In the abort() scenario I don't think this could happen.
>> When the connection object is being reallocated, the value of c_threadnumber is reset to zero. )
>>
>> So I believe before executing above pieces of code, we should make sure that the connection object is not released/realloacted to some other connection.
> If I understand the problem correctly, it is that we should not touch
> the conn object at all after doing the connection_release_nolock() - is
> that correct?
>
> So perhaps the simplest solution will be to move the
> connection_release_nolock() to after the code that does the
> connection_make_readable() and the threadnumber/wakeup handling? And
> also in the doshutdown case - have to release the connection before
> doing the return. Might also need a connection_make_readable_nolock()
> so that we can reset the gettingber flag and decrement the refcount and
> do the threadnumber handling inside the same mutex.
>
> I think in your proposed solution, the code will not do the
> connection_make_readable() until it loops again - this seems like a
> minor change in behavior, but the code in connection_threadmain is very
> sensitive - small changes can have very large unintended consequences.
>> I made changes as mentioned above (is_conn_done) and the problem is no more reproducing.
>> But I am not sure will it create any other issues.
>>
>>
>> The other probable fix could be:
>>
>> 1. Make a note of connection Id (c_connid) and connection object address (conn) before executing any operation.
>> 2. Before executing the above mentioned code, check whether c_connid and conn are still same.
>> If not, the connection object is already reallocated. So skip the code and goto starting of the while loop in connection_threadmain().
>>
>>
>>
>> connection.c: conntion_threadmain()
>>
>> 2071 /* Once we're here we have a pb */
>> 2072 conn = pb->pb_conn;
>> 2073 op = pb->pb_op;
>> connid = conn->c_connid;<---------
>> connaddr = conn;<-----------
>>
>>
>>
>>
>>
>> ----------> if( (connid == conn->c_connid)&& (connaddr == conn ) ) { // do we need to check operation id also?
>>
>> 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data )
>> 2194 connection_make_readable(conn);
>> }
>> 2195 pb = NULL;
>>
>>
>> ----------> if( (connid == conn->c_connid)&& (connaddr == conn ) ) {
>>
>> 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */
>> 2198 PR_Lock( conn->c_mutex );
>> 2199 /* if the threadnumber of now below the maximum, wakeup
>> 2200 * the listener thread so that we start polling on this
>> 2201 * connection again
>> 2202 */
>> 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */
>> 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn())
>> 2205 need_wakeup = 1;
>> 2206 else
>> 2207 need_wakeup = 0;
>> 2208 conn->c_threadnumber--;
>> 2209 PR_Unlock( conn->c_mutex );
>> 2210
>> 2211 if (need_wakeup)
>> 2212 signal_listner();
>> 2213 }
>>
>> }
>>
>>
>> Please provide your suggestions on the above fixes.
>>
>>
>> Regards,
>> Rushendra
>>
>>
>>
>>
>> ________________________________________
>> From: Rich Megginson [rmeggins at redhat.com]
>> Sent: Friday, November 18, 2011 1:54 PM
>> To: Reddy, Gandikota Rushendra (ES&N); 389-users at lists.fedoraproject.org
>> Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort
>>
>> On 11/18/2011 05:31 AM, "Reddy, Gandikota Rushendra (ES&N)"<gandikota-rushendra.reddy at hp.com><mailto:gandikota-rushendra.reddy at hp.com> wrote:
>>
>> This was a bounce - please check your list membership before posting again.
>>
>> On a multi CPU system, ns-slapd process aborts. Stack trace says it aborted calling PR_ASSERT() at the following location in connection.c (Release version of 389 dirsrv is 1.2.0).
>>
>>
>> Is there any chance you could attempt to reproduce this issue with the latest 1.2.9.9 (stable) or 1.2.10.a5 (testing)? Or provide a reproducer. 1.2.0 is quite old.
>>
>> Also, please file a bug at https://bugzilla.redhat.com/enter_bug.cgi?product=389
>>
>>
>> -----
>> Connection.c
>> 1690 int connection_read_operation(Connection *conn, Operation *op, ber_tag_t *tag, int *remaining_data)
>> 1691 {
>>
>> 1734 }
>> 1735 /* If we still haven't seen a complete PDU, read from the network */
>> 1736 while (*tag == LBER_DEFAULT) {
>> 1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL;
>> 1738 /* We should never get here with data remaining in the buffer */
>> 1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT()
>> 1740 /* We make a non-blocking read call */
>> -----
>>
>> Using instrumented image I have collected data and found that:
>>
>> While a thread is working on connection's private buffer, another thread which belongs to older connection makes the connection readable again by modifying c_gettingber to zero.
>> This results in waking up one more thread. Finally more than one thread works simultaneously on connection's private buffer and results in aborting ns-slapd.
>>
>>
>> It happens as below:
>>
>> 1. Thead X is working on Connection-1's last opereration.
>>
>> 2. After performing the LDAP operation ( connection_dispatch_operation() ) decrementes the connection's c_refcnt by calling connection_release_nolock() in connection_threadmain() function.
>>
>> 3. Since this is the last operation, c_refcnt becomes zero.
>>
>> 4. Thread X yet to execute the code that updates c_threadnumber and c_gettingber. But as soon as c_refcnt of the connection becomes zero, main thread allocates the same connection object/structure to new incoming connection connection-2. Thread-Y started executing operation belongs to this newer connection (Connection-2).
>>
>> -----
>> connection.c
>>
>> 2008 static void
>> 2009 connection_threadmain()
>> 2010 {
>>
>>
>> 2187 if (!thread_turbo_flag&& !more_data) {
>> 2188 connection_release_nolock (conn);<------------------ Thread-X decremented c_refcnt and now c_refcnt = 0
>> 2189 }
>> 2190 PR_Unlock( conn->c_mutex );
>> 2191 }
>> 2192 /* Since we didn't do so earlier, we need to make a replication connection readable again here */
>> 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data)
>> 2194 connection_make_readable(conn);
>> 2195 pb = NULL;
>> 2196
>> 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */
>> 2198 PR_Lock( conn->c_mutex );
>> 2199 /* if the threadnumber of now below the maximum, wakeup
>> 2200 * the listener thread so that we start polling on this
>> 2201 * connection again
>> 2202 */
>> 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */
>> 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn())
>> 2205 need_wakeup = 1;
>> 2206 else
>> 2207 need_wakeup = 0;
>> 2208 conn->c_threadnumber--;
>> 2209 PR_Unlock( conn->c_mutex );
>> 2210
>> 2211 if (need_wakeup)
>> 2212 signal_listner();
>> 2213 }
>> 2214
>> 2215
>> 2216 } /* while (1) */
>> 2217 }
>> ----
>>
>> 5. Thread-Y started processing operation data belongs to newer connection Connection-2 (connection_read_operation() ).
>>
>> 6. Now Thread-X of connection-1 called make_connection_readable() and made the connection readable again by setting c_gettingbuffer to 0.
>>
>> 7. This wakes up another thread of connection-2 and it too starts processing operation data (connection_read_operation() ).
>>
>> 8. So at this point of time more than one thread is working on connection's private data.
>>
>> 9. As two threads are working on the connections c_buffer, c_bufferbytes and c_offset, at some time it meats conditions to abort() at the following location:
>>
>> Connection.c: connection_read_operation()
>>
>> 1734 }
>> 1735 /* If we still haven't seen a complete PDU, read from the network */
>> 1736 while (*tag == LBER_DEFAULT) {
>> 1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL;
>> 1738 /* We should never get here with data remaining in the buffer */
>> 1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT()
>> 1740 /* We make a non-blocking read call */
>> -----
>>
>>
>>
>> The probable fix could be:
>>
>> In the connection_threadmain() function, if it is CONN_DONE don't make the connection readable again after decrementing the c_refcnt.
>>
>> So should we handle CONN_DONE and CONN_TIMEDOUT situation differently as below?
>>
>>
>>
>> Present code:
>>
>> 2097 switch (ret) {
>> 2098 case CONN_DONE:
>> 2099 /* This means that the connection was closed, so clear turbo mode */
>> 2100 /*FALLTHROUGH*/
>> 2101 case CONN_TIMEDOUT:
>> 2102 thread_turbo_flag = 0;
>> 2103 is_timedout = 1;
>>
>>
>> new code:
>>
>>
>> 2097 switch (ret) {
>> 2098 case CONN_DONE:
>> 2099 /* This means that the connection was closed, so clear turbo mode */
>> is_conn_done = 0;<-----------------
>> thread_turbo_flag = 0;<-------------
>> 2101 case CONN_TIMEDOUT:
>> 2102 thread_turbo_flag = 0;
>> 2103 is_timedout = 1;
>>
>>
>> Note that you don't have to set thread_turbo_flag = 0 in the CONN_DONE case because it will be set when it falls through to the CONN_TIMEDOUT case.
>>
>>
>>
>> Present:
>>
>> 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data )
>> 2194 connection_make_readable(conn);
>> 2195 pb = NULL;
>>
>>
>>
>> new:
>>
>> 2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data )
>> 2194 connection_make_readable(conn);
>> 2195 pb = NULL;
>>
>>
>>
>> Is it right way to fix it? Please provide your inputs.
>>
>>
>> If ret is CONN_DONE, but you set is_conn_done = 0, how does the code know the connection is really done?
>>
>>
>>
>>
>> Thanks& regards,
>> Rushendra
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
More information about the 389-users
mailing list