Re: [389-users] Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort

Friday, 18 November 2011

On 11/18/2011 05:31 AM, "Reddy, Gandikota Rushendra (ES&N)" 
<gandikota-rushendra.reddy(a)hp.com&gt; wrote:

This was a bounce - please check your list membership before posting again.
...
 On a multi CPU system, ns-slapd process aborts. Stack trace says it
aborted calling PR_ASSERT() at the following location in connection.c (Release version of
389 dirsrv is 1.2.0). Is there any chance you could attempt to reproduce this issue
with the 
latest 1.2.9.9 (stable) or 1.2.10.a5 (testing)? Or provide a 
reproducer.  1.2.0 is quite old.

Also, please file a bug at 
https://bugzilla.redhat.com/enter_bug.cgi?product=389
...
 -----
 Connection.c
 1690 int connection_read_operation(Connection *conn, Operation *op, ber_tag_t *tag, int
*remaining_data)
 1691 {

 1734         }
 1735         /* If we still haven't seen a complete PDU, read from the network */
 1736         while (*tag == LBER_DEFAULT) {
 1737                 int ioblocktimeout_waits = config_get_ioblocktimeout() /
CONN_TURBO_TIMEOUT_INTERVAL;
 1738                 /* We should never get here with data remaining in the buffer */
 1739                 PR_ASSERT( !new_operation || 0 ==
(conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset)
);<------------------ aborted calling PR_ASSERT()
 1740                 /* We make a non-blocking read call */
 -----

 Using instrumented image I have collected data and found that:

 While a thread is working on connection's private buffer, another thread which
belongs to older connection makes the connection readable again by modifying c_gettingber
to zero.
 This results in waking up one more thread. Finally more than one thread works
simultaneously on connection's private buffer and results in aborting ns-slapd.

 It happens as below:

 1. Thead X is working on Connection-1's last opereration.

 2. After performing the LDAP operation ( connection_dispatch_operation() ) decrementes
the connection's c_refcnt  by calling connection_release_nolock() in
connection_threadmain() function.

 3. Since this is the last operation, c_refcnt becomes zero.

 4. Thread X yet to execute the code that updates c_threadnumber and  c_gettingber. But as
soon as c_refcnt of the connection becomes zero, main thread allocates the same connection
object/structure to new incoming connection connection-2.  Thread-Y started executing
operation belongs to this newer connection (Connection-2).

 -----
 connection.c

 2008 static void
 2009 connection_threadmain()
 2010 {

 2187                     if (!thread_turbo_flag&&  !more_data) {
 2188                                 connection_release_nolock
(conn);<------------------  Thread-X decremented c_refcnt and now c_refcnt = 0
 2189                         }
 2190                     PR_Unlock( conn->c_mutex );
 2191                 }
 2192                 /* Since we didn't do so earlier, we need to make a replication
connection readable again here */
 2193                 if ( ((1 == is_timedout) || (replication_connection&& 
!thread_turbo_flag))&&  !more_data)
 2194                         connection_make_readable(conn);
 2195                 pb = NULL;
 2196
 2197                 if (!thread_turbo_flag&&  !more_data) { /* Don't do this
in turbo mode */
 2198                         PR_Lock( conn->c_mutex );
 2199                         /* if the threadnumber of now below the maximum, wakeup
 2200                          * the listener thread so that we start polling on this
 2201                          * connection again
 2202                          */
 2203                         /* DBDB I think this code is bogus -- we already signaled
the listener above here */
 2204                         if (conn->c_threadnumber ==
config_get_maxthreadsperconn())
 2205                                 need_wakeup = 1;
 2206                         else
 2207                                 need_wakeup = 0;
 2208                         conn->c_threadnumber--;
 2209                         PR_Unlock( conn->c_mutex );
 2210
 2211                         if (need_wakeup)
 2212                                 signal_listner();
 2213                 }
 2214
 2215
 2216         } /* while (1) */
 2217 }
 ----

 5. Thread-Y started processing operation data belongs to newer connection Connection-2 
(connection_read_operation() ).

 6. Now Thread-X of connection-1 called make_connection_readable() and made the connection
readable again by setting c_gettingbuffer to 0.

 7. This wakes up another thread of connection-2 and it too starts processing operation
data (connection_read_operation() ).

 8. So at this point of time more than one thread is working on connection's private
data.

 9. As two threads are working on the connections c_buffer, c_bufferbytes and c_offset, at
some time it meats conditions to abort() at the following location:

 Connection.c:  connection_read_operation()

 1734         }
 1735         /* If we still haven't seen a complete PDU, read from the network */
 1736         while (*tag == LBER_DEFAULT) {
 1737                 int ioblocktimeout_waits = config_get_ioblocktimeout() /
CONN_TURBO_TIMEOUT_INTERVAL;
 1738                 /* We should never get here with data remaining in the buffer */
 1739                 PR_ASSERT( !new_operation || 0 ==
(conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset)
);<------------------ aborted calling PR_ASSERT()
 1740                 /* We make a non-blocking read call */
 -----

 The probable fix could be:

 In the connection_threadmain() function, if it is CONN_DONE don't make the connection
readable again after decrementing the c_refcnt.

 So should we handle CONN_DONE and CONN_TIMEDOUT situation differently as below?

 Present code:

 2097                 switch (ret) {
 2098                         case CONN_DONE:
 2099                                 /* This means that the connection was closed, so
clear turbo mode */
 2100                         /*FALLTHROUGH*/
 2101                         case CONN_TIMEDOUT:
 2102                                 thread_turbo_flag = 0;
 2103                                 is_timedout = 1;

 new code:

 2097                 switch (ret) {
 2098                         case CONN_DONE:
 2099                                 /* This means that the connection was closed, so
clear turbo mode */
                                       is_conn_done = 0;<-----------------
                                       thread_turbo_flag = 0;<-------------
 2101                         case CONN_TIMEDOUT:
 2102                                 thread_turbo_flag = 0;
 2103                                 is_timedout = 1; Note that you don't have
to set thread_turbo_flag = 0 in the CONN_DONE 
case because it will be set when it falls through to the CONN_TIMEDOUT case.
...

 Present:

 2193                 if ( ((1 == is_timedout) || (replication_connection&& 
!thread_turbo_flag))&&  !more_data )
 2194                         connection_make_readable(conn);
 2195                 pb = NULL;

 new:

 2193    --------->   if ( !is_conn_done&&  ((1 == is_timedout) ||
(replication_connection&&  !thread_turbo_flag))&&  !more_data )
 2194                         connection_make_readable(conn);
 2195                 pb = NULL;

 Is it right way to fix it?  Please provide your inputs. If ret is CONN_DONE, but
you set is_conn_done = 0, how does the code 
know the connection is really done?
...

 Thanks&  regards,
 Rushendra

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005