Hi Rich, Thanks for the reply. I have attached the sample client code. It uses sync search calls. It calls ldap_simple_bind_s ldap_search_ext_s (3 times) ldap_simple_bind_s ldap_search_ext_s (3 times) return It doesn't call unbind. As you said, decrementing the reference count should be the last thing before control goes back to starting of the while loop in connection_threadmain. But does this change has any ill effects as signal_listner() and connection_make_readable() are called before decrementing reference count?? (That is waking up the listener thread and setting gettingber to zero before decrementing reference count.) I request your inputs on this. If I understood correctly, the changes you suggested will look similar to(please correct me if not): 2187 // if (!thread_turbo_flag && !more_data) { 2188 // connection_release_nolock (conn); <------ don't touch c_refcnt 2189 // } 2190 PR_Unlock( conn->c_mutex ); 2191 } 2192 /* Since we didn't do so earlier, we need to make a replication connection readable again here */ 2193 if ( ((1 == is_timedout) || (replication_connection && !thread_turbo_flag)) && !more_data) 2194 connection_make_readable(conn); <---- set gettingber 2195 pb = NULL; +------------> if(doshutdown) { | PR_Lock( conn->c_mutex ); | | c_gettingber = 0; | | if (conn->c_threadnumber == config_get_maxthreadsperconn()) | need_wakeup = 1; | else | need_wakeup = 0; | | conn->c_threadnumber--; | | if (need_wakeup) | signal_listner(); | connection_release_nolock (conn); | | PR_Unlock( conn->c_mutex ); | return; +------------> } 2196 2197 if (!thread_turbo_flag && !more_data) { /* Don't do this in turbo mode */ 2198 PR_Lock( conn->c_mutex ); 2199 /* if the threadnumber of now below the maximum, wakeup 2200 * the listener thread so that we start polling on this 2201 * connection again 2202 */ 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) 2205 need_wakeup = 1; 2206 else 2207 need_wakeup = 0; 2208 conn->c_threadnumber--; 2209 PR_Unlock( conn->c_mutex ); 2210 2211 if (need_wakeup) 2212 signal_listner(); 2213 } +--------> if ( !( pb->pb_op->o_flags & OP_FLAG_PS )) { | if (!thread_turbo_flag && !more_data) { | PR_Lock( conn->c_mutex ); | connection_release_nolock (conn); <---- decrement ref cont | PR_Unlock( conn->c_mutex ); | } +--------> } 2190 2214 2215 2216 } /* while (1) */ 2217 } Thanks & regards, Rushendra ________________________________________ From: Rich Megginson [rmeggins@xxxxxxxxxx] Sent: Friday, November 18, 2011 11:34 PM To: Reddy, Gandikota Rushendra (ES&N) Cc: 389-users@xxxxxxxxxxxxxxxxxxxxxxx Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort On 11/18/2011 01:06 PM, Reddy, Gandikota Rushendra (ES&N) wrote: > Hi Rich, > > Thanks for your reply. > This issue can be reproduced by writing a client application that doesn't call UNBIND before exiting. > Typically the application does 1)BIND 2) Search 3) Search and exit without calling UNBIND. So a BIND, followed by two Searches, followed by exit? Something like ldap_sasl_bind_s() ldap_search_s() ldap_search_s() exit() ? or are you using async search operations? Do you have such a client? > Multiple instances of this application need to be run simultaneously for 12 to 15 hours to reproduce the issue. > > > Thanks for the correction. The probable fix I had mentioned supposed to be as below: > > > is_conn_done = 0;<------------- > 2097 switch (ret) { > 2098 case CONN_DONE: > 2099 /* This means that the connection was closed, so clear turbo mode */ > is_conn_done = 1;<----------------- > thread_turbo_flag = 0;<------------- > break;<----------- > > 2101 case CONN_TIMEDOUT: > 2102 thread_turbo_flag = 0; > 2103 is_timedout = 1; > > > 2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) > 2194 connection_make_readable(conn); > 2195 pb = NULL; > > > Here CONN_DONE and CONN_TIMEOUT conditions are handled differently. So if it is CONN_DONE, we don't call connection_make_readable() again. > The rationale behind this is in all occurences of CONN_DONE conditions we call either disconnect_server() or disconnection_server_nomutex(). > These functions set c_gettingber to 0. So we can skip calling connection_make_readable() (to set c_gettingber to 0) if it is CONN_DONE. > > > > In addition to above changes, I believe following code segment also needs to be secured similarly. > > > connection.c : connection_threadmain() > > > 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ > 2198 PR_Lock( conn->c_mutex ); > 2199 /* if the threadnumber of now below the maximum, wakeup > 2200 * the listener thread so that we start polling on this > 2201 * connection again > 2202 */ > 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ > 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) > 2205 need_wakeup = 1; > 2206 else > 2207 need_wakeup = 0; > 2208 conn->c_threadnumber--; > 2209 PR_Unlock( conn->c_mutex ); > 2210 > 2211 if (need_wakeup) > 2212 signal_listner(); > 2213 } > > Once the connection object is reallocated to a newer connection, the thread belongs to older connection should not touch > any variables of the connection structure including c_threadnumber. In the above piece of code c_threadnumber is decremented. > This could result in unexpected behavior. From the data collected from instrumented image, > I could see c_threadnumber again becoming zero though a thread is executing an operation belongs to newer connection (c_threadnumber supposed to be 1). > > In addition to this, if c_threadnumber is 5(default value), it calls signal_listener(). ( In the abort() scenario I don't think this could happen. > When the connection object is being reallocated, the value of c_threadnumber is reset to zero. ) > > So I believe before executing above pieces of code, we should make sure that the connection object is not released/realloacted to some other connection. If I understand the problem correctly, it is that we should not touch the conn object at all after doing the connection_release_nolock() - is that correct? So perhaps the simplest solution will be to move the connection_release_nolock() to after the code that does the connection_make_readable() and the threadnumber/wakeup handling? And also in the doshutdown case - have to release the connection before doing the return. Might also need a connection_make_readable_nolock() so that we can reset the gettingber flag and decrement the refcount and do the threadnumber handling inside the same mutex. I think in your proposed solution, the code will not do the connection_make_readable() until it loops again - this seems like a minor change in behavior, but the code in connection_threadmain is very sensitive - small changes can have very large unintended consequences. > > I made changes as mentioned above (is_conn_done) and the problem is no more reproducing. > But I am not sure will it create any other issues. > > > The other probable fix could be: > > 1. Make a note of connection Id (c_connid) and connection object address (conn) before executing any operation. > 2. Before executing the above mentioned code, check whether c_connid and conn are still same. > If not, the connection object is already reallocated. So skip the code and goto starting of the while loop in connection_threadmain(). > > > > connection.c: conntion_threadmain() > > 2071 /* Once we're here we have a pb */ > 2072 conn = pb->pb_conn; > 2073 op = pb->pb_op; > connid = conn->c_connid;<--------- > connaddr = conn;<----------- > > > > > > ----------> if( (connid == conn->c_connid)&& (connaddr == conn ) ) { // do we need to check operation id also? > > 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) > 2194 connection_make_readable(conn); > } > 2195 pb = NULL; > > > ----------> if( (connid == conn->c_connid)&& (connaddr == conn ) ) { > > 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ > 2198 PR_Lock( conn->c_mutex ); > 2199 /* if the threadnumber of now below the maximum, wakeup > 2200 * the listener thread so that we start polling on this > 2201 * connection again > 2202 */ > 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ > 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) > 2205 need_wakeup = 1; > 2206 else > 2207 need_wakeup = 0; > 2208 conn->c_threadnumber--; > 2209 PR_Unlock( conn->c_mutex ); > 2210 > 2211 if (need_wakeup) > 2212 signal_listner(); > 2213 } > > } > > > Please provide your suggestions on the above fixes. > > > Regards, > Rushendra > > > > > ________________________________________ > From: Rich Megginson [rmeggins@xxxxxxxxxx] > Sent: Friday, November 18, 2011 1:54 PM > To: Reddy, Gandikota Rushendra (ES&N); 389-users@xxxxxxxxxxxxxxxxxxxxxxx > Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort > > On 11/18/2011 05:31 AM, "Reddy, Gandikota Rushendra (ES&N)"<gandikota-rushendra.reddy@xxxxxx><mailto:gandikota-rushendra.reddy@xxxxxx> wrote: > > This was a bounce - please check your list membership before posting again. > > On a multi CPU system, ns-slapd process aborts. Stack trace says it aborted calling PR_ASSERT() at the following location in connection.c (Release version of 389 dirsrv is 1.2.0). > > > Is there any chance you could attempt to reproduce this issue with the latest 1.2.9.9 (stable) or 1.2.10.a5 (testing)? Or provide a reproducer. 1.2.0 is quite old. > > Also, please file a bug at https://bugzilla.redhat.com/enter_bug.cgi?product=389 > > > ----- > Connection.c > 1690 int connection_read_operation(Connection *conn, Operation *op, ber_tag_t *tag, int *remaining_data) > 1691 { > > 1734 } > 1735 /* If we still haven't seen a complete PDU, read from the network */ > 1736 while (*tag == LBER_DEFAULT) { > 1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL; > 1738 /* We should never get here with data remaining in the buffer */ > 1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT() > 1740 /* We make a non-blocking read call */ > ----- > > Using instrumented image I have collected data and found that: > > While a thread is working on connection's private buffer, another thread which belongs to older connection makes the connection readable again by modifying c_gettingber to zero. > This results in waking up one more thread. Finally more than one thread works simultaneously on connection's private buffer and results in aborting ns-slapd. > > > It happens as below: > > 1. Thead X is working on Connection-1's last opereration. > > 2. After performing the LDAP operation ( connection_dispatch_operation() ) decrementes the connection's c_refcnt by calling connection_release_nolock() in connection_threadmain() function. > > 3. Since this is the last operation, c_refcnt becomes zero. > > 4. Thread X yet to execute the code that updates c_threadnumber and c_gettingber. But as soon as c_refcnt of the connection becomes zero, main thread allocates the same connection object/structure to new incoming connection connection-2. Thread-Y started executing operation belongs to this newer connection (Connection-2). > > ----- > connection.c > > 2008 static void > 2009 connection_threadmain() > 2010 { > > > 2187 if (!thread_turbo_flag&& !more_data) { > 2188 connection_release_nolock (conn);<------------------ Thread-X decremented c_refcnt and now c_refcnt = 0 > 2189 } > 2190 PR_Unlock( conn->c_mutex ); > 2191 } > 2192 /* Since we didn't do so earlier, we need to make a replication connection readable again here */ > 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data) > 2194 connection_make_readable(conn); > 2195 pb = NULL; > 2196 > 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ > 2198 PR_Lock( conn->c_mutex ); > 2199 /* if the threadnumber of now below the maximum, wakeup > 2200 * the listener thread so that we start polling on this > 2201 * connection again > 2202 */ > 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ > 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) > 2205 need_wakeup = 1; > 2206 else > 2207 need_wakeup = 0; > 2208 conn->c_threadnumber--; > 2209 PR_Unlock( conn->c_mutex ); > 2210 > 2211 if (need_wakeup) > 2212 signal_listner(); > 2213 } > 2214 > 2215 > 2216 } /* while (1) */ > 2217 } > ---- > > 5. Thread-Y started processing operation data belongs to newer connection Connection-2 (connection_read_operation() ). > > 6. Now Thread-X of connection-1 called make_connection_readable() and made the connection readable again by setting c_gettingbuffer to 0. > > 7. This wakes up another thread of connection-2 and it too starts processing operation data (connection_read_operation() ). > > 8. So at this point of time more than one thread is working on connection's private data. > > 9. As two threads are working on the connections c_buffer, c_bufferbytes and c_offset, at some time it meats conditions to abort() at the following location: > > Connection.c: connection_read_operation() > > 1734 } > 1735 /* If we still haven't seen a complete PDU, read from the network */ > 1736 while (*tag == LBER_DEFAULT) { > 1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL; > 1738 /* We should never get here with data remaining in the buffer */ > 1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT() > 1740 /* We make a non-blocking read call */ > ----- > > > > The probable fix could be: > > In the connection_threadmain() function, if it is CONN_DONE don't make the connection readable again after decrementing the c_refcnt. > > So should we handle CONN_DONE and CONN_TIMEDOUT situation differently as below? > > > > Present code: > > 2097 switch (ret) { > 2098 case CONN_DONE: > 2099 /* This means that the connection was closed, so clear turbo mode */ > 2100 /*FALLTHROUGH*/ > 2101 case CONN_TIMEDOUT: > 2102 thread_turbo_flag = 0; > 2103 is_timedout = 1; > > > new code: > > > 2097 switch (ret) { > 2098 case CONN_DONE: > 2099 /* This means that the connection was closed, so clear turbo mode */ > is_conn_done = 0;<----------------- > thread_turbo_flag = 0;<------------- > 2101 case CONN_TIMEDOUT: > 2102 thread_turbo_flag = 0; > 2103 is_timedout = 1; > > > Note that you don't have to set thread_turbo_flag = 0 in the CONN_DONE case because it will be set when it falls through to the CONN_TIMEDOUT case. > > > > Present: > > 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) > 2194 connection_make_readable(conn); > 2195 pb = NULL; > > > > new: > > 2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) > 2194 connection_make_readable(conn); > 2195 pb = NULL; > > > > Is it right way to fix it? Please provide your inputs. > > > If ret is CONN_DONE, but you set is_conn_done = 0, how does the code know the connection is really done? > > > > > Thanks& regards, > Rushendra > > > > > > > > > > > > > >
<<attachment: reproducer.zip>>
-- 389 users mailing list 389-users@xxxxxxxxxxxxxxxxxxxxxxx https://admin.fedoraproject.org/mailman/listinfo/389-users