Thanks a lot Rich. I will try the changes you have suggested. Regards, Rushendra ________________________________________ From: Rich Megginson [rmeggins@xxxxxxxxxx] Sent: Saturday, November 19, 2011 8:57 PM To: Reddy, Gandikota Rushendra (ES&N) Cc: 389-users@xxxxxxxxxxxxxxxxxxxxxxx Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort On 11/18/2011 11:36 PM, Reddy, Gandikota Rushendra (ES&N) wrote: > Hi Rich, > > Thanks for the reply. > > I have attached the sample client code. > > It uses sync search calls. > > It calls > > ldap_simple_bind_s > ldap_search_ext_s (3 times) > ldap_simple_bind_s > ldap_search_ext_s (3 times) > return > > It doesn't call unbind. > > As you said, decrementing the reference count should be the last thing before control goes back to > starting of the while loop in connection_threadmain. But does this change has any ill effects as signal_listner() > and connection_make_readable() are called before decrementing reference count?? (That is waking up the listener thread and > setting gettingber to zero before decrementing reference count.) I request your inputs on this. connection_make_readable() - as long as all code that accesses gettingber first acquires the conn->c_mutex lock, it should be fine signal_listner() - probably better to set the need_wakeup flag inside the mutex, then call signal_listner() after doing the connection_release and the PR_Unlock() > If I understood correctly, the changes you suggested will look similar to(please correct me if not): > > > > 2187 // if (!thread_turbo_flag&& !more_data) { > 2188 // connection_release_nolock (conn);<------ don't touch c_refcnt > 2189 // } > 2190 PR_Unlock( conn->c_mutex ); > 2191 } > 2192 /* Since we didn't do so earlier, we need to make a replication connection readable again here */ > 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data) > 2194 connection_make_readable(conn);<---- set gettingber > 2195 pb = NULL; > > +------------> if(doshutdown) { > | PR_Lock( conn->c_mutex ); > | > | c_gettingber = 0; > | > | if (conn->c_threadnumber == config_get_maxthreadsperconn()) > | need_wakeup = 1; > | else > | need_wakeup = 0; > | > | conn->c_threadnumber--; > | > | if (need_wakeup) > | signal_listner(); > | connection_release_nolock (conn); > | > | PR_Unlock( conn->c_mutex ); > | return; > +------------> } > > > 2196 > 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ > 2198 PR_Lock( conn->c_mutex ); > 2199 /* if the threadnumber of now below the maximum, wakeup > 2200 * the listener thread so that we start polling on this > 2201 * connection again > 2202 */ > 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ > 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) > 2205 need_wakeup = 1; > 2206 else > 2207 need_wakeup = 0; > 2208 conn->c_threadnumber--; > 2209 PR_Unlock( conn->c_mutex ); > 2210 > 2211 if (need_wakeup) > 2212 signal_listner(); > 2213 } > > > > +--------> if ( !( pb->pb_op->o_flags& OP_FLAG_PS )) { > | if (!thread_turbo_flag&& !more_data) { > | PR_Lock( conn->c_mutex ); > | connection_release_nolock (conn);<---- decrement ref cont > | PR_Unlock( conn->c_mutex ); > | } > +--------> } > > > 2190 > 2214 > 2215 > 2216 } /* while (1) */ > 2217 } > > > > > Thanks& regards, > Rushendra > > > > > > > > > > > > > > ________________________________________ > From: Rich Megginson [rmeggins@xxxxxxxxxx] > Sent: Friday, November 18, 2011 11:34 PM > To: Reddy, Gandikota Rushendra (ES&N) > Cc: 389-users@xxxxxxxxxxxxxxxxxxxxxxx > Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort > > On 11/18/2011 01:06 PM, Reddy, Gandikota Rushendra (ES&N) wrote: >> Hi Rich, >> >> Thanks for your reply. >> This issue can be reproduced by writing a client application that doesn't call UNBIND before exiting. >> Typically the application does 1)BIND 2) Search 3) Search and exit without calling UNBIND. > So a BIND, followed by two Searches, followed by exit? > Something like > ldap_sasl_bind_s() > ldap_search_s() > ldap_search_s() > exit() > ? > or are you using async search operations? > > Do you have such a client? >> Multiple instances of this application need to be run simultaneously for 12 to 15 hours to reproduce the issue. >> >> >> Thanks for the correction. The probable fix I had mentioned supposed to be as below: >> >> >> is_conn_done = 0;<------------- >> 2097 switch (ret) { >> 2098 case CONN_DONE: >> 2099 /* This means that the connection was closed, so clear turbo mode */ >> is_conn_done = 1;<----------------- >> thread_turbo_flag = 0;<------------- >> break;<----------- >> >> 2101 case CONN_TIMEDOUT: >> 2102 thread_turbo_flag = 0; >> 2103 is_timedout = 1; >> >> >> 2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) >> 2194 connection_make_readable(conn); >> 2195 pb = NULL; >> >> >> Here CONN_DONE and CONN_TIMEOUT conditions are handled differently. So if it is CONN_DONE, we don't call connection_make_readable() again. >> The rationale behind this is in all occurences of CONN_DONE conditions we call either disconnect_server() or disconnection_server_nomutex(). >> These functions set c_gettingber to 0. So we can skip calling connection_make_readable() (to set c_gettingber to 0) if it is CONN_DONE. >> >> >> >> In addition to above changes, I believe following code segment also needs to be secured similarly. >> >> >> connection.c : connection_threadmain() >> >> >> 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ >> 2198 PR_Lock( conn->c_mutex ); >> 2199 /* if the threadnumber of now below the maximum, wakeup >> 2200 * the listener thread so that we start polling on this >> 2201 * connection again >> 2202 */ >> 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ >> 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) >> 2205 need_wakeup = 1; >> 2206 else >> 2207 need_wakeup = 0; >> 2208 conn->c_threadnumber--; >> 2209 PR_Unlock( conn->c_mutex ); >> 2210 >> 2211 if (need_wakeup) >> 2212 signal_listner(); >> 2213 } >> >> Once the connection object is reallocated to a newer connection, the thread belongs to older connection should not touch >> any variables of the connection structure including c_threadnumber. In the above piece of code c_threadnumber is decremented. >> This could result in unexpected behavior. From the data collected from instrumented image, >> I could see c_threadnumber again becoming zero though a thread is executing an operation belongs to newer connection (c_threadnumber supposed to be 1). >> >> In addition to this, if c_threadnumber is 5(default value), it calls signal_listener(). ( In the abort() scenario I don't think this could happen. >> When the connection object is being reallocated, the value of c_threadnumber is reset to zero. ) >> >> So I believe before executing above pieces of code, we should make sure that the connection object is not released/realloacted to some other connection. > If I understand the problem correctly, it is that we should not touch > the conn object at all after doing the connection_release_nolock() - is > that correct? > > So perhaps the simplest solution will be to move the > connection_release_nolock() to after the code that does the > connection_make_readable() and the threadnumber/wakeup handling? And > also in the doshutdown case - have to release the connection before > doing the return. Might also need a connection_make_readable_nolock() > so that we can reset the gettingber flag and decrement the refcount and > do the threadnumber handling inside the same mutex. > > I think in your proposed solution, the code will not do the > connection_make_readable() until it loops again - this seems like a > minor change in behavior, but the code in connection_threadmain is very > sensitive - small changes can have very large unintended consequences. >> I made changes as mentioned above (is_conn_done) and the problem is no more reproducing. >> But I am not sure will it create any other issues. >> >> >> The other probable fix could be: >> >> 1. Make a note of connection Id (c_connid) and connection object address (conn) before executing any operation. >> 2. Before executing the above mentioned code, check whether c_connid and conn are still same. >> If not, the connection object is already reallocated. So skip the code and goto starting of the while loop in connection_threadmain(). >> >> >> >> connection.c: conntion_threadmain() >> >> 2071 /* Once we're here we have a pb */ >> 2072 conn = pb->pb_conn; >> 2073 op = pb->pb_op; >> connid = conn->c_connid;<--------- >> connaddr = conn;<----------- >> >> >> >> >> >> ----------> if( (connid == conn->c_connid)&& (connaddr == conn ) ) { // do we need to check operation id also? >> >> 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) >> 2194 connection_make_readable(conn); >> } >> 2195 pb = NULL; >> >> >> ----------> if( (connid == conn->c_connid)&& (connaddr == conn ) ) { >> >> 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ >> 2198 PR_Lock( conn->c_mutex ); >> 2199 /* if the threadnumber of now below the maximum, wakeup >> 2200 * the listener thread so that we start polling on this >> 2201 * connection again >> 2202 */ >> 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ >> 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) >> 2205 need_wakeup = 1; >> 2206 else >> 2207 need_wakeup = 0; >> 2208 conn->c_threadnumber--; >> 2209 PR_Unlock( conn->c_mutex ); >> 2210 >> 2211 if (need_wakeup) >> 2212 signal_listner(); >> 2213 } >> >> } >> >> >> Please provide your suggestions on the above fixes. >> >> >> Regards, >> Rushendra >> >> >> >> >> ________________________________________ >> From: Rich Megginson [rmeggins@xxxxxxxxxx] >> Sent: Friday, November 18, 2011 1:54 PM >> To: Reddy, Gandikota Rushendra (ES&N); 389-users@xxxxxxxxxxxxxxxxxxxxxxx >> Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort >> >> On 11/18/2011 05:31 AM, "Reddy, Gandikota Rushendra (ES&N)"<gandikota-rushendra.reddy@xxxxxx><mailto:gandikota-rushendra.reddy@xxxxxx> wrote: >> >> This was a bounce - please check your list membership before posting again. >> >> On a multi CPU system, ns-slapd process aborts. Stack trace says it aborted calling PR_ASSERT() at the following location in connection.c (Release version of 389 dirsrv is 1.2.0). >> >> >> Is there any chance you could attempt to reproduce this issue with the latest 1.2.9.9 (stable) or 1.2.10.a5 (testing)? Or provide a reproducer. 1.2.0 is quite old. >> >> Also, please file a bug at https://bugzilla.redhat.com/enter_bug.cgi?product=389 >> >> >> ----- >> Connection.c >> 1690 int connection_read_operation(Connection *conn, Operation *op, ber_tag_t *tag, int *remaining_data) >> 1691 { >> >> 1734 } >> 1735 /* If we still haven't seen a complete PDU, read from the network */ >> 1736 while (*tag == LBER_DEFAULT) { >> 1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL; >> 1738 /* We should never get here with data remaining in the buffer */ >> 1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT() >> 1740 /* We make a non-blocking read call */ >> ----- >> >> Using instrumented image I have collected data and found that: >> >> While a thread is working on connection's private buffer, another thread which belongs to older connection makes the connection readable again by modifying c_gettingber to zero. >> This results in waking up one more thread. Finally more than one thread works simultaneously on connection's private buffer and results in aborting ns-slapd. >> >> >> It happens as below: >> >> 1. Thead X is working on Connection-1's last opereration. >> >> 2. After performing the LDAP operation ( connection_dispatch_operation() ) decrementes the connection's c_refcnt by calling connection_release_nolock() in connection_threadmain() function. >> >> 3. Since this is the last operation, c_refcnt becomes zero. >> >> 4. Thread X yet to execute the code that updates c_threadnumber and c_gettingber. But as soon as c_refcnt of the connection becomes zero, main thread allocates the same connection object/structure to new incoming connection connection-2. Thread-Y started executing operation belongs to this newer connection (Connection-2). >> >> ----- >> connection.c >> >> 2008 static void >> 2009 connection_threadmain() >> 2010 { >> >> >> 2187 if (!thread_turbo_flag&& !more_data) { >> 2188 connection_release_nolock (conn);<------------------ Thread-X decremented c_refcnt and now c_refcnt = 0 >> 2189 } >> 2190 PR_Unlock( conn->c_mutex ); >> 2191 } >> 2192 /* Since we didn't do so earlier, we need to make a replication connection readable again here */ >> 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data) >> 2194 connection_make_readable(conn); >> 2195 pb = NULL; >> 2196 >> 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ >> 2198 PR_Lock( conn->c_mutex ); >> 2199 /* if the threadnumber of now below the maximum, wakeup >> 2200 * the listener thread so that we start polling on this >> 2201 * connection again >> 2202 */ >> 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ >> 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) >> 2205 need_wakeup = 1; >> 2206 else >> 2207 need_wakeup = 0; >> 2208 conn->c_threadnumber--; >> 2209 PR_Unlock( conn->c_mutex ); >> 2210 >> 2211 if (need_wakeup) >> 2212 signal_listner(); >> 2213 } >> 2214 >> 2215 >> 2216 } /* while (1) */ >> 2217 } >> ---- >> >> 5. Thread-Y started processing operation data belongs to newer connection Connection-2 (connection_read_operation() ). >> >> 6. Now Thread-X of connection-1 called make_connection_readable() and made the connection readable again by setting c_gettingbuffer to 0. >> >> 7. This wakes up another thread of connection-2 and it too starts processing operation data (connection_read_operation() ). >> >> 8. So at this point of time more than one thread is working on connection's private data. >> >> 9. As two threads are working on the connections c_buffer, c_bufferbytes and c_offset, at some time it meats conditions to abort() at the following location: >> >> Connection.c: connection_read_operation() >> >> 1734 } >> 1735 /* If we still haven't seen a complete PDU, read from the network */ >> 1736 while (*tag == LBER_DEFAULT) { >> 1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL; >> 1738 /* We should never get here with data remaining in the buffer */ >> 1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT() >> 1740 /* We make a non-blocking read call */ >> ----- >> >> >> >> The probable fix could be: >> >> In the connection_threadmain() function, if it is CONN_DONE don't make the connection readable again after decrementing the c_refcnt. >> >> So should we handle CONN_DONE and CONN_TIMEDOUT situation differently as below? >> >> >> >> Present code: >> >> 2097 switch (ret) { >> 2098 case CONN_DONE: >> 2099 /* This means that the connection was closed, so clear turbo mode */ >> 2100 /*FALLTHROUGH*/ >> 2101 case CONN_TIMEDOUT: >> 2102 thread_turbo_flag = 0; >> 2103 is_timedout = 1; >> >> >> new code: >> >> >> 2097 switch (ret) { >> 2098 case CONN_DONE: >> 2099 /* This means that the connection was closed, so clear turbo mode */ >> is_conn_done = 0;<----------------- >> thread_turbo_flag = 0;<------------- >> 2101 case CONN_TIMEDOUT: >> 2102 thread_turbo_flag = 0; >> 2103 is_timedout = 1; >> >> >> Note that you don't have to set thread_turbo_flag = 0 in the CONN_DONE case because it will be set when it falls through to the CONN_TIMEDOUT case. >> >> >> >> Present: >> >> 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) >> 2194 connection_make_readable(conn); >> 2195 pb = NULL; >> >> >> >> new: >> >> 2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) >> 2194 connection_make_readable(conn); >> 2195 pb = NULL; >> >> >> >> Is it right way to fix it? Please provide your inputs. >> >> >> If ret is CONN_DONE, but you set is_conn_done = 0, how does the code know the connection is really done? >> >> >> >> >> Thanks& regards, >> Rushendra >> >> >> >> >> >> >> >> >> >> >> >> >> >> -- 389 users mailing list 389-users@xxxxxxxxxxxxxxxxxxxxxxx https://admin.fedoraproject.org/mailman/listinfo/389-users