On 11/18/2011 01:06 PM, Reddy, Gandikota Rushendra (ES&N) wrote: > Hi Rich, > > Thanks for your reply. > This issue can be reproduced by writing a client application that doesn't call UNBIND before exiting. > Typically the application does 1)BIND 2) Search 3) Search and exit without calling UNBIND. So a BIND, followed by two Searches, followed by exit? Something like ldap_sasl_bind_s() ldap_search_s() ldap_search_s() exit() ? or are you using async search operations? Do you have such a client? > Multiple instances of this application need to be run simultaneously for 12 to 15 hours to reproduce the issue. > > > Thanks for the correction. The probable fix I had mentioned supposed to be as below: > > > is_conn_done = 0;<------------- > 2097 switch (ret) { > 2098 case CONN_DONE: > 2099 /* This means that the connection was closed, so clear turbo mode */ > is_conn_done = 1;<----------------- > thread_turbo_flag = 0;<------------- > break;<----------- > > 2101 case CONN_TIMEDOUT: > 2102 thread_turbo_flag = 0; > 2103 is_timedout = 1; > > > 2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) > 2194 connection_make_readable(conn); > 2195 pb = NULL; > > > Here CONN_DONE and CONN_TIMEOUT conditions are handled differently. So if it is CONN_DONE, we don't call connection_make_readable() again. > The rationale behind this is in all occurences of CONN_DONE conditions we call either disconnect_server() or disconnection_server_nomutex(). > These functions set c_gettingber to 0. So we can skip calling connection_make_readable() (to set c_gettingber to 0) if it is CONN_DONE. > > > > In addition to above changes, I believe following code segment also needs to be secured similarly. > > > connection.c : connection_threadmain() > > > 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ > 2198 PR_Lock( conn->c_mutex ); > 2199 /* if the threadnumber of now below the maximum, wakeup > 2200 * the listener thread so that we start polling on this > 2201 * connection again > 2202 */ > 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ > 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) > 2205 need_wakeup = 1; > 2206 else > 2207 need_wakeup = 0; > 2208 conn->c_threadnumber--; > 2209 PR_Unlock( conn->c_mutex ); > 2210 > 2211 if (need_wakeup) > 2212 signal_listner(); > 2213 } > > Once the connection object is reallocated to a newer connection, the thread belongs to older connection should not touch > any variables of the connection structure including c_threadnumber. In the above piece of code c_threadnumber is decremented. > This could result in unexpected behavior. From the data collected from instrumented image, > I could see c_threadnumber again becoming zero though a thread is executing an operation belongs to newer connection (c_threadnumber supposed to be 1). > > In addition to this, if c_threadnumber is 5(default value), it calls signal_listener(). ( In the abort() scenario I don't think this could happen. > When the connection object is being reallocated, the value of c_threadnumber is reset to zero. ) > > So I believe before executing above pieces of code, we should make sure that the connection object is not released/realloacted to some other connection. If I understand the problem correctly, it is that we should not touch the conn object at all after doing the connection_release_nolock() - is that correct? So perhaps the simplest solution will be to move the connection_release_nolock() to after the code that does the connection_make_readable() and the threadnumber/wakeup handling? And also in the doshutdown case - have to release the connection before doing the return. Might also need a connection_make_readable_nolock() so that we can reset the gettingber flag and decrement the refcount and do the threadnumber handling inside the same mutex. I think in your proposed solution, the code will not do the connection_make_readable() until it loops again - this seems like a minor change in behavior, but the code in connection_threadmain is very sensitive - small changes can have very large unintended consequences. > > I made changes as mentioned above (is_conn_done) and the problem is no more reproducing. > But I am not sure will it create any other issues. > > > The other probable fix could be: > > 1. Make a note of connection Id (c_connid) and connection object address (conn) before executing any operation. > 2. Before executing the above mentioned code, check whether c_connid and conn are still same. > If not, the connection object is already reallocated. So skip the code and goto starting of the while loop in connection_threadmain(). > > > > connection.c: conntion_threadmain() > > 2071 /* Once we're here we have a pb */ > 2072 conn = pb->pb_conn; > 2073 op = pb->pb_op; > connid = conn->c_connid;<--------- > connaddr = conn;<----------- > > > > > > ----------> if( (connid == conn->c_connid)&& (connaddr == conn ) ) { // do we need to check operation id also? > > 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) > 2194 connection_make_readable(conn); > } > 2195 pb = NULL; > > > ----------> if( (connid == conn->c_connid)&& (connaddr == conn ) ) { > > 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ > 2198 PR_Lock( conn->c_mutex ); > 2199 /* if the threadnumber of now below the maximum, wakeup > 2200 * the listener thread so that we start polling on this > 2201 * connection again > 2202 */ > 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ > 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) > 2205 need_wakeup = 1; > 2206 else > 2207 need_wakeup = 0; > 2208 conn->c_threadnumber--; > 2209 PR_Unlock( conn->c_mutex ); > 2210 > 2211 if (need_wakeup) > 2212 signal_listner(); > 2213 } > > } > > > Please provide your suggestions on the above fixes. > > > Regards, > Rushendra > > > > > ________________________________________ > From: Rich Megginson [rmeggins@xxxxxxxxxx] > Sent: Friday, November 18, 2011 1:54 PM > To: Reddy, Gandikota Rushendra (ES&N); 389-users@xxxxxxxxxxxxxxxxxxxxxxx > Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort > > On 11/18/2011 05:31 AM, "Reddy, Gandikota Rushendra (ES&N)"<gandikota-rushendra.reddy@xxxxxx><mailto:gandikota-rushendra.reddy@xxxxxx> wrote: > > This was a bounce - please check your list membership before posting again. > > On a multi CPU system, ns-slapd process aborts. Stack trace says it aborted calling PR_ASSERT() at the following location in connection.c (Release version of 389 dirsrv is 1.2.0). > > > Is there any chance you could attempt to reproduce this issue with the latest 1.2.9.9 (stable) or 1.2.10.a5 (testing)? Or provide a reproducer. 1.2.0 is quite old. > > Also, please file a bug at https://bugzilla.redhat.com/enter_bug.cgi?product=389 > > > ----- > Connection.c > 1690 int connection_read_operation(Connection *conn, Operation *op, ber_tag_t *tag, int *remaining_data) > 1691 { > > 1734 } > 1735 /* If we still haven't seen a complete PDU, read from the network */ > 1736 while (*tag == LBER_DEFAULT) { > 1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL; > 1738 /* We should never get here with data remaining in the buffer */ > 1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT() > 1740 /* We make a non-blocking read call */ > ----- > > Using instrumented image I have collected data and found that: > > While a thread is working on connection's private buffer, another thread which belongs to older connection makes the connection readable again by modifying c_gettingber to zero. > This results in waking up one more thread. Finally more than one thread works simultaneously on connection's private buffer and results in aborting ns-slapd. > > > It happens as below: > > 1. Thead X is working on Connection-1's last opereration. > > 2. After performing the LDAP operation ( connection_dispatch_operation() ) decrementes the connection's c_refcnt by calling connection_release_nolock() in connection_threadmain() function. > > 3. Since this is the last operation, c_refcnt becomes zero. > > 4. Thread X yet to execute the code that updates c_threadnumber and c_gettingber. But as soon as c_refcnt of the connection becomes zero, main thread allocates the same connection object/structure to new incoming connection connection-2. Thread-Y started executing operation belongs to this newer connection (Connection-2). > > ----- > connection.c > > 2008 static void > 2009 connection_threadmain() > 2010 { > > > 2187 if (!thread_turbo_flag&& !more_data) { > 2188 connection_release_nolock (conn);<------------------ Thread-X decremented c_refcnt and now c_refcnt = 0 > 2189 } > 2190 PR_Unlock( conn->c_mutex ); > 2191 } > 2192 /* Since we didn't do so earlier, we need to make a replication connection readable again here */ > 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data) > 2194 connection_make_readable(conn); > 2195 pb = NULL; > 2196 > 2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this in turbo mode */ > 2198 PR_Lock( conn->c_mutex ); > 2199 /* if the threadnumber of now below the maximum, wakeup > 2200 * the listener thread so that we start polling on this > 2201 * connection again > 2202 */ > 2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */ > 2204 if (conn->c_threadnumber == config_get_maxthreadsperconn()) > 2205 need_wakeup = 1; > 2206 else > 2207 need_wakeup = 0; > 2208 conn->c_threadnumber--; > 2209 PR_Unlock( conn->c_mutex ); > 2210 > 2211 if (need_wakeup) > 2212 signal_listner(); > 2213 } > 2214 > 2215 > 2216 } /* while (1) */ > 2217 } > ---- > > 5. Thread-Y started processing operation data belongs to newer connection Connection-2 (connection_read_operation() ). > > 6. Now Thread-X of connection-1 called make_connection_readable() and made the connection readable again by setting c_gettingbuffer to 0. > > 7. This wakes up another thread of connection-2 and it too starts processing operation data (connection_read_operation() ). > > 8. So at this point of time more than one thread is working on connection's private data. > > 9. As two threads are working on the connections c_buffer, c_bufferbytes and c_offset, at some time it meats conditions to abort() at the following location: > > Connection.c: connection_read_operation() > > 1734 } > 1735 /* If we still haven't seen a complete PDU, read from the network */ > 1736 while (*tag == LBER_DEFAULT) { > 1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL; > 1738 /* We should never get here with data remaining in the buffer */ > 1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT() > 1740 /* We make a non-blocking read call */ > ----- > > > > The probable fix could be: > > In the connection_threadmain() function, if it is CONN_DONE don't make the connection readable again after decrementing the c_refcnt. > > So should we handle CONN_DONE and CONN_TIMEDOUT situation differently as below? > > > > Present code: > > 2097 switch (ret) { > 2098 case CONN_DONE: > 2099 /* This means that the connection was closed, so clear turbo mode */ > 2100 /*FALLTHROUGH*/ > 2101 case CONN_TIMEDOUT: > 2102 thread_turbo_flag = 0; > 2103 is_timedout = 1; > > > new code: > > > 2097 switch (ret) { > 2098 case CONN_DONE: > 2099 /* This means that the connection was closed, so clear turbo mode */ > is_conn_done = 0;<----------------- > thread_turbo_flag = 0;<------------- > 2101 case CONN_TIMEDOUT: > 2102 thread_turbo_flag = 0; > 2103 is_timedout = 1; > > > Note that you don't have to set thread_turbo_flag = 0 in the CONN_DONE case because it will be set when it falls through to the CONN_TIMEDOUT case. > > > > Present: > > 2193 if ( ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) > 2194 connection_make_readable(conn); > 2195 pb = NULL; > > > > new: > > 2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) || (replication_connection&& !thread_turbo_flag))&& !more_data ) > 2194 connection_make_readable(conn); > 2195 pb = NULL; > > > > Is it right way to fix it? Please provide your inputs. > > > If ret is CONN_DONE, but you set is_conn_done = 0, how does the code know the connection is really done? > > > > > Thanks& regards, > Rushendra > > > > > > > > > > > > > > -- 389 users mailing list 389-users@xxxxxxxxxxxxxxxxxxxxxxx https://admin.fedoraproject.org/mailman/listinfo/389-users