-----
Connection.c
1690 int connection_read_operation(Connection *conn, Operation *op, ber_tag_t *tag, int *remaining_data)
1691 {
1734 }
1735 /* If we still haven't seen a complete PDU, read from the network */
1736 while (*tag == LBER_DEFAULT) {
1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL;
1738 /* We should never get here with data remaining in the buffer */
1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) ); <------------------ aborted calling PR_ASSERT()
1740 /* We make a non-blocking read call */
-----
Using instrumented image I have collected data and found that:
While a thread is working on connection's private buffer, another thread which belongs to older connection makes the connection readable again by modifying c_gettingber to zero.
This results in waking up one more thread. Finally more than one thread works simultaneously on connection's private buffer and results in aborting ns-slapd.
It happens as below:
1. Thead X is working on Connection-1's last opereration.
2. After performing the LDAP operation ( connection_dispatch_operation() ) decrementes the connection's c_refcnt by calling connection_release_nolock() in connection_threadmain() function.
3. Since this is the last operation, c_refcnt becomes zero.
4. Thread X yet to execute the code that updates c_threadnumber and c_gettingber. But as soon as c_refcnt of the connection becomes zero, main thread allocates the same connection object/structure to new incoming connection connection-2. Thread-Y started executing operation belongs to this newer connection (Connection-2).
-----
connection.c
2008 static void
2009 connection_threadmain()
2010 {
2187 if (!thread_turbo_flag && !more_data) {
2188 connection_release_nolock (conn); <------------------ Thread-X decremented c_refcnt and now c_refcnt = 0
2189 }
2190 PR_Unlock( conn->c_mutex );
2191 }
2192 /* Since we didn't do so earlier, we need to make a replication connection readable again here */
2193 if ( ((1 == is_timedout) || (replication_connection && !thread_turbo_flag)) && !more_data)
2194 connection_make_readable(conn);
2195 pb = NULL;
2196
2197 if (!thread_turbo_flag && !more_data) { /* Don't do this in turbo mode */
2198 PR_Lock( conn->c_mutex );
2199 /* if the threadnumber of now below the maximum, wakeup
2200 * the listener thread so that we start polling on this
2201 * connection again
2202 */
2203 /* DBDB I think this code is bogus -- we already signaled the listener above here */
2204 if (conn->c_threadnumber == config_get_maxthreadsperconn())
2205 need_wakeup = 1;
2206 else
2207 need_wakeup = 0;
2208 conn->c_threadnumber--;
2209 PR_Unlock( conn->c_mutex );
2210
2211 if (need_wakeup)
2212 signal_listner();
2213 }
2214
2215
2216 } /* while (1) */
2217 }
----
5. Thread-Y started processing operation data belongs to newer connection Connection-2 (connection_read_operation() ).
6. Now Thread-X of connection-1 called make_connection_readable() and made the connection readable again by setting c_gettingbuffer to 0.
7. This wakes up another thread of connection-2 and it too starts processing operation data (connection_read_operation() ).
8. So at this point of time more than one thread is working on connection's private data.
9. As two threads are working on the connections c_buffer, c_bufferbytes and c_offset, at some time it meats conditions to abort() at the following location:
Connection.c: connection_read_operation()
1734 }
1735 /* If we still haven't seen a complete PDU, read from the network */
1736 while (*tag == LBER_DEFAULT) {
1737 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL;
1738 /* We should never get here with data remaining in the buffer */
1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) ); <------------------ aborted calling PR_ASSERT()
1740 /* We make a non-blocking read call */
-----
The probable fix could be:
In the connection_threadmain() function, if it is CONN_DONE don't make the connection readable again after decrementing the c_refcnt.
So should we handle CONN_DONE and CONN_TIMEDOUT situation differently as below?
Present code:
2097 switch (ret) {
2098 case CONN_DONE:
2099 /* This means that the connection was closed, so clear turbo mode */
2100 /*FALLTHROUGH*/
2101 case CONN_TIMEDOUT:
2102 thread_turbo_flag = 0;
2103 is_timedout = 1;
new code:
2097 switch (ret) {
2098 case CONN_DONE:
2099 /* This means that the connection was closed, so clear turbo mode */
is_conn_done = 0; <-----------------
thread_turbo_flag = 0; <-------------
2101 case CONN_TIMEDOUT:
2102 thread_turbo_flag = 0;
2103 is_timedout = 1;