Re: [389-users] Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort

Rich Megginson <rmeggins@xxxxxxxxxx> · Fri, 18 Nov 2011 16:34:08 -0700

On 11/18/2011 01:06 PM, Reddy, Gandikota Rushendra (ES&N) wrote:
> Hi Rich,
>
> Thanks for your reply.
> This issue can be reproduced by writing a client application that doesn't call UNBIND before exiting.
> Typically the application does 1)BIND 2) Search 3) Search and exit without calling UNBIND.
So a BIND, followed by two Searches, followed by exit?
Something like
ldap_sasl_bind_s()
ldap_search_s()
ldap_search_s()
exit()
?
or are you using async search operations?

Do you have such a client?
> Multiple instances of this application need to be run simultaneously for 12 to 15 hours to reproduce the issue.
>
>
> Thanks for the correction. The probable fix I had mentioned supposed to be as below:
>
>
>                       is_conn_done = 0;<-------------
> 2097                 switch (ret) {
> 2098                         case CONN_DONE:
> 2099                                 /* This means that the connection was closed, so clear turbo mode */
>                                       is_conn_done = 1;<-----------------
>                                       thread_turbo_flag = 0;<-------------
>                                       break;<-----------
>
> 2101                         case CONN_TIMEDOUT:
> 2102                                 thread_turbo_flag = 0;
> 2103                                 is_timedout = 1;
>
>
> 2193    --------->   if ( !is_conn_done&&  ((1 == is_timedout) || (replication_connection&&  !thread_turbo_flag))&&  !more_data )
> 2194                         connection_make_readable(conn);
> 2195                 pb = NULL;
>
>
> Here CONN_DONE and CONN_TIMEOUT conditions are handled differently. So if it is CONN_DONE, we don't call connection_make_readable() again.
> The rationale behind this is in all occurences of CONN_DONE conditions we call either disconnect_server() or disconnection_server_nomutex().
> These functions set c_gettingber to 0. So we can skip calling connection_make_readable() (to set c_gettingber to 0) if it is CONN_DONE.
>
>
>
> In addition to above changes, I believe following code segment also needs to be secured similarly.
>
>
> connection.c : connection_threadmain()
>
>
> 2197                 if (!thread_turbo_flag&&  !more_data) { /* Don't do this in turbo mode */
> 2198                         PR_Lock( conn->c_mutex );
> 2199                         /* if the threadnumber of now below the maximum, wakeup
> 2200                          * the listener thread so that we start polling on this
> 2201                          * connection again
> 2202                          */
> 2203                         /* DBDB I think this code is bogus -- we already signaled the listener above here */
> 2204                         if (conn->c_threadnumber == config_get_maxthreadsperconn())
> 2205                                 need_wakeup = 1;
> 2206                         else
> 2207                                 need_wakeup = 0;
> 2208                         conn->c_threadnumber--;
> 2209                         PR_Unlock( conn->c_mutex );
> 2210
> 2211                         if (need_wakeup)
> 2212                                 signal_listner();
> 2213                 }
>
> Once the connection object is reallocated to a newer connection, the thread belongs to older connection should not touch
> any variables of the connection structure including c_threadnumber. In the above piece of code c_threadnumber is decremented.
> This could result in unexpected behavior. From the data collected from instrumented image,
> I could see c_threadnumber again becoming zero though a thread is executing an operation belongs to newer connection (c_threadnumber supposed to be 1).
>
> In addition to this, if c_threadnumber is 5(default value), it calls signal_listener(). ( In the abort() scenario I don't think this could happen.
> When the connection object is being reallocated, the value of c_threadnumber is reset to zero. )
>
> So I believe before executing above pieces of code, we should make sure that the connection object is not released/realloacted to some other connection.
If I understand the problem correctly, it is that we should not touch 
the conn object at all after doing the connection_release_nolock() - is 
that correct?

So perhaps the simplest solution will be to move the 
connection_release_nolock() to after the code that does the 
connection_make_readable() and the threadnumber/wakeup handling?  And 
also in the doshutdown case - have to release the connection before 
doing the return.  Might also need a connection_make_readable_nolock() 
so that we can reset the gettingber flag and decrement the refcount and 
do the threadnumber handling inside the same mutex.

I think in your proposed solution, the code will not do the 
connection_make_readable() until it loops again - this seems like a 
minor change in behavior, but the code in connection_threadmain is very 
sensitive - small changes can have very large unintended consequences.
>
> I made changes as mentioned above (is_conn_done) and the problem is no more reproducing.
> But I am not sure will it create any other issues.
>
>
> The other probable fix could be:
>
>     1. Make a note of connection Id (c_connid) and connection object address (conn) before executing any operation.
>     2. Before executing the above mentioned code, check whether c_connid and conn are still same.
>        If not, the connection object is already reallocated. So skip the code and goto starting of the while loop in connection_threadmain().
>
>
>
> connection.c:  conntion_threadmain()
>
> 2071                 /* Once we're here we have a pb */
> 2072                 conn = pb->pb_conn;
> 2073                 op = pb->pb_op;
> 		     connid = conn->c_connid;<---------
>                       connaddr = conn;<-----------
>
>
>
>
>
> ---------->    if( (connid == conn->c_connid)&&  (connaddr == conn ) ) { // do we need to check operation id also?
>
> 2193            if ( ((1 == is_timedout) || (replication_connection&&  !thread_turbo_flag))&&  !more_data )
> 2194                         connection_make_readable(conn);
>                }
> 2195                 pb = NULL;
>
>
> ---------->    if( (connid == conn->c_connid)&&  (connaddr == conn ) ) {
>
> 2197                 if (!thread_turbo_flag&&  !more_data) { /* Don't do this in turbo mode */
> 2198                         PR_Lock( conn->c_mutex );
> 2199                         /* if the threadnumber of now below the maximum, wakeup
> 2200                          * the listener thread so that we start polling on this
> 2201                          * connection again
> 2202                          */
> 2203                         /* DBDB I think this code is bogus -- we already signaled the listener above here */
> 2204                         if (conn->c_threadnumber == config_get_maxthreadsperconn())
> 2205                                 need_wakeup = 1;
> 2206                         else
> 2207                                 need_wakeup = 0;
> 2208                         conn->c_threadnumber--;
> 2209                         PR_Unlock( conn->c_mutex );
> 2210
> 2211                         if (need_wakeup)
> 2212                                 signal_listner();
> 2213                 }
>
>                }
>
>
> Please provide your suggestions on the above fixes.
>
>
> Regards,
> Rushendra
>
>
>
>
> ________________________________________
> From: Rich Megginson [rmeggins@xxxxxxxxxx]
> Sent: Friday, November 18, 2011 1:54 PM
> To: Reddy, Gandikota Rushendra (ES&N); 389-users@xxxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: Multiple threads simultaneously working on connection's private buffer causes ns-slapd to abort
>
> On 11/18/2011 05:31 AM, "Reddy, Gandikota Rushendra (ES&N)"<gandikota-rushendra.reddy@xxxxxx><mailto:gandikota-rushendra.reddy@xxxxxx>  wrote:
>
> This was a bounce - please check your list membership before posting again.
>
> On a multi CPU system, ns-slapd process aborts. Stack trace says it aborted calling PR_ASSERT() at the following location in connection.c (Release version of 389 dirsrv is 1.2.0).
>
>
> Is there any chance you could attempt to reproduce this issue with the latest 1.2.9.9 (stable) or 1.2.10.a5 (testing)? Or provide a reproducer.  1.2.0 is quite old.
>
> Also, please file a bug at https://bugzilla.redhat.com/enter_bug.cgi?product=389
>
>
> -----
> Connection.c
> 1690 int connection_read_operation(Connection *conn, Operation *op, ber_tag_t *tag, int *remaining_data)
> 1691 {
>
> 1734         }
> 1735         /* If we still haven't seen a complete PDU, read from the network */
> 1736         while (*tag == LBER_DEFAULT) {
> 1737                 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL;
> 1738                 /* We should never get here with data remaining in the buffer */
> 1739                 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT()
> 1740                 /* We make a non-blocking read call */
> -----
>
> Using instrumented image I have collected data and found that:
>
> While a thread is working on connection's private buffer, another thread which belongs to older connection makes the connection readable again by modifying c_gettingber to zero.
> This results in waking up one more thread. Finally more than one thread works simultaneously on connection's private buffer and results in aborting ns-slapd.
>
>
> It happens as below:
>
> 1. Thead X is working on Connection-1's last opereration.
>
> 2. After performing the LDAP operation ( connection_dispatch_operation() ) decrementes the connection's c_refcnt  by calling connection_release_nolock() in connection_threadmain() function.
>
> 3. Since this is the last operation, c_refcnt becomes zero.
>
> 4. Thread X yet to execute the code that updates c_threadnumber and  c_gettingber. But as soon as c_refcnt of the connection becomes zero, main thread allocates the same connection object/structure to new incoming connection connection-2.  Thread-Y started executing operation belongs to this newer connection (Connection-2).
>
> -----
> connection.c
>
> 2008 static void
> 2009 connection_threadmain()
> 2010 {
>
>
> 2187                     if (!thread_turbo_flag&&  !more_data) {
> 2188                                 connection_release_nolock (conn);<------------------  Thread-X decremented c_refcnt and now c_refcnt = 0
> 2189                         }
> 2190                     PR_Unlock( conn->c_mutex );
> 2191                 }
> 2192                 /* Since we didn't do so earlier, we need to make a replication connection readable again here */
> 2193                 if ( ((1 == is_timedout) || (replication_connection&&  !thread_turbo_flag))&&  !more_data)
> 2194                         connection_make_readable(conn);
> 2195                 pb = NULL;
> 2196
> 2197                 if (!thread_turbo_flag&&  !more_data) { /* Don't do this in turbo mode */
> 2198                         PR_Lock( conn->c_mutex );
> 2199                         /* if the threadnumber of now below the maximum, wakeup
> 2200                          * the listener thread so that we start polling on this
> 2201                          * connection again
> 2202                          */
> 2203                         /* DBDB I think this code is bogus -- we already signaled the listener above here */
> 2204                         if (conn->c_threadnumber == config_get_maxthreadsperconn())
> 2205                                 need_wakeup = 1;
> 2206                         else
> 2207                                 need_wakeup = 0;
> 2208                         conn->c_threadnumber--;
> 2209                         PR_Unlock( conn->c_mutex );
> 2210
> 2211                         if (need_wakeup)
> 2212                                 signal_listner();
> 2213                 }
> 2214
> 2215
> 2216         } /* while (1) */
> 2217 }
> ----
>
> 5. Thread-Y started processing operation data belongs to newer connection Connection-2  (connection_read_operation() ).
>
> 6. Now Thread-X of connection-1 called make_connection_readable() and made the connection readable again by setting c_gettingbuffer to 0.
>
> 7. This wakes up another thread of connection-2 and it too starts processing operation data (connection_read_operation() ).
>
> 8. So at this point of time more than one thread is working on connection's private data.
>
> 9. As two threads are working on the connections c_buffer, c_bufferbytes and c_offset, at some time it meats conditions to abort() at the following location:
>
> Connection.c:  connection_read_operation()
>
> 1734         }
> 1735         /* If we still haven't seen a complete PDU, read from the network */
> 1736         while (*tag == LBER_DEFAULT) {
> 1737                 int ioblocktimeout_waits = config_get_ioblocktimeout() / CONN_TURBO_TIMEOUT_INTERVAL;
> 1738                 /* We should never get here with data remaining in the buffer */
> 1739                 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes - conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT()
> 1740                 /* We make a non-blocking read call */
> -----
>
>
>
> The probable fix could be:
>
> In the connection_threadmain() function, if it is CONN_DONE don't make the connection readable again after decrementing the c_refcnt.
>
> So should we handle CONN_DONE and CONN_TIMEDOUT situation differently as below?
>
>
>
> Present code:
>
> 2097                 switch (ret) {
> 2098                         case CONN_DONE:
> 2099                                 /* This means that the connection was closed, so clear turbo mode */
> 2100                         /*FALLTHROUGH*/
> 2101                         case CONN_TIMEDOUT:
> 2102                                 thread_turbo_flag = 0;
> 2103                                 is_timedout = 1;
>
>
> new code:
>
>
> 2097                 switch (ret) {
> 2098                         case CONN_DONE:
> 2099                                 /* This means that the connection was closed, so clear turbo mode */
>                                       is_conn_done = 0;<-----------------
>                                       thread_turbo_flag = 0;<-------------
> 2101                         case CONN_TIMEDOUT:
> 2102                                 thread_turbo_flag = 0;
> 2103                                 is_timedout = 1;
>
>
> Note that you don't have to set thread_turbo_flag = 0 in the CONN_DONE case because it will be set when it falls through to the CONN_TIMEDOUT case.
>
>
>
> Present:
>
> 2193                 if ( ((1 == is_timedout) || (replication_connection&&  !thread_turbo_flag))&&  !more_data )
> 2194                         connection_make_readable(conn);
> 2195                 pb = NULL;
>
>
>
> new:
>
> 2193    --------->   if ( !is_conn_done&&  ((1 == is_timedout) || (replication_connection&&  !thread_turbo_flag))&&  !more_data )
> 2194                         connection_make_readable(conn);
> 2195                 pb = NULL;
>
>
>
> Is it right way to fix it?  Please provide your inputs.
>
>
> If ret is CONN_DONE, but you set is_conn_done = 0, how does the code know the connection is really done?
>
>
>
>
> Thanks&  regards,
> Rushendra
>
>
>
>
>
>
>
>
>
>
>
>
>
>

--
389 users mailing list
389-users@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/389-users