Re: [PATCH 2/2] libceph: fix handle_timeout() racing with con_work()/try_write()

Sage Weil <sage@xxxxxxxxxxxx> · Tue, 17 May 2011 16:38:15 -0700 (PDT)

On Tue, 17 May 2011, Sage Weil wrote:
> On Tue, 17 May 2011, Jim Schutt wrote:
> > Hi Sage,
> > 
> > Sage Weil wrote:
> > > Hi Jim,
> > 
> > > This picks up the second resent message.  But and OPENING is set.. so how is
> > > it that we were in try_write but we didn't pass through the OPENING test in
> > > con_work?
> > > 
> > > > > > con_work()/try_write() start again on new message.
> > > > > > Note that con_work() did not notice that the connection was marked to
> > > > > > be closed/reopened!  This is where things went bad.
> > > > > > Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
> > > > [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030
> > > > OPENING
> > > > [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on
> > > > ffff8801955c0030 sock ffff8801825a8c00
> > > > [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change
> > > > ffff8801955c0030 state = 2076 sk_state = 4
> > > 
> > > Ok, now we see it, but this should have executed before we ever got to
> > > try_write!
> > > 
> > > There are a handful of places that drop and re-take con->mutex... I think it
> > > has to be one of those.  They are:
> > > 
> > > - prepare_connect_authorizer (before calling ->prepare_authorizer callback)
> > > - fail_protocol (before calling ->bad_proto callback)
> > > - process_connect RESETSESSION case (before calling ->reset callback)
> > > - process_message (before calling ->dispatch callback)
> > > 
> > > I think after retaking the lock in each of these cases we need to test for
> > > the CLOSED or OPENING bits and, if set, bail out so that con_work can
> > > restart.
> > > 
> > > Does that make sense?  Can you confirm which case of lock droppage inside
> > > con_work is at fault?  (Maybe adding a printk at the top of con_work after
> > > taking the lock too would make it more clear what's going on; currently it
> > > only prints in all the random test_bit cases).
> > 
> > So I've got logs now that show that connection state changes
> > across a call to process_message.  That seems like it proves
> > the concept.
> > 
> > > 
> > > I'm thinking something like the below... does that seem reasonable?  I only
> > > did the prepare_authorizer case to start, but the others should be fixed up
> > > too.  Compile tested only.
> > > 
> > 
> > Is there any reason not to fix all the cases at once, in try_read?
> > It's less code, but far away from where the mutex dropping occurs,
> > so we add a comment:
> 
> That covers the process_message and process_connect cases, provided you 
> also set ret properly, and do the con_close_socket work that the con_work 
> checks would have done.  e.g.,

I pushed a patch to the msgr_race branch that catches all four cases (I 
think).  Does the fix make sense given what you saw?

Thanks!
sage

> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index e15a82c..7ef9eb0 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1810,6 +1810,21 @@ static int try_read(struct ceph_connection *con)
>  more:
>         dout("try_read tag %d in_base_pos %d\n", (int)con->in_tag,
>              con->in_base_pos);
> +
> +       /*
> +        * process_connect and process_message drop and re-take
> +        * con->mutex.  make sure we handle a racing close or reopen.
> +        */
> +       if (test_bit(CLOSED, &con->state)) {
> +               ret = -EAGAIN;
> +               goto out;
> +       }
> +       if (test_bit(OPENING, &con->state)) {
> +               con_close_socket(con);
> +               ret = 0;  /* so we continue to try_write */
> +               goto out;
> +       }
> +
>         if (test_bit(CONNECTING, &con->state)) {
>                 if (!test_bit(NEGOTIATING, &con->state)) {
>                         dout("try_read connecting\n");
> 
> It's a little bit unweildly because we repeat the checks at the top of 
> con_work, though.  It is probably cleaner to always fall back to that case 
> by doing CLOSED|OPENING bit checks and bailing with -EAGAIN any time we 
> retake con->mutex, and then checking try_read/write return values like the 
> patch from yesterday does.  That way we only react to the bits in one 
> place.
> 
> Have you verified that this fixes the problem for you?  It makes sense the 
> race is over ->dispatch, since that is a somewhat long-running call.
> 
> sage
> 
> 
> 
> > 
> > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> > index a9bd424..11d6a61 100644
> > --- a/net/ceph/messenger.c
> > +++ b/net/ceph/messenger.c
> > @@ -1816,10 +1820,19 @@ static int try_read(struct ceph_connection *con)
> >  	dout("try_read start on %p state %lu\n", con, con->state);
> > 
> >  more:
> >  	dout("try_read %p state %lu tag %d in_base_pos %d\n", con,
> >  	     con->state, (int)con->in_tag, con->in_base_pos);
> > +
> > +	/* process_connect, process_message need to drop con->mutex,
> > +	 * so connection state might have changed on us.  Check it,
> > +	 * leave if necessary so new state can be processed.
> > +	 */
> > +	if (test_bit(CLOSED, &con->state) ||
> > +	    test_bit(OPENING, &con->state))
> > +		goto out;
> > +
> >  	if (test_bit(CONNECTING, &con->state)) {
> >  		if (!test_bit(NEGOTIATING, &con->state)) {
> >  			dout("try_read %p connecting\n", con);
> >  			ret = read_partial_banner(con);
> >  			if (ret <= 0)
> > 
> > 
> > What am I missing?
> > 
> > -- Jim
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html