Re: [PATCH v3 01/19] libmultipath: fix tur checker timeout

Martin Wilck <mwilck@xxxxxxxx> · Fri, 05 Oct 2018 12:11:18 +0200

On Thu, 2018-10-04 at 11:31 -0500,  Benjamin Marzinski  wrote:
> On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote:
> > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote:
> > > The code previously was timing out mode if ct->thread was 0 but
> > > ct->running wasn't. This combination never happens.  The idea was
> > > to
> > > timeout if for some reason the path checker tried to kill the
> > > thread,
> > > but it didn't die.  The correct thing to check for this is ct-
> > > > holders.
> > > 
> > > ct->holders will always be at least one when libcheck_check() is
> > > called,
> > > since libcheck_free() won't get called until the device is no
> > > longer
> > > being checked. So, if ct->holders is 2, that means that the tur
> > > thread
> > > is has not shut down yet.
> > > 
> > > Also, instead of returning PATH_TIMEOUT whenever the thread
> > > hasn't
> > > died,
> > > it should only time out if the thread didn't successfully get a
> > > value,
> > > which means the previous state was already PATH_TIMEOUT.
> > 
> > What about PATH_UNCHECKED?
> > 
> 
> I don't see how that could happen. In this state we've definitely set
> ct->state to PATH_PENDING when we started the thread. The thread will
> only change  ct->state to the return of tur_check(), which doesn't
> ever return PATH_UNCHECKED. Am I missing something here?

OK. The only thing you missed was a comment :-)

This stuff is so subtle that we should describe exactly how
the locking works, which actor is supposed/entitled to set what field
to what value in what situation, and so on.

> 
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > > ---
> > >  libmultipath/checkers/tur.c | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/libmultipath/checkers/tur.c
> > > b/libmultipath/checkers/tur.c
> > > index bf8486d..275541f 100644
> > > --- a/libmultipath/checkers/tur.c
> > > +++ b/libmultipath/checkers/tur.c
> > > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c)
> > >  		}
> > >  		pthread_mutex_unlock(&ct->lock);
> > >  	} else {
> > > -		if (uatomic_read(&ct->running) != 0) {
> > > -			/* pthread cancel failed. continue in sync mode
> > > */
> > > -			pthread_mutex_unlock(&ct->lock);
> > > -			condlog(3, "%s: tur thread not responding",
> > > -				tur_devt(devt, sizeof(devt), ct));
> > > -			return PATH_TIMEOUT;
> > > +		if (uatomic_read(&ct->holders) > 1) {
> > > +			/* pthread cancel failed. If it didn't get the
> > > path
> > > +			   state already, timeout */
> > > +			if (ct->state == PATH_PENDING) {
> > > +				pthread_mutex_unlock(&ct->lock);
> > > +				condlog(3, "%s: tur thread not
> > > responding",
> > > +					tur_devt(devt, sizeof(devt),
> > > ct));
> > > +				return PATH_TIMEOUT;
> > > +			}
> > >  		}
> > 
> > I'm not quite getting it yet. We're entering this code path if 
> > ct->thread is 0. As you point out, if there are >1 holders, a
> > previously cancelled TUR thread must be still "alive". But now we
> > want
> > to check *again*. The previous code refused to do that - a single
> > hanging TUR thread could block further checks from happening,
> > forever.
> > The PATH_TIMEOUT return code was actually questionable - in this
> > libcheck_check() invocation, we didn't even try to check, so
> > nothing
> > could actually time out (but that's obviously independent of your
> > patch).
> > 
> > With your patch, if ct->state != PATH_PENDING, we'd try to start
> > another TUR thread although there's still a hanging one. That's not
> > necessarily wrong, but I fail to see why it depends on ct->state.
> > Anyway, if we do this, we take the risk of starting more and more
> > TUR
> > threads which might all end up hanging; I wonder if we should stop
> > creating more threads if ct->holders grows further (e.g. > 5).
> 
> It's been a while, and I'm not exactly sure what I was thinking here.
> But IIRC the idea was that if the state isn't set yet, then the old
> thread could mess with the results of a future thread. Also, it was
> to
> avoid a corner case where the tur checker was called, started the
> thread, got the result before the checker exitted, cancelled the
> thread
> and returned the result and then was called very shortly afterwards,
> before the thread could clean itself up.  In that case, the right
> thing
> to do (I thought) would be to start a new thread, because the other
> one
> would be ending soon.  In truth, starting a new thread at all is
> probably a bad idea, since the old thread will still mess with the
> checker context on exit. 
> 
> A better idea might be to simply fail back to a syncronous call to
> tur_check() when you notice a cancelled thread that hasn't exitted.
> That can cause all the other checkers to get delayed, but at least in
> that case you are still checking paths. The other option is to do as
> before and just not check this path, which won't effect the other
> checkers. It seems very unlikely that the thread could remain
> uncancelled forever, especially if the path was usable.
>
> thoughts?

Generally speaking, we're deeply in the realm of highly unlikely
situations I would say. But we should get it right eventually.

Maybe we can add logic to the tur thread to keep its hands off the
context if it's a "zombie", like below (just a thought, untested)?

Martin

diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index bf8486d3..e8493ca8 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -219,11 +219,18 @@ static void cleanup_func(void *data)
 	rcu_unregister_thread();
 }
 
+#define tur_thread_quit_unless_owner(__ct)	   \
+	if (__ct->thread != pthread_self()) {	   \
+		pthread_mutex_unlock(&__ct->lock); \
+		pthread_exit(NULL);		   \
+	}
+
 static void copy_msg_to_tcc(void *ct_p, const char *msg)
 {
 	struct tur_checker_context *ct = ct_p;
 
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	strlcpy(ct->message, msg, sizeof(ct->message));
 	pthread_mutex_unlock(&ct->lock);
 }
@@ -243,6 +250,7 @@ static void *tur_thread(void *ctx)
 
 	/* TUR checker start up */
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	ct->state = PATH_PENDING;
 	ct->message[0] = '\0';
 	pthread_mutex_unlock(&ct->lock);
@@ -252,6 +260,7 @@ static void *tur_thread(void *ctx)
 
 	/* TUR checker done */
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	ct->state = state;
 	pthread_cond_signal(&ct->active);
 	pthread_mutex_unlock(&ct->lock);




-- 
Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107
SUSELinux GmbH, GF: Felix Imendörffer, Jane Smithard,  Graham Norton
HRB 21284 (AG Nürnberg)


--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel