On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote: > The code previously was timing out mode if ct->thread was 0 but > ct->running wasn't. This combination never happens. The idea was to > timeout if for some reason the path checker tried to kill the thread, > but it didn't die. The correct thing to check for this is ct- > >holders. > ct->holders will always be at least one when libcheck_check() is > called, > since libcheck_free() won't get called until the device is no longer > being checked. So, if ct->holders is 2, that means that the tur > thread > is has not shut down yet. > > Also, instead of returning PATH_TIMEOUT whenever the thread hasn't > died, > it should only time out if the thread didn't successfully get a > value, > which means the previous state was already PATH_TIMEOUT. What about PATH_UNCHECKED? > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx> > --- > libmultipath/checkers/tur.c | 15 +++++++++------ > 1 file changed, 9 insertions(+), 6 deletions(-) > > diff --git a/libmultipath/checkers/tur.c > b/libmultipath/checkers/tur.c > index bf8486d..275541f 100644 > --- a/libmultipath/checkers/tur.c > +++ b/libmultipath/checkers/tur.c > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c) > } > pthread_mutex_unlock(&ct->lock); > } else { > - if (uatomic_read(&ct->running) != 0) { > - /* pthread cancel failed. continue in sync mode > */ > - pthread_mutex_unlock(&ct->lock); > - condlog(3, "%s: tur thread not responding", > - tur_devt(devt, sizeof(devt), ct)); > - return PATH_TIMEOUT; > + if (uatomic_read(&ct->holders) > 1) { > + /* pthread cancel failed. If it didn't get the > path > + state already, timeout */ > + if (ct->state == PATH_PENDING) { > + pthread_mutex_unlock(&ct->lock); > + condlog(3, "%s: tur thread not > responding", > + tur_devt(devt, sizeof(devt), > ct)); > + return PATH_TIMEOUT; > + } > } I'm not quite getting it yet. We're entering this code path if ct->thread is 0. As you point out, if there are >1 holders, a previously cancelled TUR thread must be still "alive". But now we want to check *again*. The previous code refused to do that - a single hanging TUR thread could block further checks from happening, forever. The PATH_TIMEOUT return code was actually questionable - in this libcheck_check() invocation, we didn't even try to check, so nothing could actually time out (but that's obviously independent of your patch). With your patch, if ct->state != PATH_PENDING, we'd try to start another TUR thread although there's still a hanging one. That's not necessarily wrong, but I fail to see why it depends on ct->state. Anyway, if we do this, we take the risk of starting more and more TUR threads which might all end up hanging; I wonder if we should stop creating more threads if ct->holders grows further (e.g. > 5). Regards, Martin > /* Start new TUR checker */ > ct->state = PATH_UNCHECKED; -- Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel