On Thu, 2018-10-04 at 11:31 -0500, Benjamin Marzinski wrote: > On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote: > > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote: > > > The code previously was timing out mode if ct->thread was 0 but > > > ct->running wasn't. This combination never happens. The idea was > > > to > > > timeout if for some reason the path checker tried to kill the > > > thread, > > > but it didn't die. The correct thing to check for this is ct- > > > > holders. > > > > > > ct->holders will always be at least one when libcheck_check() is > > > called, > > > since libcheck_free() won't get called until the device is no > > > longer > > > being checked. So, if ct->holders is 2, that means that the tur > > > thread > > > is has not shut down yet. > > > > > > Also, instead of returning PATH_TIMEOUT whenever the thread > > > hasn't > > > died, > > > it should only time out if the thread didn't successfully get a > > > value, > > > which means the previous state was already PATH_TIMEOUT. > > > > What about PATH_UNCHECKED? > > > > I don't see how that could happen. In this state we've definitely set > ct->state to PATH_PENDING when we started the thread. The thread will > only change ct->state to the return of tur_check(), which doesn't > ever return PATH_UNCHECKED. Am I missing something here? OK. The only thing you missed was a comment :-) This stuff is so subtle that we should describe exactly how the locking works, which actor is supposed/entitled to set what field to what value in what situation, and so on. > > > > > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx> > > > --- > > > libmultipath/checkers/tur.c | 15 +++++++++------ > > > 1 file changed, 9 insertions(+), 6 deletions(-) > > > > > > diff --git a/libmultipath/checkers/tur.c > > > b/libmultipath/checkers/tur.c > > > index bf8486d..275541f 100644 > > > --- a/libmultipath/checkers/tur.c > > > +++ b/libmultipath/checkers/tur.c > > > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c) > > > } > > > pthread_mutex_unlock(&ct->lock); > > > } else { > > > - if (uatomic_read(&ct->running) != 0) { > > > - /* pthread cancel failed. continue in sync mode > > > */ > > > - pthread_mutex_unlock(&ct->lock); > > > - condlog(3, "%s: tur thread not responding", > > > - tur_devt(devt, sizeof(devt), ct)); > > > - return PATH_TIMEOUT; > > > + if (uatomic_read(&ct->holders) > 1) { > > > + /* pthread cancel failed. If it didn't get the > > > path > > > + state already, timeout */ > > > + if (ct->state == PATH_PENDING) { > > > + pthread_mutex_unlock(&ct->lock); > > > + condlog(3, "%s: tur thread not > > > responding", > > > + tur_devt(devt, sizeof(devt), > > > ct)); > > > + return PATH_TIMEOUT; > > > + } > > > } > > > > I'm not quite getting it yet. We're entering this code path if > > ct->thread is 0. As you point out, if there are >1 holders, a > > previously cancelled TUR thread must be still "alive". But now we > > want > > to check *again*. The previous code refused to do that - a single > > hanging TUR thread could block further checks from happening, > > forever. > > The PATH_TIMEOUT return code was actually questionable - in this > > libcheck_check() invocation, we didn't even try to check, so > > nothing > > could actually time out (but that's obviously independent of your > > patch). > > > > With your patch, if ct->state != PATH_PENDING, we'd try to start > > another TUR thread although there's still a hanging one. That's not > > necessarily wrong, but I fail to see why it depends on ct->state. > > Anyway, if we do this, we take the risk of starting more and more > > TUR > > threads which might all end up hanging; I wonder if we should stop > > creating more threads if ct->holders grows further (e.g. > 5). > > It's been a while, and I'm not exactly sure what I was thinking here. > But IIRC the idea was that if the state isn't set yet, then the old > thread could mess with the results of a future thread. Also, it was > to > avoid a corner case where the tur checker was called, started the > thread, got the result before the checker exitted, cancelled the > thread > and returned the result and then was called very shortly afterwards, > before the thread could clean itself up. In that case, the right > thing > to do (I thought) would be to start a new thread, because the other > one > would be ending soon. In truth, starting a new thread at all is > probably a bad idea, since the old thread will still mess with the > checker context on exit. > > A better idea might be to simply fail back to a syncronous call to > tur_check() when you notice a cancelled thread that hasn't exitted. > That can cause all the other checkers to get delayed, but at least in > that case you are still checking paths. The other option is to do as > before and just not check this path, which won't effect the other > checkers. It seems very unlikely that the thread could remain > uncancelled forever, especially if the path was usable. > > thoughts? Generally speaking, we're deeply in the realm of highly unlikely situations I would say. But we should get it right eventually. Maybe we can add logic to the tur thread to keep its hands off the context if it's a "zombie", like below (just a thought, untested)? Martin diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c index bf8486d3..e8493ca8 100644 --- a/libmultipath/checkers/tur.c +++ b/libmultipath/checkers/tur.c @@ -219,11 +219,18 @@ static void cleanup_func(void *data) rcu_unregister_thread(); } +#define tur_thread_quit_unless_owner(__ct) \ + if (__ct->thread != pthread_self()) { \ + pthread_mutex_unlock(&__ct->lock); \ + pthread_exit(NULL); \ + } + static void copy_msg_to_tcc(void *ct_p, const char *msg) { struct tur_checker_context *ct = ct_p; pthread_mutex_lock(&ct->lock); + tur_thread_quit_unless_owner(ct); strlcpy(ct->message, msg, sizeof(ct->message)); pthread_mutex_unlock(&ct->lock); } @@ -243,6 +250,7 @@ static void *tur_thread(void *ctx) /* TUR checker start up */ pthread_mutex_lock(&ct->lock); + tur_thread_quit_unless_owner(ct); ct->state = PATH_PENDING; ct->message[0] = '\0'; pthread_mutex_unlock(&ct->lock); @@ -252,6 +260,7 @@ static void *tur_thread(void *ctx) /* TUR checker done */ pthread_mutex_lock(&ct->lock); + tur_thread_quit_unless_owner(ct); ct->state = state; pthread_cond_signal(&ct->active); pthread_mutex_unlock(&ct->lock); -- Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107 SUSELinux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel