On Wed, Mar 08, 2023 at 06:17:05PM +0000, Martin Wilck wrote: > On Tue, 2023-03-07 at 16:49 -0600, Benjamin Marzinski wrote: > > If a the tur checker creates a new context because an old thread is > > still running, but the old thread finishes before the checker drops > > the old context, the checker should reset nr_timeouts to 0, since > > the old thread did complete in time. > > This looks strange to me. First of all, the old thread _did_ timeout, > otherwise we wouldn't be in this code path at all. And even if you say > this timeout shouldn't count, as the thread isn't in stalled state any > more, shouldn't you just decrease nr_timeouts by 1? The fact that > _this_ thread terminated doesn't tell us anything about other hanging > threads. We can't track the old threads once we drop the old context. So instead, we just assume that if the last thread we created did complete we're back to a normal state, and reset nr_timeouts to 0. This change is just extending that same logic to the case where the old thread didn't complete until the last possible moment where we could find that out. The MAX_NR_TIMEOUTS code was added because a user ran into a situation where a scsi kernel issue made TUR commands block uninterruptibly forever, and multipathd kept making more threads, tens of thousnds of them. The goal was to stop creating threads if two in a row haven't completed, and wait for the last thread to complete. It's not the most robust solution possible, but I think it's good enough. We could revist the whole solution, but the way things are, nr_timeouts = 1 was supposed to mean that, as far as we know, the last thread was cancelled but didn't complete, and we are going to try one more thread before stopping and waiting. I'm just fixing to code to take care of the corner case where we find out that the last thread did complete when we drop the old context. -Ben > Martin > > > > > > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx> > > --- > > libmultipath/checkers/tur.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/libmultipath/checkers/tur.c > > b/libmultipath/checkers/tur.c > > index a19becf5..fe6a2f14 100644 > > --- a/libmultipath/checkers/tur.c > > +++ b/libmultipath/checkers/tur.c > > @@ -406,9 +406,11 @@ int libcheck_check(struct checker * c) > > } > > ((struct tur_checker_context *)c->context)- > > >nr_timeouts = ct->nr_timeouts; > > > > - if (!uatomic_sub_return(&ct->holders, 1)) > > + if (!uatomic_sub_return(&ct->holders, 1)) { > > /* It did terminate, eventually */ > > cleanup_context(ct); > > + ((struct tur_checker_context *)c- > > >context)->nr_timeouts = 0; > > + } > > > > ct = c->context; > > } else -- dm-devel mailing list dm-devel@xxxxxxxxxx https://listman.redhat.com/mailman/listinfo/dm-devel