Re: [PATCH 3/3] libmultipath: reset nr_timeouts if we freed the context

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Mar 08, 2023 at 06:17:05PM +0000, Martin Wilck wrote:
> On Tue, 2023-03-07 at 16:49 -0600, Benjamin Marzinski wrote:
> > If a the tur checker creates a new context because an old thread is
> > still running, but the old thread finishes before the checker drops
> > the old context, the checker should reset nr_timeouts to 0, since
> > the old thread did complete in time.
> 
> This looks strange to me. First of all, the old thread _did_ timeout,
> otherwise we wouldn't be in this code path at all. And even if you say
> this timeout shouldn't count, as the thread isn't in stalled state any
> more, shouldn't you just decrease nr_timeouts by 1? The fact that
> _this_ thread terminated doesn't tell us anything about other hanging
> threads.

We can't track the old threads once we drop the old context.  So
instead, we just assume that if the last thread we created did complete
we're back to a normal state, and reset nr_timeouts to 0. This change is
just extending that same logic to the case where the old thread didn't
complete until the last possible moment where we could find that out.

The MAX_NR_TIMEOUTS code was added because a user ran into a situation
where a scsi kernel issue made TUR commands block uninterruptibly
forever, and multipathd kept making more threads, tens of thousnds of
them. The goal was to stop creating threads if two in a row haven't
completed, and wait for the last thread to complete. It's not the most
robust solution possible, but I think it's good enough. We could revist
the whole solution, but the way things are, nr_timeouts = 1 was supposed
to mean that, as far as we know, the last thread was cancelled but
didn't complete, and we are going to try one more thread before stopping
and waiting. I'm just fixing to code to take care of the corner case
where we find out that the last thread did complete when we drop the old
context. 

-Ben
 
> Martin
> 
> 
> 
> 
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > ---
> >  libmultipath/checkers/tur.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/libmultipath/checkers/tur.c
> > b/libmultipath/checkers/tur.c
> > index a19becf5..fe6a2f14 100644
> > --- a/libmultipath/checkers/tur.c
> > +++ b/libmultipath/checkers/tur.c
> > @@ -406,9 +406,11 @@ int libcheck_check(struct checker * c)
> >                         }
> >                         ((struct tur_checker_context *)c->context)-
> > >nr_timeouts = ct->nr_timeouts;
> >  
> > -                       if (!uatomic_sub_return(&ct->holders, 1))
> > +                       if (!uatomic_sub_return(&ct->holders, 1)) {
> >                                 /* It did terminate, eventually */
> >                                 cleanup_context(ct);
> > +                               ((struct tur_checker_context *)c-
> > >context)->nr_timeouts = 0;
> > +                       }
> >  
> >                         ct = c->context;
> >                 } else
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel





[Index of Archives]     [DM Crypt]     [Fedora Desktop]     [ATA RAID]     [Fedora Marketing]     [Fedora Packaging]     [Fedora SELinux]     [Yosemite Discussion]     [KDE Users]     [Fedora Docs]

  Powered by Linux