Re: [PATCH 3/5] multipathd: refresh all priorities if one has changed

Benjamin Marzinski <bmarzins@xxxxxxxxxx> · Tue, 6 Jun 2023 10:33:08 -0500

On Tue, Jun 06, 2023 at 02:08:04PM +0000, Martin Wilck wrote:
> On Mon, 2023-06-05 at 13:22 -0500, Benjamin Marzinski wrote:
> > On Wed, May 31, 2023 at 04:27:25PM +0000, Martin Wilck wrote:
> > > On Wed, 2023-05-24 at 18:21 -0500, Benjamin Marzinski wrote:
> > > > For multipath devices with path group policies other than
> > > > group_by_prio,
> > > > multipathd wasn't updating all the paths' priorities when calling
> > > > need_switch_pathgroup(), even in cases where it likely was
> > > > necessary.
> > > > When a path just becomes usable again, all paths' priorities get
> > > > updated
> > > > by update_prio().  But if the priority changes on a path that is
> > > > already
> > > > up, the other paths' priorities only get updated if the multipath
> > > > device
> > > > uses the group_by_prio path_grouping_policy, otherwise
> > > > need_switch_pathgroup() is called with refresh set to 0. But if
> > > > the
> > > > priority of the checked path has changed, then likely so have the
> > > > priorities of other paths. Since the pathgroup's priority is the
> > > > average
> > > > of its paths' priorities, changing the priority of just one path
> > > > may
> > > > not
> > > > change the average enough to reorder the pathgroups.
> > > > 
> > > > Instead, set refresh in need_switch_pathgroup() if the priorty
> > > > has
> > > > changed to something other than PRIO_UNDEF (which usually means
> > > > an
> > > > error
> > > > has occured) and the priorities of the other paths haven't
> > > > already
> > > > been
> > > > updated by update_prio().
> > > > 
> > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > > > ---
> > > >  multipathd/main.c | 35 +++++++++++++++++++++--------------
> > > >  1 file changed, 21 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/multipathd/main.c b/multipathd/main.c
> > > > index bdeffe76..e7c272ad 100644
> > > > --- a/multipathd/main.c
> > > > +++ b/multipathd/main.c
> > > > @@ -2575,20 +2575,27 @@ check_path (struct vectors * vecs, struct
> > > > path * pp, unsigned int ticks)
> > > >  
> > > >         if (marginal_changed)
> > > >                 reload_and_sync_map(pp->mpp, vecs, 1);
> > > > -       else if (update_prio(pp, new_path_up) &&
> > > > -           (pp->mpp->pgpolicyfn == (pgpolicyfn *)group_by_prio)
> > > > &&
> > > > -            pp->mpp->pgfailback == -FAILBACK_IMMEDIATE) {
> > > > -               condlog(2, "%s: path priorities changed.
> > > > reloading",
> > > > -                       pp->mpp->alias);
> > > > -               reload_and_sync_map(pp->mpp, vecs, !new_path_up);
> > > > -       } else if (need_switch_pathgroup(pp->mpp, 0)) {
> > > > -               if (pp->mpp->pgfailback > 0 &&
> > > > -                   (new_path_up || pp->mpp->failback_tick <= 0))
> > > > -                       pp->mpp->failback_tick =
> > > > -                               pp->mpp->pgfailback + 1;
> > > > -               else if (pp->mpp->pgfailback == -
> > > > FAILBACK_IMMEDIATE
> > > > > > 
> > > > -                        (chkr_new_path_up &&
> > > > followover_should_failback(pp)))
> > > > -                       switch_pathgroup(pp->mpp);
> > > > +       else {
> > > > +               int prio_changed = update_prio(pp, new_path_up);
> > > > +               bool need_refresh = (!new_path_up && prio_changed
> > > > &&
> > > > +                                    pp->priority != PRIO_UNDEF);
> > > > +
> > > 
> > > I have always found it confusing that we recalculate the priorities
> > > in
> > > two functions (update_prio() and need_switch_pathgroup()), passing
> > > boolean flags back and forth. IMO we should move this logic to
> > > update_prio(), so that we don't need to refresh any priorities in
> > > need_switch_pathgroup() any more. after determining the prio of the
> > > "primary" path device, update_prio() has all the information
> > > it needs to figure out whether priorities of other paths must be
> > > refreshed.
> > > 
> > > That would even make the code easier to understand, IMO.
> > > 
> > > Regards
> > > Martin
> > 
> > So the difference in this code between when we choose to update all
> > the
> > paths' prios for the group_by_prio case, and when we choose to update
> > all the paths' prios for the other pgpolicies comes down to how we
> > treat
> > PRIO_UNDEF. I didn't change the group_by_prio behavior.
> 
> My comment may have caused confusion, sorry. I just wanted to point out
> that we could make the logic clearer by moving it into update_prio(),
> on top of what you did, as in "while we're at it". 
> 
> >  So right now,
> > for group_by_prio devices, we will update all the paths' priorities
> > if
> > the checked path switches priorities to PRIO_UNDEF. My question is,
> > "Is
> > this the right thing to do?"
> > 
> > In the best case, if the prioritizer fails on one path, it will fail
> > on
> > all the other paths in the pathgroup as well, so that they stay
> > together. In the worst case it will fail on paths in other
> > pathgroups,
> > so that incorrect paths get grouped together. Granted, I'm not sure
> > how
> > much of a difference it makes in the worst case, since the other
> > priorities would get checked eventually, and would get placed in the
> > wrong group then.
> 
> No matter what we do, it's always just the state at some point in time.
> If we update all priorities, we are as close to the "real" state of the
> hardware as possible, at this given instant. We don't know what's going
> to happen next. Paths could quickly recover and provide useful prio
> values, but they might as well not. Or their prio could change, and the
> value we just obtained would be obsolete. It makes no sense to reason
> about the "future".
> 
> > Perhaps it would be better to treat PRIO_UNDEF like PATH_PENDING,
> > where
> > we will continue to use the old priority if we get a PRIO_UNDEF
> > result.
> 
> Let's have a look where PRIO_UNDEF occurs. Unless I'm overlooking
> something, get_prio() returns PRIO_UNDEF if no prio algorithm is
> selected, or if the prio algo returns an error *and* the path state as
> returned by path_offline() is neither DOWN nor PENDING. From
> path_offline(), this means the state must be either PATH_REMOVED (no
> point in trying a assign a prio, UNDEF is ok) or PATH_UP, i.e.
> "running". The last case is strange. It can mean a very short-lived
> failure, in which case we could consider retrying prio_getprio() from
> get_prio(), or a general problem with the prio algorithm for the path,
> in which case UNDEF would again be ok (but then, how did the same
> prioritizer assign a valid priority previously?).
> 
> I think that none of these cases really justifies treating UNDEF like
> PENDING, except _maybe_ the "running" case. If that's agreed, we should
> probably just change the way this case is handled in get_prio().

I actually think that the only case where it might NOT make sense to
treat UNDEF like PENDING is the "running" case. If the device no longer
exists, I would argue that keeping the old priority is the best option.
Why do the extra work associated with a path changing priority, when in
reality the path is gone? In all the other cases, I don't see how a path
ever gets out of PRIO_UNDEF to start with. So keeping the old priority
is the same as just returning PRIO_UNDEF in those cases. The only case
were the PRIO_UNDEF is really interesting is the "running" case. If we
think that this is a short lived failure, then doing a bunch of work and
moving the paths around might just be wasted effort. When the failure
resolves, the path might still have the priority it previously had. If
the path wasn't previously in PRIO_UNDEF, then we can be pretty sure
that the existing priority is a temporary error, but we don't know how
long it error will last, and whether the priority after the error will
be the same or different than it was before.

I'm also not sure that retrying the priority immediately is the right
answer, since that might just waste more time and give us the same
result. Waiting till the next path check makes more sense to me. If this
is a very short lived error, and the path did change priority then we
may well not need to wait till the next path_check() for it to get
updated. Assuming this path changed its priority, then likely other
paths will be changing priorities too. When we check them, if the
failure has resolved, we will end up rechecking this path while
refreshing all the priorities.

I don't want to make it seem like I feel like this (option 4 below)
is the obviously right answer. But I do feel like it is just as valid an
option as 1) or 2).

> > The choices are:
> > 1. make both the group_by_prio and the non-group_by_prio cases
> > recheck
> >    all paths on PRIO_UNDEF (this keeps the group_by_prio behavior the
> >    same).
> > 2. make both cases NOT recheck all paths on PRIO_UNDEF.
> > 3. keep the destinction between the two (make update_prio() check the
> >    pgplicy, and act accordingly)
> > 4. Make paths keep their previous priority when they would have
> > returned
> >    PRIO_UNDEF, so we never switch to PRIO_UNDEF.
> > 
> > All the choices except 3 seem reasonable. 1 keeps things how they are
> > for group_by_prio. 2 leans towards moving PRIO_UNDEF paths out of
> > their
> > current pathgroup.  4 leans towards keeping PRIO_UNDEF paths in their
> > current pathgroup.
> 
> I agree that 3) makes no sense. I argued above that I don't think 4)
> does, either. Wrt 1) vs. 2), we should realize that the checker loop
> will eventually run over all paths anyway.  With 1) we are as close as
> possible to the kernel state at any instant, but we may recalculate
> priorities (and possibly regroup) repeatedly during a single checker
> loop iteration, which is suboptimal [1]. With 2), we might never have
> the map in a correct state, not even after the checker loop has
> finished.
> 
> I think we should go with 1), and consider a later change where we just
> set a marker in the checker loop, and do prio updates / path regrouping
> once per map after the checker loop has finished. This requires more
> changes for the checker loop though, and is out of scope for your
> current patch set.
> 
> I wouldn't worry too much about group_by_prio. Regrouping is by design
> with this grouping policy, and it's expected that this results in
> incorrect grouping, at least temporarily. Where this is undesirable, 
> your new group_by_tpg will come to the rescue.

I ended up going with 1) when in my next version, on the groups that
this keeps our behavior the same, and nobody has been complaining about
it.

> > The other question is, what do we do for the delayed case. Right now,
> > once we finish waiting for our delay in deferred_failback_tick(), we
> > automatically refresh the priorities of all the devices in our
> > need_switch_pathgroup() call.  We could add an update_prio() call
> > before
> > it to keep this behavior, but if we are already refreshing all the
> > paths' priorities when we need to, I'm not sure that it's necessary
> > to
> > do it again here.
> 
> Well if we calculated priorities in update_prio() only as I suggested
> in my previous post, we'd call update_prio() in this code path and
> change the way need_switch_pathgroup() works. But I admit I haven't
> thought this through, and it can be done in a separate set, anyway.

I ended up not refreshing the priorities in deferred_failback_tick(). We
will have already done it in check_path() if necessary, either when we
started the deferred timer, or if we notice it's necessary while waiting
for the timeout.

-Ben

> Regards
> Martin
> 
> [1] it seems dumb to reason about "performance" here, but we both know
> that execution time in the checker loop can become critical if there
> are lots of path devices.
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel