Re: [PATCH 3/5] multipathd: refresh all priorities if one has changed

Martin Wilck <martin.wilck@xxxxxxxx> · Tue, 6 Jun 2023 14:08:04 +0000

On Mon, 2023-06-05 at 13:22 -0500, Benjamin Marzinski wrote:
> On Wed, May 31, 2023 at 04:27:25PM +0000, Martin Wilck wrote:
> > On Wed, 2023-05-24 at 18:21 -0500, Benjamin Marzinski wrote:
> > > For multipath devices with path group policies other than
> > > group_by_prio,
> > > multipathd wasn't updating all the paths' priorities when calling
> > > need_switch_pathgroup(), even in cases where it likely was
> > > necessary.
> > > When a path just becomes usable again, all paths' priorities get
> > > updated
> > > by update_prio().  But if the priority changes on a path that is
> > > already
> > > up, the other paths' priorities only get updated if the multipath
> > > device
> > > uses the group_by_prio path_grouping_policy, otherwise
> > > need_switch_pathgroup() is called with refresh set to 0. But if
> > > the
> > > priority of the checked path has changed, then likely so have the
> > > priorities of other paths. Since the pathgroup's priority is the
> > > average
> > > of its paths' priorities, changing the priority of just one path
> > > may
> > > not
> > > change the average enough to reorder the pathgroups.
> > > 
> > > Instead, set refresh in need_switch_pathgroup() if the priorty
> > > has
> > > changed to something other than PRIO_UNDEF (which usually means
> > > an
> > > error
> > > has occured) and the priorities of the other paths haven't
> > > already
> > > been
> > > updated by update_prio().
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > > ---
> > >  multipathd/main.c | 35 +++++++++++++++++++++--------------
> > >  1 file changed, 21 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/multipathd/main.c b/multipathd/main.c
> > > index bdeffe76..e7c272ad 100644
> > > --- a/multipathd/main.c
> > > +++ b/multipathd/main.c
> > > @@ -2575,20 +2575,27 @@ check_path (struct vectors * vecs, struct
> > > path * pp, unsigned int ticks)
> > >  
> > >         if (marginal_changed)
> > >                 reload_and_sync_map(pp->mpp, vecs, 1);
> > > -       else if (update_prio(pp, new_path_up) &&
> > > -           (pp->mpp->pgpolicyfn == (pgpolicyfn *)group_by_prio)
> > > &&
> > > -            pp->mpp->pgfailback == -FAILBACK_IMMEDIATE) {
> > > -               condlog(2, "%s: path priorities changed.
> > > reloading",
> > > -                       pp->mpp->alias);
> > > -               reload_and_sync_map(pp->mpp, vecs, !new_path_up);
> > > -       } else if (need_switch_pathgroup(pp->mpp, 0)) {
> > > -               if (pp->mpp->pgfailback > 0 &&
> > > -                   (new_path_up || pp->mpp->failback_tick <= 0))
> > > -                       pp->mpp->failback_tick =
> > > -                               pp->mpp->pgfailback + 1;
> > > -               else if (pp->mpp->pgfailback == -
> > > FAILBACK_IMMEDIATE
> > > > > 
> > > -                        (chkr_new_path_up &&
> > > followover_should_failback(pp)))
> > > -                       switch_pathgroup(pp->mpp);
> > > +       else {
> > > +               int prio_changed = update_prio(pp, new_path_up);
> > > +               bool need_refresh = (!new_path_up && prio_changed
> > > &&
> > > +                                    pp->priority != PRIO_UNDEF);
> > > +
> > 
> > I have always found it confusing that we recalculate the priorities
> > in
> > two functions (update_prio() and need_switch_pathgroup()), passing
> > boolean flags back and forth. IMO we should move this logic to
> > update_prio(), so that we don't need to refresh any priorities in
> > need_switch_pathgroup() any more. after determining the prio of the
> > "primary" path device, update_prio() has all the information
> > it needs to figure out whether priorities of other paths must be
> > refreshed.
> > 
> > That would even make the code easier to understand, IMO.
> > 
> > Regards
> > Martin
> 
> So the difference in this code between when we choose to update all
> the
> paths' prios for the group_by_prio case, and when we choose to update
> all the paths' prios for the other pgpolicies comes down to how we
> treat
> PRIO_UNDEF. I didn't change the group_by_prio behavior.

My comment may have caused confusion, sorry. I just wanted to point out
that we could make the logic clearer by moving it into update_prio(),
on top of what you did, as in "while we're at it". 

>  So right now,
> for group_by_prio devices, we will update all the paths' priorities
> if
> the checked path switches priorities to PRIO_UNDEF. My question is,
> "Is
> this the right thing to do?"
> 
> In the best case, if the prioritizer fails on one path, it will fail
> on
> all the other paths in the pathgroup as well, so that they stay
> together. In the worst case it will fail on paths in other
> pathgroups,
> so that incorrect paths get grouped together. Granted, I'm not sure
> how
> much of a difference it makes in the worst case, since the other
> priorities would get checked eventually, and would get placed in the
> wrong group then.

No matter what we do, it's always just the state at some point in time.
If we update all priorities, we are as close to the "real" state of the
hardware as possible, at this given instant. We don't know what's going
to happen next. Paths could quickly recover and provide useful prio
values, but they might as well not. Or their prio could change, and the
value we just obtained would be obsolete. It makes no sense to reason
about the "future".

> Perhaps it would be better to treat PRIO_UNDEF like PATH_PENDING,
> where
> we will continue to use the old priority if we get a PRIO_UNDEF
> result.

Let's have a look where PRIO_UNDEF occurs. Unless I'm overlooking
something, get_prio() returns PRIO_UNDEF if no prio algorithm is
selected, or if the prio algo returns an error *and* the path state as
returned by path_offline() is neither DOWN nor PENDING. From
path_offline(), this means the state must be either PATH_REMOVED (no
point in trying a assign a prio, UNDEF is ok) or PATH_UP, i.e.
"running". The last case is strange. It can mean a very short-lived
failure, in which case we could consider retrying prio_getprio() from
get_prio(), or a general problem with the prio algorithm for the path,
in which case UNDEF would again be ok (but then, how did the same
prioritizer assign a valid priority previously?).

I think that none of these cases really justifies treating UNDEF like
PENDING, except _maybe_ the "running" case. If that's agreed, we should
probably just change the way this case is handled in get_prio().

> The choices are:
> 1. make both the group_by_prio and the non-group_by_prio cases
> recheck
>    all paths on PRIO_UNDEF (this keeps the group_by_prio behavior the
>    same).
> 2. make both cases NOT recheck all paths on PRIO_UNDEF.
> 3. keep the destinction between the two (make update_prio() check the
>    pgplicy, and act accordingly)
> 4. Make paths keep their previous priority when they would have
> returned
>    PRIO_UNDEF, so we never switch to PRIO_UNDEF.
> 
> All the choices except 3 seem reasonable. 1 keeps things how they are
> for group_by_prio. 2 leans towards moving PRIO_UNDEF paths out of
> their
> current pathgroup.  4 leans towards keeping PRIO_UNDEF paths in their
> current pathgroup.

I agree that 3) makes no sense. I argued above that I don't think 4)
does, either. Wrt 1) vs. 2), we should realize that the checker loop
will eventually run over all paths anyway.  With 1) we are as close as
possible to the kernel state at any instant, but we may recalculate
priorities (and possibly regroup) repeatedly during a single checker
loop iteration, which is suboptimal [1]. With 2), we might never have
the map in a correct state, not even after the checker loop has
finished.

I think we should go with 1), and consider a later change where we just
set a marker in the checker loop, and do prio updates / path regrouping
once per map after the checker loop has finished. This requires more
changes for the checker loop though, and is out of scope for your
current patch set.

I wouldn't worry too much about group_by_prio. Regrouping is by design
with this grouping policy, and it's expected that this results in
incorrect grouping, at least temporarily. Where this is undesirable, 
your new group_by_tpg will come to the rescue.

> The other question is, what do we do for the delayed case. Right now,
> once we finish waiting for our delay in deferred_failback_tick(), we
> automatically refresh the priorities of all the devices in our
> need_switch_pathgroup() call.  We could add an update_prio() call
> before
> it to keep this behavior, but if we are already refreshing all the
> paths' priorities when we need to, I'm not sure that it's necessary
> to
> do it again here.

Well if we calculated priorities in update_prio() only as I suggested
in my previous post, we'd call update_prio() in this code path and
change the way need_switch_pathgroup() works. But I admit I haven't
thought this through, and it can be done in a separate set, anyway.

Regards
Martin

[1] it seems dumb to reason about "performance" here, but we both know
that execution time in the checker loop can become critical if there
are lots of path devices.

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel