Re: [PATCH] multipathd: Make sure to disable queueing if recovery has failed.

Benjamin Marzinski <bmarzins@xxxxxxxxxx> · Mon, 4 Dec 2023 18:01:23 -0500

On Mon, Dec 04, 2023 at 09:43:58PM +0100, Martin Wilck wrote:
> On Mon, 2023-12-04 at 15:00 -0500, Benjamin Marzinski wrote:
> > On Mon, Dec 04, 2023 at 03:38:03PM +0100, Martin Wilck wrote:
> > > On Mon, 2023-11-27 at 16:54 -0500, Benjamin Marzinski wrote:
> > > > If a multipath device has no_path_retry set to a number and has
> > > > lost
> > > > all
> > > > paths, gone into recovery mode, and timed out, it will disable
> > > > queue_if_no_paths. After that, if one of those failed paths is
> > > > removed,
> > > > when the device is reloaded, queue_if_no_paths will be re-
> > > > enabled. 
> > > > When
> > > > set_no_path_retry() is then called to update the queueing state,
> > > > it
> > > > will
> > > > not disable queue_if_no_paths, since the device is still in the
> > > > recovery
> > > > state, so it believes no work needs to be done. The device will
> > > > remain
> > > > in the recovery state, with retry_ticks at 0, and queueing
> > > > enabled,
> > > > even though there are no usable paths.
> > > > 
> > > > To fix this, in set_no_path_retry(), if no_path_retry is set to a
> > > > number
> > > > and the device is queueing but it is in recovery mode and out of
> > > > retries with no usable paths, manually disable queue_if_no_path.
> > > > 
> > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > > 
> > > Thanks, this makes sense. I am just wondering if this logic should
> > > be moved to enter_recovery_mode() / leave_recovery_mode() instead.
> > > That pair of functions is also called from
> > > update_queue_mode{add,del}_path(), where the same reasoning would
> > > apply, afaics. Am I overlooking something?
> > 
> > In general, I think it's safe to only do this in set_no_path_retry().
> > The functions that call update_queue_mode{add,del}_path() are:
> > fail_path()
> > reinstate_path()
> > update_multipath()
> > io_err_stat_handle_pathfail()
> > 
> > fail_path() and reinstate_path() will always call the
> > update_queue_mode_* function after calling set_no_path_retry(), so
> > the
> > device will already be in the correct state. update_multipath()
> > usually
> > will as well (except when it's called by a command line handler), and
> > further, it will only call update_queue_mode_del_path() when it is
> > actually failing a path, so the multipath device shouldn't already be
> > in recovery_mode. The same goes for io_err_stat_handle_pathfail().
> > 
> > Also we need to make sure that mpp->features is uptodate before doing
> > these checks, so that we know what the current queueing state is.
> > io_err_stat_handle_pathfail() doesn't update mpp->features before
> > calling update_queue_mode_del_path(), so it could end up doing the
> > wrong
> > thing.
> > 
> > And set_no_path_retry() will be called (after updating mpp->features)
> > the next time the path checker completes for a path in the device, so
> > devices won't remain in a bad state long, no matter what.
> > 
> > What looking at this did make me realize is that
> > cli_restore_queueing()
> > and cli_restore_all_queueing() don't update the device strings before
> > calling set_no_path_retry(). Thinking about this, it seems like they
> > should call update_multipath(vecs, mpp->alias, 1) instead, to trigger
> > the set_no_path_retry() call after updating mpp->features.
> > 
> > I'll send another patch to fix cli_restore_queueing() and
> > cli_restore_all_queueing(). If you think it's necessary, I can also
> > send
> > one to make io_err_stat_handle_pathfail() call update_multipath() and
> > then move the code to manually change the queueing state to
> > enter_recovery_mode() and leave_recovery_mode(), instead of doing it
> > in
> > set_no_path_retry().
> > 
> 
> Thanks for working out all the different cases. Personally, I would
> like to have this logic in one place, if possible. I'll leave it to
> your judgement, as you've taken a closer look than myself.
> 
> I'm not sure what you mean with "mpp->features is up to date"; are you
> considering some entity applying changes to the mpp features without
> involving multipathd, e.g. by using direct DM ioctls or dmsetup?

Or right after multipathd calls dm_queue_if_no_path(). We don't
immediately update mpp->features to reflect the changed
queue_if_no_paths setting of the device. So there are windows after
someone, including multipathd, has changed the queue_if_no_paths setting
of the device, during which checking mpp->features won't give the
correct answer unless we first reread the device table. Obviously, if
something other than multipathd has changed this, then there's always
the possiblity of a race, but we can make sure that we don't race with
ourselves.

For instance, the way things are right now, running:

# multipathd disablequeueing maps; multipathd restorequeueing maps

will occasionally leave the devices not queueing until the next time a
path in them is checked, and set_no_path_retry actually restores
queueing.  But I'm actually going to deal with this differently. I'll
send a patchset shortly.

-Ben

> 
> Regards
> Martin
> 
> > -Ben
> > 
> > > 
> > > Regards
> > > Martin
> > > 
> > > > ---
> > > >  libmultipath/structs_vec.c | 12 +++++++++++-
> > > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/libmultipath/structs_vec.c
> > > > b/libmultipath/structs_vec.c
> > > > index 0e8a46e7..3cb23c73 100644
> > > > --- a/libmultipath/structs_vec.c
> > > > +++ b/libmultipath/structs_vec.c
> > > > @@ -627,8 +627,18 @@ void set_no_path_retry(struct multipath
> > > > *mpp)
> > > >  			    !mpp->in_recovery)
> > > >  				dm_queue_if_no_path(mpp->alias,
> > > > 1);
> > > >  			leave_recovery_mode(mpp);
> > > > -		} else if (pathcount(mpp, PATH_PENDING) == 0)
> > > > +		} else if (pathcount(mpp, PATH_PENDING) == 0) {
> > > > +			/*
> > > > +			 * If in_recovery is set,
> > > > enter_recovery_mode does
> > > > +			 * nothing. If the device is already in
> > > > recovery
> > > > +			 * mode and has already timed out,
> > > > manually
> > > > call
> > > > +			 * dm_queue_if_no_path to stop it from
> > > > queueing.
> > > > +			 */
> > > > +			if ((!mpp->features || is_queueing) &&
> > > > +			    mpp->in_recovery && mpp->retry_tick
> > > > ==
> > > > 0)
> > > > +				dm_queue_if_no_path(mpp->alias,
> > > > 0);
> > > >  			enter_recovery_mode(mpp);
> > > > +		}
> > > >  		break;
> > > >  	}
> > > >  }
> >