Re: [PATCH] multipathd: Make sure to disable queueing if recovery has failed.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Dec 04, 2023 at 03:38:03PM +0100, Martin Wilck wrote:
> On Mon, 2023-11-27 at 16:54 -0500, Benjamin Marzinski wrote:
> > If a multipath device has no_path_retry set to a number and has lost
> > all
> > paths, gone into recovery mode, and timed out, it will disable
> > queue_if_no_paths. After that, if one of those failed paths is
> > removed,
> > when the device is reloaded, queue_if_no_paths will be re-enabled. 
> > When
> > set_no_path_retry() is then called to update the queueing state, it
> > will
> > not disable queue_if_no_paths, since the device is still in the
> > recovery
> > state, so it believes no work needs to be done. The device will
> > remain
> > in the recovery state, with retry_ticks at 0, and queueing enabled,
> > even though there are no usable paths.
> > 
> > To fix this, in set_no_path_retry(), if no_path_retry is set to a
> > number
> > and the device is queueing but it is in recovery mode and out of
> > retries with no usable paths, manually disable queue_if_no_path.
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> 
> Thanks, this makes sense. I am just wondering if this logic should
> be moved to enter_recovery_mode() / leave_recovery_mode() instead.
> That pair of functions is also called from
> update_queue_mode{add,del}_path(), where the same reasoning would
> apply, afaics. Am I overlooking something?

In general, I think it's safe to only do this in set_no_path_retry().
The functions that call update_queue_mode{add,del}_path() are:
fail_path()
reinstate_path()
update_multipath()
io_err_stat_handle_pathfail()

fail_path() and reinstate_path() will always call the
update_queue_mode_* function after calling set_no_path_retry(), so the
device will already be in the correct state. update_multipath() usually
will as well (except when it's called by a command line handler), and
further, it will only call update_queue_mode_del_path() when it is
actually failing a path, so the multipath device shouldn't already be
in recovery_mode. The same goes for io_err_stat_handle_pathfail().

Also we need to make sure that mpp->features is uptodate before doing
these checks, so that we know what the current queueing state is.
io_err_stat_handle_pathfail() doesn't update mpp->features before
calling update_queue_mode_del_path(), so it could end up doing the wrong
thing.

And set_no_path_retry() will be called (after updating mpp->features)
the next time the path checker completes for a path in the device, so
devices won't remain in a bad state long, no matter what.

What looking at this did make me realize is that cli_restore_queueing()
and cli_restore_all_queueing() don't update the device strings before
calling set_no_path_retry(). Thinking about this, it seems like they
should call update_multipath(vecs, mpp->alias, 1) instead, to trigger
the set_no_path_retry() call after updating mpp->features.

I'll send another patch to fix cli_restore_queueing() and
cli_restore_all_queueing(). If you think it's necessary, I can also send
one to make io_err_stat_handle_pathfail() call update_multipath() and
then move the code to manually change the queueing state to
enter_recovery_mode() and leave_recovery_mode(), instead of doing it in
set_no_path_retry().

-Ben

> 
> Regards
> Martin
> 
> > ---
> >  libmultipath/structs_vec.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
> > index 0e8a46e7..3cb23c73 100644
> > --- a/libmultipath/structs_vec.c
> > +++ b/libmultipath/structs_vec.c
> > @@ -627,8 +627,18 @@ void set_no_path_retry(struct multipath *mpp)
> >  			    !mpp->in_recovery)
> >  				dm_queue_if_no_path(mpp->alias, 1);
> >  			leave_recovery_mode(mpp);
> > -		} else if (pathcount(mpp, PATH_PENDING) == 0)
> > +		} else if (pathcount(mpp, PATH_PENDING) == 0) {
> > +			/*
> > +			 * If in_recovery is set,
> > enter_recovery_mode does
> > +			 * nothing. If the device is already in
> > recovery
> > +			 * mode and has already timed out, manually
> > call
> > +			 * dm_queue_if_no_path to stop it from
> > queueing.
> > +			 */
> > +			if ((!mpp->features || is_queueing) &&
> > +			    mpp->in_recovery && mpp->retry_tick ==
> > 0)
> > +				dm_queue_if_no_path(mpp->alias, 0);
> >  			enter_recovery_mode(mpp);
> > +		}
> >  		break;
> >  	}
> >  }





[Index of Archives]     [DM Crypt]     [Fedora Desktop]     [ATA RAID]     [Fedora Marketing]     [Fedora Packaging]     [Fedora SELinux]     [Yosemite Discussion]     [KDE Users]     [Fedora Docs]

  Powered by Linux