On Thu, Dec 14, 2023 at 04:11:32PM +0100, Martin Wilck wrote: > On Tue, 2023-12-05 at 18:57 -0500, Benjamin Marzinski wrote: > > If a multipath device has no_path_retry set to a number and has lost > > all > > paths, gone into recovery mode, and timed out, it will disable > > queue_if_no_paths. After that, if one of those failed paths is > > removed, > > when the device is reloaded, queue_if_no_paths will be re-enabled. > > When > > set_no_path_retry() is then called to update the queueing state, it > > will > > not disable queue_if_no_paths, since the device is still in the > > recovery > > state, so it believes no work needs to be done. The device will > > remain > > in the recovery state, with retry_ticks at 0, and queueing enabled, > > even though there are no usable paths. > > > > To fix this, in set_no_path_retry(), if no_path_retry is set to a > > number > > and the device is queueing but it is in recovery mode and out of > > retries with no usable paths, manually disable queue_if_no_path. > > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx> > > --- > > libmultipath/structs_vec.c | 23 ++++++++++++++++++++++- > > 1 file changed, 22 insertions(+), 1 deletion(-) > > > > diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c > > index 0e8a46e7..9345b622 100644 > > --- a/libmultipath/structs_vec.c > > +++ b/libmultipath/structs_vec.c > > @@ -627,8 +627,18 @@ void set_no_path_retry(struct multipath *mpp) > > !mpp->in_recovery) > > dm_queue_if_no_path(mpp->alias, 1); > > leave_recovery_mode(mpp); > > - } else if (pathcount(mpp, PATH_PENDING) == 0) > > + } else if (pathcount(mpp, PATH_PENDING) == 0) { > > + /* > > + * If in_recovery is set, > > enter_recovery_mode does > > + * nothing. If the device is already in > > recovery > > + * mode and has already timed out, manually > > call > > + * dm_queue_if_no_path to stop it from > > queueing. > > + */ > > + if ((!mpp->features || is_queueing) && > > + mpp->in_recovery && mpp->retry_tick == > > 0) > > + dm_queue_if_no_path(mpp->alias, 0); > > I don't understand why you put this statement in the pathcount(mpp, > PATH_PENDING) == 0 clause. In 7278108 ("libmultipath: don't enter > recovery mode with pending paths"), you argued correctly that recovery > mode should not be *entered* with pending paths, but here we are in > recovery mode already. I see your point. But I think the only way this could happen is if you added a new path to a multipath device that was already in recovery mode. If the path checker didn't finish quickly enough during path addition, the path will be in PATH_PENDING after the reload. The fact that you are adding the path makes it seem more likely that the path is working. In this case, it seems a little wrong to have a multipath device that starts queueing on the reload, and then disables queueing, only to reenable it again a second later when the pending path is rechecked and is found to be up. I think it's reasonable to wait for the PENDING state to get resolved before you take any actions here. But I don't really have strong feelings about this, so if you do, I'm fine with changing it. > Also, what do you think about the following? This way, we'd avoid > enabling queueing temporarily when we reload the map. > > diff --git a/libmultipath/dmparser.c b/libmultipath/dmparser.c > index 3b37a92..3d85e6e 100644 > --- a/libmultipath/dmparser.c > +++ b/libmultipath/dmparser.c > @@ -60,9 +60,19 @@ int assemble_map(struct multipath *mp, char **params) > nr_priority_groups = VECTOR_SIZE(mp->pg); > initial_pg_nr = (nr_priority_groups ? mp->bestpg : 0); > > - if (mp->no_path_retry != NO_PATH_RETRY_UNDEF && > - mp->no_path_retry != NO_PATH_RETRY_FAIL) { > + switch (mp->no_path_retry) { > + case NO_PATH_RETRY_UNDEF: > + case NO_PATH_RETRY_FAIL: > + break; > + default: > + /* don't enable queueing if no_path_retry has timed out */ > + if (mp->in_recovery && mp->retry_tick == 0 && > + count_active_paths(mp) == 0) > + break; > + /* fallthrough */ > + case NO_PATH_RETRY_QUEUE: > add_feature(&mp->features, no_path_retry); > + break; > } > if (mp->retain_hwhandler == RETAIN_HWHANDLER_ON && > get_linux_version_code() < KERNEL_VERSION(4, 3, 0)) This looks good. -Ben > > > Regards > Martin