How long does recovery take? I am unclear on when the explicitly set queue_if_no_path is being overridden, and why disabling it is useful.

Typically we set queue_if_no_path to queue forever, with the intent that the device will survive longer storage and/or disk issues without returning an error to the application and/or corrupting filesystems, provided the storage issue can be fixed. We generally expect it to recover when the storage comes back, but the storage could be experiencing significant issues for a significant period of time (>10 minutes). Since the storage has to be fixed to get things working again anyway, there is a lot of negative value in returning an error, because that requires manual recovery steps (fsck, loss of data).

We also manually disable queueing if we need to remove the mpath devices (the paths are already gone, as they were non-responsive for > 24 hours and were removed via tmo_timeout), and/or forcibly reboot the nodes when we determine the storage is not coming back.

On Mon, Nov 27, 2023 at 3:54 PM Benjamin Marzinski <bmarzins@xxxxxxxxxx> wrote:
>
> If a multipath device has no_path_retry set to a number and has lost all
> paths, gone into recovery mode, and timed out, it will disable
> queue_if_no_paths. After that, if one of those failed paths is removed,
> when the device is reloaded, queue_if_no_paths will be re-enabled. When
> set_no_path_retry() is then called to update the queueing state, it will
> not disable queue_if_no_paths, since the device is still in the recovery
> state, so it believes no work needs to be done. The device will remain
> in the recovery state, with retry_ticks at 0, and queueing enabled,
> even though there are no usable paths.
>
> To fix this, in set_no_path_retry(), if no_path_retry is set to a number
> and the device is queueing but it is in recovery mode and out of
> retries with no usable paths, manually disable queue_if_no_path.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> ---
>  libmultipath/structs_vec.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
> index 0e8a46e7..3cb23c73 100644
> --- a/libmultipath/structs_vec.c
> +++ b/libmultipath/structs_vec.c
> @@ -627,8 +627,18 @@ void set_no_path_retry(struct multipath *mpp)
>  			    !mpp->in_recovery)
>  				dm_queue_if_no_path(mpp->alias, 1);
>  			leave_recovery_mode(mpp);
> -		} else if (pathcount(mpp, PATH_PENDING) == 0)
> +		} else if (pathcount(mpp, PATH_PENDING) == 0) {
> +			/*
> +			 * If in_recovery is set, enter_recovery_mode does
> +			 * nothing. If the device is already in recovery
> +			 * mode and has already timed out, manually call
> +			 * dm_queue_if_no_path to stop it from queueing.
> +			 */
> +			if ((!mpp->features || is_queueing) &&
> +			    mpp->in_recovery && mpp->retry_tick == 0)
> +				dm_queue_if_no_path(mpp->alias, 0);
>  			enter_recovery_mode(mpp);
> +		}
>  		break;
>  	}
>  }
> --
> 2.41.0
>
>
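
For reference, the sequence described in the commit message can be sketched as a small toy model. This is not the libmultipath code: struct toy_mpp and the toy_* functions are made-up names for illustration only, the tick arithmetic is simplified, and the kernel's queue_if_no_path state is modelled as a plain bool. It only tries to show why the device ends up queueing with no retries left, and how the extra check changes that.

```c
#include <stdbool.h>

/* Toy stand-in for struct multipath: only the state the patch cares about. */
struct toy_mpp {
	int  no_path_retry; /* numeric retry count (> 0) */
	bool in_recovery;   /* device is in recovery mode */
	int  retry_tick;    /* reaches 0 when the retries run out */
	bool queueing;      /* models queue_if_no_path on the dm table */
};

/* As the patch comment notes, this does nothing if in_recovery is set. */
static void toy_enter_recovery_mode(struct toy_mpp *m)
{
	if (m->in_recovery)
		return;
	m->in_recovery = true;
	m->retry_tick = m->no_path_retry; /* simplified tick arithmetic */
}

/* Models only the "no usable paths" branch, with or without the patch. */
static void toy_set_no_path_retry(struct toy_mpp *m, bool patched)
{
	if (patched && m->queueing && m->in_recovery && m->retry_tick == 0)
		m->queueing = false; /* stands in for dm_queue_if_no_path(alias, 0) */
	toy_enter_recovery_mode(m);
}

/* Replays the sequence from the commit message; returns the final state. */
static bool toy_queueing_after_reload(bool patched)
{
	struct toy_mpp m = { .no_path_retry = 5, .queueing = true };

	toy_set_no_path_retry(&m, patched); /* all paths lost: enter recovery */

	m.retry_tick = 0;                   /* recovery times out ... */
	m.queueing = false;                 /* ... and queueing is disabled */

	m.queueing = true;                  /* failed path removed, table reloaded */

	toy_set_no_path_retry(&m, patched); /* queueing state is re-evaluated */
	return m.queueing;
}
```

In this model, toy_queueing_after_reload(false) leaves queueing enabled with retry_tick at 0, which is the stuck state the patch describes, while toy_queueing_after_reload(true) ends with queueing disabled.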