On Mon, Dec 04, 2023 at 09:43:58PM +0100, Martin Wilck wrote: > On Mon, 2023-12-04 at 15:00 -0500, Benjamin Marzinski wrote: > > On Mon, Dec 04, 2023 at 03:38:03PM +0100, Martin Wilck wrote: > > > On Mon, 2023-11-27 at 16:54 -0500, Benjamin Marzinski wrote: > > > > If a multipath device has no_path_retry set to a number and has > > > > lost > > > > all > > > > paths, gone into recovery mode, and timed out, it will disable > > > > queue_if_no_paths. After that, if one of those failed paths is > > > > removed, > > > > when the device is reloaded, queue_if_no_paths will be re- > > > > enabled. > > > > When > > > > set_no_path_retry() is then called to update the queueing state, > > > > it > > > > will > > > > not disable queue_if_no_paths, since the device is still in the > > > > recovery > > > > state, so it believes no work needs to be done. The device will > > > > remain > > > > in the recovery state, with retry_ticks at 0, and queueing > > > > enabled, > > > > even though there are no usable paths. > > > > > > > > To fix this, in set_no_path_retry(), if no_path_retry is set to a > > > > number > > > > and the device is queueing but it is in recovery mode and out of > > > > retries with no usable paths, manually disable queue_if_no_path. > > > > > > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx> > > > > > > Thanks, this makes sense. I am just wondering if this logic should > > > be moved to enter_recovery_mode() / leave_recovery_mode() instead. > > > That pair of functions is also called from > > > update_queue_mode{add,del}_path(), where the same reasoning would > > > apply, afaics. Am I overlooking something? > > > > In general, I think it's safe to only do this in set_no_path_retry(). > > The functions that call update_queue_mode{add,del}_path() are: > > fail_path() > > reinstate_path() > > update_multipath() > > io_err_stat_handle_pathfail() > > > > fail_path() and reinstate_path() will always call the > > update_queue_mode_* function after calling set_no_path_retry(), so > > the > > device will already be in the correct state. update_multipath() > > usually > > will as well (except when it's called by a command line handler), and > > further, it will only call update_queue_mode_del_path() when it is > > actually failing a path, so the multipath device shouldn't already be > > in recovery_mode. The same goes for io_err_stat_handle_pathfail(). > > > > Also we need to make sure that mpp->features is uptodate before doing > > these checks, so that we know what the current queueing state is. > > io_err_stat_handle_pathfail() doesn't update mpp->features before > > calling update_queue_mode_del_path(), so it could end up doing the > > wrong > > thing. > > > > And set_no_path_retry() will be called (after updating mpp->features) > > the next time the path checker completes for a path in the device, so > > devices won't remain in a bad state long, no matter what. > > > > What looking at this did make me realize is that > > cli_restore_queueing() > > and cli_restore_all_queueing() don't update the device strings before > > calling set_no_path_retry(). Thinking about this, it seems like they > > should call update_multipath(vecs, mpp->alias, 1) instead, to trigger > > the set_no_path_retry() call after updating mpp->features. > > > > I'll send another patch to fix cli_restore_queueing() and > > cli_restore_all_queueing(). If you think it's necessary, I can also > > send > > one to make io_err_stat_handle_pathfail() call update_multipath() and > > then move the code to manually change the queueing state to > > enter_recovery_mode() and leave_recovery_mode(), instead of doing it > > in > > set_no_path_retry(). > > > > Thanks for working out all the different cases. Personally, I would > like to have this logic in one place, if possible. I'll leave it to > your judgement, as you've taken a closer look than myself. > > I'm not sure what you mean with "mpp->features is up to date"; are you > considering some entity applying changes to the mpp features without > involving multipathd, e.g. by using direct DM ioctls or dmsetup? Or right after multipathd calls dm_queue_if_no_path(). We don't immediately update mpp->features to reflect the changed queue_if_no_paths setting of the device. So there are windows after someone, including multipathd, has changed the queue_if_no_paths setting of the device, during which checking mpp->features won't give the correct answer unless we first reread the device table. Obviously, if something other than multipathd has changed this, then there's always the possiblity of a race, but we can make sure that we don't race with ourselves. For instance, the way things are right now, running: # multipathd disablequeueing maps; multipathd restorequeueing maps will occasionally leave the devices not queueing until the next time a path in them is checked, and set_no_path_retry actually restores queueing. But I'm actually going to deal with this differently. I'll send a patchset shortly. -Ben > > Regards > Martin > > > -Ben > > > > > > > > Regards > > > Martin > > > > > > > --- > > > > libmultipath/structs_vec.c | 12 +++++++++++- > > > > 1 file changed, 11 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/libmultipath/structs_vec.c > > > > b/libmultipath/structs_vec.c > > > > index 0e8a46e7..3cb23c73 100644 > > > > --- a/libmultipath/structs_vec.c > > > > +++ b/libmultipath/structs_vec.c > > > > @@ -627,8 +627,18 @@ void set_no_path_retry(struct multipath > > > > *mpp) > > > > !mpp->in_recovery) > > > > dm_queue_if_no_path(mpp->alias, > > > > 1); > > > > leave_recovery_mode(mpp); > > > > - } else if (pathcount(mpp, PATH_PENDING) == 0) > > > > + } else if (pathcount(mpp, PATH_PENDING) == 0) { > > > > + /* > > > > + * If in_recovery is set, > > > > enter_recovery_mode does > > > > + * nothing. If the device is already in > > > > recovery > > > > + * mode and has already timed out, > > > > manually > > > > call > > > > + * dm_queue_if_no_path to stop it from > > > > queueing. > > > > + */ > > > > + if ((!mpp->features || is_queueing) && > > > > + mpp->in_recovery && mpp->retry_tick > > > > == > > > > 0) > > > > + dm_queue_if_no_path(mpp->alias, > > > > 0); > > > > enter_recovery_mode(mpp); > > > > + } > > > > break; > > > > } > > > > } > >