On Thu, Jun 27, 2024 at 02:33:00PM -0400, Benjamin Marzinski wrote: > On Thu, Jun 27, 2024 at 09:47:33AM +0200, Martin Wilck wrote: > > On Wed, 2024-06-05 at 19:22 -0400, Benjamin Marzinski wrote: > > > Make sure that all callers of set_no_path_retry() update the > > > multipath->features string from the kernel before calling it, so that > > > they have the current queueing state to check. > > > > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx> > > > > I am not sure about this one. > > > > refresh_multipath() is not a cheap operation, requiring at least 3 > > device-mapper ioctls, and we hold the multipath lock while we call it, > > meaning that multipathd can't take care of other important things while > > it's handling this CLI request. > > > > OTOH, the kernel only changes the features string when the map is > > reloaded. Nobody should be reloading the map under normal conditions > > anyway, and if they do, we would see an uevent shortly afterwards. IOW, > > I think it's safe to assume that multipathd has the correct features > > string cached, except in pathological situations. > > > > Am I overlooking something? Do you have evidence of errors occuring > > because multipathd has a wrong features string cached? > > Well, I did run into a situation where I couldn't disable queueing > because of mpp->features being out of date. I fixed that specific issue > in patch 1/7. But in a broader context, the kernel doesn't send out any > events when it gets a message to change the queueing mode in the > features string, so there is always a possibility that something outside > of multipathd changed it, and multipathd doesn't find out about it. > Normally, check_path() will find out and resolve the issue within > seconds. However, if the multipath device lost all its path devices, > then check_path() won't ever call set_no_path_retry() on it Err... It won't ever call update_multipath_strings(), which is what will sync the mpp->features string with the kernel and make set_no_path_retry() do the right thing. > and nothing > will resync it with the kernel state until an event comes in. > > Here's a (admittedly fairly improbable) sequence of events that would > get you into problems. > > - There is a multipath device with no paths that is currently in > recovery mode waiting to timeout. It's opened by something with > outstanding IO. > > - A user runs "multipath -f" to remove the device, and deligation to > multipathd times out so multipath attempts to locally remove the > device. It disables queueing and attempts the remove, which fails > because the device is still in use. > > - multipathd hits the retries timeout and disables queueing. > > - multipath restores queueing because the remove failed. > > - more IO comes into the multipath device. > > In this case, the multipath device will be stuck open due to queued > IO. The user could reasonably try to disable queueing to resolve this > issue. This will report success, but not actually do anything. because multipathd never syncs the features string with the kernel before calling set_no_path_retry(). > > Granted, there is an unavoidable race here. It's always possible that > after multipathd updates the features string, it gets changed outside of > its control. But this cuts the window down to something small instead > of anytime after you lost all your paths. Any if you get unlucky enough > to hit that window, you can just rerun the command and it will work. > > My main reason for wanting this is that cli_disable_queueing() (and > friends) is not a command you run all the time. It's a command to run > when you are cleaning up a mess. It doesn't have to be fast. It has to > work when things are stuck. > > Thoughts? > > -Ben > > > Regards > > Martin > > > > > > > > > > > > > > > --- > > > multipathd/cli_handlers.c | 12 ++++++++++++ > > > 1 file changed, 12 insertions(+) > > > > > > diff --git a/multipathd/cli_handlers.c b/multipathd/cli_handlers.c > > > index 117570e1..3dc5e690 100644 > > > --- a/multipathd/cli_handlers.c > > > +++ b/multipathd/cli_handlers.c > > > @@ -934,6 +934,8 @@ cli_restore_queueing(void *v, struct strbuf > > > *reply, void *data) > > > pthread_cleanup_push(put_multipath_config, conf); > > > select_no_path_retry(conf, mpp); > > > pthread_cleanup_pop(1); > > > + if (refresh_multipath(vecs, mpp)) > > > + return -ENODEV; > > > set_no_path_retry(mpp); > > > } else > > > condlog(2, "%s: queueing not disabled. Nothing to > > > do", mapname); > > > @@ -956,6 +958,10 @@ cli_restore_all_queueing(void *v, struct strbuf > > > *reply, void *data) > > > pthread_cleanup_push(put_multipath_config, > > > conf); > > > select_no_path_retry(conf, mpp); > > > pthread_cleanup_pop(1); > > > + if (refresh_multipath(vecs, mpp)) { > > > + i--; > > > + continue; > > > + } > > > set_no_path_retry(mpp); > > > } else > > > condlog(2, "%s: queueing not disabled. > > > Nothing to do", > > > @@ -983,6 +989,8 @@ cli_disable_queueing(void *v, struct strbuf > > > *reply, void *data) > > > mpp->retry_tick = 0; > > > mpp->no_path_retry = NO_PATH_RETRY_FAIL; > > > mpp->disable_queueing = 1; > > > + if (refresh_multipath(vecs, mpp)) > > > + return -ENODEV; > > > set_no_path_retry(mpp); > > > return 0; > > > } > > > @@ -1001,6 +1009,10 @@ cli_disable_all_queueing(void *v, struct > > > strbuf *reply, void *data) > > > mpp->retry_tick = 0; > > > mpp->no_path_retry = NO_PATH_RETRY_FAIL; > > > mpp->disable_queueing = 1; > > > + if (refresh_multipath(vecs, mpp)) { > > > + i--; > > > + continue; > > > + } > > > set_no_path_retry(mpp); > > > } > > > return 0;