On Thu, Jun 27, 2024 at 09:47:33AM +0200, Martin Wilck wrote: > On Wed, 2024-06-05 at 19:22 -0400, Benjamin Marzinski wrote: > > Make sure that all callers of set_no_path_retry() update the > > multipath->features string from the kernel before calling it, so that > > they have the current queueing state to check. > > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx> > > I am not sure about this one. > > refresh_multipath() is not a cheap operation, requiring at least 3 > device-mapper ioctls, and we hold the multipath lock while we call it, > meaning that multipathd can't take care of other important things while > it's handling this CLI request. > > OTOH, the kernel only changes the features string when the map is > reloaded. Nobody should be reloading the map under normal conditions > anyway, and if they do, we would see an uevent shortly afterwards. IOW, > I think it's safe to assume that multipathd has the correct features > string cached, except in pathological situations. > > Am I overlooking something? Do you have evidence of errors occuring > because multipathd has a wrong features string cached? Well, I did run into a situation where I couldn't disable queueing because of mpp->features being out of date. I fixed that specific issue in patch 1/7. But in a broader context, the kernel doesn't send out any events when it gets a message to change the queueing mode in the features string, so there is always a possibility that something outside of multipathd changed it, and multipathd doesn't find out about it. Normally, check_path() will find out and resolve the issue within seconds. However, if the multipath device lost all its path devices, then check_path() won't ever call set_no_path_retry() on it, and nothing will resync it with the kernel state until an event comes in. Here's a (admittedly fairly improbable) sequence of events that would get you into problems. - There is a multipath device with no paths that is currently in recovery mode waiting to timeout. It's opened by something with outstanding IO. - A user runs "multipath -f" to remove the device, and deligation to multipathd times out so multipath attempts to locally remove the device. It disables queueing and attempts the remove, which fails because the device is still in use. - multipathd hits the retries timeout and disables queueing. - multipath restores queueing because the remove failed. - more IO comes into the multipath device. In this case, the multipath device will be stuck open due to queued IO. The user could reasonably try to disable queueing to resolve this issue. This will report success, but not actually do anything. Granted, there is an unavoidable race here. It's always possible that after multipathd updates the features string, it gets changed outside of its control. But this cuts the window down to something small instead of anytime after you lost all your paths. Any if you get unlucky enough to hit that window, you can just rerun the command and it will work. My main reason for wanting this is that cli_disable_queueing() (and friends) is not a command you run all the time. It's a command to run when you are cleaning up a mess. It doesn't have to be fast. It has to work when things are stuck. Thoughts? -Ben > Regards > Martin > > > > > > > > --- > > multipathd/cli_handlers.c | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/multipathd/cli_handlers.c b/multipathd/cli_handlers.c > > index 117570e1..3dc5e690 100644 > > --- a/multipathd/cli_handlers.c > > +++ b/multipathd/cli_handlers.c > > @@ -934,6 +934,8 @@ cli_restore_queueing(void *v, struct strbuf > > *reply, void *data) > > pthread_cleanup_push(put_multipath_config, conf); > > select_no_path_retry(conf, mpp); > > pthread_cleanup_pop(1); > > + if (refresh_multipath(vecs, mpp)) > > + return -ENODEV; > > set_no_path_retry(mpp); > > } else > > condlog(2, "%s: queueing not disabled. Nothing to > > do", mapname); > > @@ -956,6 +958,10 @@ cli_restore_all_queueing(void *v, struct strbuf > > *reply, void *data) > > pthread_cleanup_push(put_multipath_config, > > conf); > > select_no_path_retry(conf, mpp); > > pthread_cleanup_pop(1); > > + if (refresh_multipath(vecs, mpp)) { > > + i--; > > + continue; > > + } > > set_no_path_retry(mpp); > > } else > > condlog(2, "%s: queueing not disabled. > > Nothing to do", > > @@ -983,6 +989,8 @@ cli_disable_queueing(void *v, struct strbuf > > *reply, void *data) > > mpp->retry_tick = 0; > > mpp->no_path_retry = NO_PATH_RETRY_FAIL; > > mpp->disable_queueing = 1; > > + if (refresh_multipath(vecs, mpp)) > > + return -ENODEV; > > set_no_path_retry(mpp); > > return 0; > > } > > @@ -1001,6 +1009,10 @@ cli_disable_all_queueing(void *v, struct > > strbuf *reply, void *data) > > mpp->retry_tick = 0; > > mpp->no_path_retry = NO_PATH_RETRY_FAIL; > > mpp->disable_queueing = 1; > > + if (refresh_multipath(vecs, mpp)) { > > + i--; > > + continue; > > + } > > set_no_path_retry(mpp); > > } > > return 0;