Re: [PATCH 3/7] multipathd: refresh multipath before calling set_no_path_retry()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jun 27, 2024 at 02:33:00PM -0400, Benjamin Marzinski wrote:
> On Thu, Jun 27, 2024 at 09:47:33AM +0200, Martin Wilck wrote:
> > On Wed, 2024-06-05 at 19:22 -0400, Benjamin Marzinski wrote:
> > > Make sure that all callers of set_no_path_retry() update the
> > > multipath->features string from the kernel before calling it, so that
> > > they have the current queueing state to check.
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > 
> > I am not sure about this one. 
> > 
> > refresh_multipath() is not a cheap operation, requiring at least 3
> > device-mapper ioctls, and we hold the multipath lock while we call it,
> > meaning that multipathd can't take care of other important things while
> > it's handling this CLI request.
> > 
> > OTOH, the kernel only changes the features string when the map is
> > reloaded. Nobody should be reloading the map under normal conditions
> > anyway, and if they do, we would see an uevent shortly afterwards. IOW,
> > I think it's safe to assume that multipathd has the correct features
> > string cached, except in pathological situations.
> > 
> > Am I overlooking something? Do you have evidence of errors occuring
> > because multipathd has a wrong features string cached?
> 
> Well, I did run into a situation where I couldn't disable queueing
> because of mpp->features being out of date. I fixed that specific issue
> in patch 1/7. But in a broader context, the kernel doesn't send out any
> events when it gets a message to change the queueing mode in the
> features string, so there is always a possibility that something outside
> of multipathd changed it, and multipathd doesn't find out about it.
> Normally, check_path() will find out and resolve the issue within
> seconds. However, if the multipath device lost all its path devices,
> then check_path() won't ever call set_no_path_retry() on it

Err... It won't ever call update_multipath_strings(), which is
what will sync the mpp->features string with the kernel and make
set_no_path_retry() do the right thing.

> and nothing
> will resync it with the kernel state until an event comes in.
> 
> Here's a (admittedly fairly improbable) sequence of events that would
> get you into problems.
> 
> - There is a multipath device with no paths that is currently in
> recovery mode waiting to timeout. It's opened by something with
> outstanding IO.
> 
> - A user runs "multipath -f" to remove the device, and deligation to
> multipathd times out so multipath attempts to locally remove the
> device. It disables queueing and attempts the remove, which fails
> because the device is still in use.
> 
> - multipathd hits the retries timeout and disables queueing.
> 
> - multipath restores queueing because the remove failed.
> 
> - more IO comes into the multipath device.
> 
> In this case, the multipath device will be stuck open due to queued
> IO. The user could reasonably try to disable queueing to resolve this
> issue. This will report success, but not actually do anything.
because multipathd never syncs the features string with the kernel
before calling set_no_path_retry().

> 
> Granted, there is an unavoidable race here. It's always possible that
> after multipathd updates the features string, it gets changed outside of
> its control. But this cuts the window down to something small instead
> of anytime after you lost all your paths. Any if you get unlucky enough
> to hit that window, you can just rerun the command and it will work.
> 
> My main reason for wanting this is that cli_disable_queueing() (and
> friends) is not a command you run all the time. It's a command to run
> when you are cleaning up a mess. It doesn't have to be fast. It has to
> work when things are stuck.
> 
> Thoughts?
> 
> -Ben
> 
> > Regards
> > Martin
> > 
> > 
> > 
> > 
> > 
> > 
> > > ---
> > >  multipathd/cli_handlers.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/multipathd/cli_handlers.c b/multipathd/cli_handlers.c
> > > index 117570e1..3dc5e690 100644
> > > --- a/multipathd/cli_handlers.c
> > > +++ b/multipathd/cli_handlers.c
> > > @@ -934,6 +934,8 @@ cli_restore_queueing(void *v, struct strbuf
> > > *reply, void *data)
> > >  		pthread_cleanup_push(put_multipath_config, conf);
> > >  		select_no_path_retry(conf, mpp);
> > >  		pthread_cleanup_pop(1);
> > > +		if (refresh_multipath(vecs, mpp))
> > > +			return -ENODEV;
> > >  		set_no_path_retry(mpp);
> > >  	} else
> > >  		condlog(2, "%s: queueing not disabled. Nothing to
> > > do", mapname);
> > > @@ -956,6 +958,10 @@ cli_restore_all_queueing(void *v, struct strbuf
> > > *reply, void *data)
> > >  			pthread_cleanup_push(put_multipath_config,
> > > conf);
> > >  			select_no_path_retry(conf, mpp);
> > >  			pthread_cleanup_pop(1);
> > > +			if (refresh_multipath(vecs, mpp)) {
> > > +				i--;
> > > +				continue;
> > > +			}
> > >  			set_no_path_retry(mpp);
> > >  		} else
> > >  			condlog(2, "%s: queueing not disabled.
> > > Nothing to do",
> > > @@ -983,6 +989,8 @@ cli_disable_queueing(void *v, struct strbuf
> > > *reply, void *data)
> > >  	mpp->retry_tick = 0;
> > >  	mpp->no_path_retry = NO_PATH_RETRY_FAIL;
> > >  	mpp->disable_queueing = 1;
> > > +	if (refresh_multipath(vecs, mpp))
> > > +		return -ENODEV;
> > >  	set_no_path_retry(mpp);
> > >  	return 0;
> > >  }
> > > @@ -1001,6 +1009,10 @@ cli_disable_all_queueing(void *v, struct
> > > strbuf *reply, void *data)
> > >  		mpp->retry_tick = 0;
> > >  		mpp->no_path_retry = NO_PATH_RETRY_FAIL;
> > >  		mpp->disable_queueing = 1;
> > > +		if (refresh_multipath(vecs, mpp)) {
> > > +			i--;
> > > +			continue;
> > > +		}
> > >  		set_no_path_retry(mpp);
> > >  	}
> > >  	return 0;





[Index of Archives]     [DM Crypt]     [Fedora Desktop]     [ATA RAID]     [Fedora Marketing]     [Fedora Packaging]     [Fedora SELinux]     [Yosemite Discussion]     [KDE Users]     [Fedora Docs]

  Powered by Linux