Re: [PATCH 2/3] libmultipath: change flush_on_last_del to fix a multipathd hang

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Apr 24, 2024 at 04:42:04PM -0400, Benjamin Marzinski wrote:
> On Wed, Apr 24, 2024 at 06:34:35PM +0200, Martin Wilck wrote:
> > On Tue, 2024-04-23 at 18:25 -0400, Benjamin Marzinski wrote:

<snip>

> > > +
> > > +	/* It's not safe to do a non-deferred remove of a map that
> > > has
> > > +	 * "queue_if_no_path" set, since there could be outstanding
> > > IO which
> > > +	 * will cause multipathd to hang while attempting the remove
> > > */
> > > +	if (mpp->flush_on_last_del == FLUSH_NEVER &&
> > > +	    !do_deferred(deferred_remove) && is_queueing) {
> > > +		condlog(2, "%s: map is queueing, can't remove", mpp-
> > > >alias);
> > > +		return false;
> > > +	}
> > 
> > Just a thought: Why don't we just pretend that deferred_remove is on if
> > flush_on_last_del is set to "never"? Wouldn't that avoid our issues?
> > In the same context, why haven't we made deferred_remove the default
> > yet?
> 
> I'm not sure we want to change the default multipath behavior without a
> good reason.  Users don't expect that removing a multipath device can
> succeed, but the device will still be there.
> 
> Worse, I just tested and found out that my test will still hang
> multipathd with my new code if you have deferred_remove turned on.  This
> is because there is no opener of the kpartx device, so device-mapper
> tries to immediately remove it. Apparently, it's also not safe to
> trigger a deferred remove without disabling queueing.

Thinking about this a bit more, this actually makes deferred_remove a
lot less useful. For devices with no_path_retry set to something other
than fail, deferred_remove now only makes sense when paired with:

flush_on_last_del "always"

That will fail the IO of whatever has the multipath device open so it
can close the device, at which point the deferred remove will happen. So
it's not useless, but it won't make sure that currently queueing devices
eventually get closed.

If we really wanted to keep that behavior, when we disable queueing on a
device, we could check if there are no path devices and do the deffered
remove then. Like I mentioned in my last email, we could even attempt
non-deferred removes then. But unless a lot more people than I expect
are using deferred_remove and complain about it working worse, it's
probably best to just make it safe and less useful. 
 
> > > +	if (mpp->flush_on_last_del == FLUSH_UNUSED &&
> > > +	    !do_deferred(deferred_remove) &&
> > > +            (in_use = partmap_in_use(mpp->alias, NULL)) &&
> > > is_queueing) {
> > > +		condlog(2, "%s: map in use and queueing, can't
> > > remove",
> > > +			mpp->alias);
> > > +		return false;
> > > +	}
> > 
> > Same reasoning as above, why not just defer?
> > 
> > >  	/*
> > > -	 * flush_map will fail if the device is open
> > > +	 * This will flush FLUSH_NEVER devices and FLUSH_UNUSED
> > > devices
> > > +	 * that are in use, but only if this is not a deferred
> > > remove, and
> > 
> > "that are not in use" ?
> 
> This *will* disable queue_if_no_path on in-use devices. But like I said,
> only if, according to mulitpathd, the device is already set to
> queue_if_no_path.  This is just to make sure the mpp->features isn't out
> of date, since getting this wrong and hanging multipathd is a bad thing.
> Alternatively, I could call update_multipath_table() to make sure
> mpp->features is up to date. This way does less work, and if multipathd
> thinks that the device shouldn't be queueing, then it really shouldn't
> be queueing.

Oh, I see. My comment was correct, but my code wasn't.
 
> > > +	 * they are already marked as not queueing. That is just to
> > > make
> > > +	 * absolutely certain that they really are not queueing like
> > > they
> > > +	 * claim.
> > >  	 */
> > > -	if (mpp->flush_on_last_del == FLUSH_ENABLED) {
> > > -		condlog(2, "%s Last path deleted, disabling
> > > queueing",
> > > +	if (mpp->flush_on_last_del == FLUSH_ALWAYS ||
> > > +	    !do_deferred(deferred_remove) || !in_use) {
> > > +		condlog(is_queueing ? 2 : 3,
> > > +			"%s Last path deleted, disabling queueing",
> > >  			mpp->alias);
> > >  		mpp->retry_tick = 0;
> > >  		mpp->no_path_retry = NO_PATH_RETRY_FAIL;
> > > @@ -597,7 +627,7 @@ flush_map_nopaths(struct multipath *mpp, struct
> > > vectors *vecs) {
> > >  		mpp->stat_map_failures++;
> > >  		dm_queue_if_no_path(mpp, 0);
> > >  	}
> > > -	r = dm_flush_map_nopaths(mpp->alias, mpp->deferred_remove);
> > > +	r = dm_flush_map_nopaths(mpp->alias, deferred_remove);
> > 
> > AFAIU, this line should only be executed if either disabling queueing
> > above succeeded, or do_deferred(deferred_remove) is true.
> > This code doesn't guarantee that, or am I overlooking something?
> 
> Oops. Good catch.
>  
> > Regards,
> > Martin
> > 
> > 
> > >  	if (r != DM_FLUSH_OK) {
> > >  		if (r == DM_FLUSH_DEFERRED) {
> > >  			condlog(2, "%s: devmap deferred remove",
> > > mpp->alias);
> 





[Index of Archives]     [DM Crypt]     [Fedora Desktop]     [ATA RAID]     [Fedora Marketing]     [Fedora Packaging]     [Fedora SELinux]     [Yosemite Discussion]     [KDE Users]     [Fedora Docs]

  Powered by Linux