Re: [PATCH] multipathd: Make sure to disable queueing if recovery has failed.

On Tue, 2023-12-05 at 14:28 -0500, Benjamin Marzinski wrote:
> On Tue, Dec 05, 2023 at 03:57:13PM +0100, Martin Wilck wrote:
> > 
> > You are right, I see. It's actually an inconsistency that I would
> > like to get rid of. I suggest that we ditch mpp->features entirely,
> > and just keep track of the map's features using an abstract
> > representation based on booleans and ints. We should never apply
> > kernel status changes that are inconsistent with our own
> > representation of the map's features, IMO.
> 
> I don't agree with ditching it entirely. The existence of
> mpp->features allows people to use a kernel feature with a version of
> the userspace tools that doesn't know about it. This means that if a
> new feature is added to the kernel, the existing multipath userspace
> tools can be used to test it.
> 
> We need to provide users the option of setting features, and for
> backwards compatibility, we need to accept strings in the features
> line that we also have other options to set. And we still need to
> build up a features line to pass into the kernel. So we will still
> need the code to reconcile the features line with other options. I'm
> not sure we would gain much by converting it to a set of variables
> and then back again.

Fair enough, but consistency is very important. Let's work towards
never setting anything in the kernel that doesn't match our notion of
the desired state of a map.

> The one case where we really need to track our current value
> separately from the features line is queue_if_no_path, and it does
> make sense to track that as a variable, so we can easily update it
> when we make changes to it. But even there, the current value is not
> necessarily the value we want to put in the features line we pass to
> the kernel. If we aren't queueing, and then add a working path, when
> we reload the device we want to use the configured no_path_retry
> value, and not our current queueing state.
> 
> One thing that would be nice would be tweaking the queue_if_no_path
> value in the features string that we pass to the kernel based on the
> current queueing state and the current path states. If we are
> reloading a device, and we know that all the paths in that device are
> down, and we currently aren't queueing, we shouldn't pass in a
> features string that tells the kernel to enable queueing, which is
> what we currently do.

Agreed.
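Something along these lines, perhaps (a simplified sketch; the
function name and the no_path_retry encoding are invented for
illustration, with -1 meaning "fail" and 0 meaning "queue forever"):

```c
#include <stdbool.h>

/*
 * Illustrative only; not the real multipathd values. Convention
 * assumed here: -1 = "fail" (never queue), 0 = "queue" forever,
 * N > 0 = queue for N checker intervals, then fail.
 */
#define NO_PATH_RETRY_FAIL  (-1)
#define NO_PATH_RETRY_QUEUE 0

static bool want_queue_on_reload(int no_path_retry, int active_paths,
				 bool currently_queueing)
{
	if (no_path_retry == NO_PATH_RETRY_FAIL)
		return false;
	if (no_path_retry == NO_PATH_RETRY_QUEUE)
		return true;
	/*
	 * Bounded retries: with a working path, (re)start from the
	 * configured value, so queue. With all paths down and queueing
	 * already given up, don't re-enable it behind our back.
	 */
	if (active_paths > 0)
		return true;
	return currently_queueing;
}
```

That would capture both your reload case (all paths down, not
queueing) and the configured-value case after adding a working path.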

> The patch I posted, which started this conversation, exists because
> we don't do this, and we need to fix the queueing state afterwards.
>  
> > Along a similar line of reasoning, let me emphasize again why I
> > would like to have the logic for enabling or disabling queueing in
> > one place. Your previous post nicely distinguished the various
> > cases and call chains which change the queueing and recovery mode,
> > but it would be better if that kind of assessment wasn't necessary,
> > and instead it was possible to inspect the code in just a block of
> > (say) 50 code lines that would decide whether or not a given map
> > should be queueing and/or in recovery mode.
> > 
> > It's an unfortunate property of the multipath-tools code base that
> > important logic is dispersed over multiple functions and source
> > files, making a correct assessment of the code flow difficult and
> > cumbersome. It can't be completely avoided given the complexity of
> > the subject, but it would be nice to try.
> > 
> > I'm looking forward to your new patch set ;-)
> 
> Unfortunately, my new patchset doesn't change the first patch. It
> just builds on it to clean up how the interactive client commands
> call things, to reduce the callers of some of our functions, and to
> keep the client commands from doing more work than they should.
> After the changes, the only things calling set_no_path_retry() are
> functions that are updating the multipath device due to external
> events, and the check_path code.
> 
> The update_queue_mode_{add,del}_path() functions should be (and
> based on my code examination, are) able to assume a device in the
> correct queueing and recovery state, and just do the work of
> entering or leaving recovery mode, based on a change in the number
> of active paths.

OK. Perhaps not what I was hoping for, but better than what we have
now. Maybe you can add a few comment lines to explain the various
cases, lest your analysis be forgotten.
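As a toy model of the add/del transitions you describe (field and
function names are invented here, not the actual multipathd code):

```c
#include <stdbool.h>

/* Hypothetical model of a map's recovery bookkeeping. */
struct map_state {
	int active_paths;
	bool in_recovery;	/* counting down no_path_retry ticks */
	int retry_ticks_left;
	int no_path_retry;	/* > 0: bounded queueing */
};

/* Last active path gone: enter recovery if retries are bounded. */
static void on_path_removed(struct map_state *m)
{
	if (m->active_paths > 0 && --m->active_paths == 0 &&
	    m->no_path_retry > 0) {
		m->in_recovery = true;
		m->retry_ticks_left = m->no_path_retry;
	}
}

/* First path back: leave recovery and reset the countdown. */
static void on_path_added(struct map_state *m)
{
	if (m->active_paths++ == 0) {
		m->in_recovery = false;
		m->retry_ticks_left = 0;
	}
}
```

Even a comment block shaped like this model, above the real
functions, would preserve the analysis.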


> If we move the code that checks whether our queueing state is wrong
> out of set_no_path_retry() into update_queue_mode_{add,del}_path(),
> then we add a requirement that callers of those functions refresh
> the current queueing state. Otherwise we can't actually guarantee
> that they do the right thing. With the current callers, this doesn't
> really matter, because we don't need them to fix the queueing state.
> But why not keep the code that fixes the queueing state in a
> function that's called when it could be wrong, and that is also
> (after my new patchset) only called after we have refreshed the
> state, so it will do the right thing?
> 
> So, I would prefer if we go with my first patch, as is. I'm open to
> the idea of moving this code into update_queue_mode_{add,del}_path()
> as part of a future patchset where we switch to tracking the current
> queueing state outside of mpp->features, and updating it whenever we
> actually enable or disable queueing. I can work on this if you want.

I'm fine with leaving your first patch as is, perhaps with some
comments. Anyway, please resubmit it as part of your new set.

Regards
Martin





