Re: [PATCH] multipath.rules: fix "smart" bug with failed valid path check

Benjamin Marzinski <bmarzins@xxxxxxxxxx> · Wed, 8 Feb 2023 09:52:36 -0600

On Wed, Feb 08, 2023 at 07:35:36AM +0000, Martin Wilck wrote:
> Hello Ben,
> 
> On Tue, 2023-02-07 at 18:32 -0600, Benjamin Marzinski wrote:
> > If "multipath -u" fails, udev doesn't import any values from the
> > program. This means that multipath.rules will continue to use the
> > values
> > for DM_MULTIPATH_DEVICE_PATH and FIND_MULTIPATHS_WAIT_UNTIL that it
> > has
> > already imported from the database. This is the correct thing to do
> > for
> > every case except the MAYBE case for "find_multipaths smart". In that
> > case, DM_MULTIPATH_DEVICE_PATH will be set to 1, and the rules will
> > assume that the device has been definitively claimed.
> > 
> > In this case, we know that the device shouldn't have been claimed
> > before, but we don't know if it should be claimed now, or if we have
> > hit
> > the timeout and it should be released, since we didn't get any
> > information from multipath. The safest thing to do is assume that
> > this
> > was the timeout, and the device shouldn't be claimed. The only time
> > when
> > this could be the wrong answer is when we first see a new multipath
> > device, and it could only cause problems if there is metadata on the
> > device that will cause it to get autoassembled by something else,
> > before
> > multipathd can autoassemble it. If we assume that it is a multipath
> > device, or we assume that this wasn't actually the timeout uevent, we
> > can keep a necessary device from getting released to the reset of the
> > system.
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> 
> We have this code in check_path_valid():
> 
> 	/*
> 	 * multipath -u must exit with status 0, otherwise udev won't
> 	 * import its output.
> 	 */
> 	if (!is_uevent && r == PATH_IS_NOT_VALID)
> 		r = RTVL_FAIL;
> 	else
> 		r = RTVL_OK;
> 
> 
> AFAICS about the only condition I can imagine in which "multipath -u"
> would return an exit status other than 0 would be failure of
> init_config(). If that's the case you're concerned about, the issue
> might be easier to fix in main.c directly. 

I'm tracking down a bug where it appears that either "multpath -u"
failed and caused this issue or incorrectly returned
DM_MULTIPATH_DEVICE_PATH=1. I'm still not sure what exactly happened,
but when going over the code I noticed this issue was possible and it
doesn't seem like it should be. I don't think it's very likely that this
is the issue that occurred in the bug I'm looking into, but I do think
that this is possible for errors other than a config problem.  First,
any memory issue would cause this. Obviously, there's a good chance
that the whole system has bigger issues in that case. Second, we do rely
on the udev library to not fail in a number of places, notably
path_discovery(). Third, we rely on filesystem code in places like
is_failed_wwid(). None of these issue individually bothers me, but since
we can protect the system from an unexpected failure here, it seems
reasonable to do so. Obviously we can solve all of them except a memory
failure or an init_config() issue by making sure check_path_valid() never
returns failure in the uevent case, but it seems more reasonable to me
to actually fail, and have udev handle that. But like I said, I don't
think I've actually seen this issue in practice.

I don't think the udev changes are that intrusive, they only happen in
case where you are running with "find_multipaths smart" and getting a
uevent for a MAYBE path, and aside from changing

ENV{FIND_MULTIPATHS_WAIT_CANCELLED}!="?*", \
        ENV{FIND_MULTIPATHS_WAIT_UNTIL}=="?*", \
        ENV{FIND_MULTIPATHS_WAIT_UNTIL}!="0", \
        ENV{FIND_MULTIPATHS_WAIT_CANCELLED}="1", \
        RUN+="/usr/bin/systemctl stop cancel-multipath-wait-$kernel.timer"

Into:

ENV{FIND_MULTIPATHS_WAIT_CANCELLED}=="?*", GOTO="end_mpath"
ENV{FIND_MULTIPATHS_WAIT_UNTIL}!="?*", GOTO="end_mpath"
ENV{FIND_MULTIPATHS_WAIT_UNTIL}=="0", GOTO="end_mpath"

ENV{FIND_MULTIPATHS_WAIT_CANCELLED}="1"
RUN+="/usr/bin/systemctl stop cancel-multipath-wait-$kernel.timer"

which is functionally the same, the only change is to check if
"multipath -u" failed, and if so, set DM_MULTIPATH_DEVICE_PATH=0,
instead of what we are currently doing implicitly with the

IMPORT{db}="DM_MULTIPATH_DEVICE_PATH"

which is to set DM_MULTIPATH_DEVICE_PATH=1 if "multipath -u" failed.

Unless it turns out that my bug was acutally caused by this, I'm not
super invested in these changes, but I do think they are a reasonable
way to catch an unlikely issue, without having to design the multipath
code around the possibility.

-Ben

> 
> Regards,
> Martin
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel