Re: [PATCH 03/13] multipathd: allow map removal in do_sync_mpp()

Benjamin Marzinski <bmarzins@xxxxxxxxxx> · Wed, 11 Dec 2024 12:09:38 -0500

On Wed, Dec 11, 2024 at 01:06:46PM +0100, Martin Wilck wrote:
> On Tue, 2024-12-10 at 18:30 -0500, Benjamin Marzinski wrote:
> > On Sat, Dec 07, 2024 at 12:36:07AM +0100, Martin Wilck wrote:
> > > We previously didn't allow map removal inside the checker loop. But
> > > with the late updates to the checkerloop code, it should be safe to
> > > orphan
> > > paths and delete maps even in this situation. We remove such maps
> > > everywhere
> > > else in the code already, whenever refresh_multipath() or
> > > setup_multipath()
> > > is called.
> > 
> > Actually, thinking about this more, what do we get by proactively
> > deleting the multipath device if something goes wrong in the checker?
> > If
> > we successfully reload a device, but can't sync it with the kernel,
> > that's one thing, But that was triggered by a change in the device,
> > and
> > we know that when we reloaded the device, device-mapper was working.
> > I'm
> > leery of possibly deleting the map because of a transient device-
> > mapper
> > issue.  I'm not sure if on a check that we do repeatedly, we should
> > delete the device on an error.  We haven't in the past, and as far as
> > I
> > know, it doesn't cause problems.  
> 
> I don't disagree. But the same can be said for basically all call
> chains where setup_multipath() is called for an existing map. I was
> just following the pattern that we use e.g. in ev_add_path(), or in
> update_mpp_prio(). Why would we treat the checker and path addition
> differently in this respect?

I'm confused here. ev_add_path() doesn't remove the device if the reload
fails. If a reload fails, the table should stay the same. That's why I
said that in other cases where we delete the device, we know that when
we just reloaded the device, device-mapper was working. Looking at the
code, that isn't really true. After failed reloads, we still call
setup_multipath to update our state, and we will delete the device if
that fails.

> If we look at this pragmatically (assuming that multipathd gets the
> parameters right), the most probable reason for a map reload failure is
> failure to open a path device in bdev_open(), either because the device
> doesn't exist, or because it's busy or otherwise unavailable. If this
> happens in ev_add_path(), the likely reason is that the path just added
> was busy, and the smartest action upon such a failure would probably be
> to just undo that addition. We currently don't do that; we remove the
> entire map, which is questionable, as you state correctly.

This is why we call setup_multipath after failed reloads, to make sure
multipathd's view of the multipath device resyncs with the kernel's,
which hasn't changed from what it was before the reload failed.

> In the checker, this can't happen. Obviously, no other process can grab
> a path device while the device mapper is holding it, so -EBUSY won't
> occur if we reload an existing map. Even device deletion doesn't cause
> failure on reload. It is possible to delete a SCSI device while it's
> mapped, and execute a table reload / suspend / resume cycle on the map
> while referencing the deleted device. The kernel keeps holding the
> reference to the deleted device, and will simply mark it as
> failed. This holds also if the mapped paths are re-grouped or re-
> ordered in the table. Failure occurs only if we temporarily remove the
> device from the map and re-add it, because as soon as the device is
> removed from the map's dm table, its refcount drops to zero, and it's
> gone for good.
> 
> IOW, reloading a map with a table containing only already-mapped
> devices will never fail, except in extreme situations like kernel OOM.

Maybe I should clarify my position a bit. I am fine with reloading the
device in the checkerloop if something has changed. This obviously
does run a very small risk of something going wrong and a device getting
removed unnecessarily, but we know that we need to reload the device, so
we should.

What I would rather avoid is reloading the device because we failed to
get it's state in do_sync_mpp(). We don't do this because we know that
something has changed. We do this as a safety measure to deal with
corner cases where our state doesn't match the kernel's and we didn't
get an event. Double checking this each time we check a path in a
device saves having to catch all these corner cases elsewhere. But it's
almost always completely unnecessary, and we're doing it on every
multipath device every couple of seconds, unlike reloading a device,
which is rare.

> Thus, AFAICS, the only relevant scenario where a reloading would fail
> is trying to add a path device that was not previously mapped, and
> that's either busy (perhaps in another map) or has been deleted, IOW
> only when we reload after calling adopt_paths(). This is where we could
> improve. If we fail to reload after adopting new paths, we could fall
> back to the existing table, and perhaps try to add paths one by one.
> Again, this is post-0.11 material.
> 
> OTOH, practially impossible is not totally impossible, so we need to be
> prepared to map reload failure either way. IMO the best thing we can do
> in this case is to keep using the kernel's map, and retry reloading
> later. 

I'm not actually worried about the kernel so much as libdevmapper. It is
not designed for multi-threaded processes, and that has bitten us in the
past. For intance, it's why we don't delete devices in dmevent_loop() on
libdevmapper errors. dm_get_events() just waits and retries if getting
the device list fails, and for each device, it calls dm_is_mpath and
will only delete a device on DM_IS_MPATH_NO, which is what I suggested
for the cleanup function.

I'm pretty sure we've handled all of the known issues here, with fixes
like:
02d4bf07 ("libmultipath: protect racy libdevmapper calls with a mutex")
34e01d2f ("multipath-tools: don't call dm_lib_release() any more")

I'd rather not risk having missed some issue that could cause a
temporary error in a function that we call every couple of seconds
(almost always unnecessarily).

-Ben

> The only critical situation is WWID change of path devices. We must try
> to fix this situation ASAP when we detect it. I'm unsure what the best
> action is if a reload fails in that situation, though (other than
> failing the path, as we already do).
> 
> Martin