On Thu, Sep 05, 2024 at 09:53:52AM +0200, Martin Wilck wrote:
> On Wed, 2024-09-04 at 17:17 -0400, Benjamin Marzinski wrote:
> > On Wed, Sep 04, 2024 at 10:05:30PM +0200, Martin Wilck wrote:
> > > On Wed, 2024-08-28 at 18:17 -0400, Benjamin Marzinski wrote:

<snip>

> > This way, you just call checker_get_state() on all the paths where
> > you ran the path checker, and if at least one of them needs to wait,
> > then the first one that does will wait, and the rest won't. If none
> > of them need to wait, then none of them will.
>
> This makes sense. But the "wait" you're talking about is the 1ms delay,
> after which the paths may well still be in PATH_PENDING. It's actually
> not much different from not waiting at all; we've just decreased the
> likelihood of PATH_PENDING. 1ms is just an estimate, and, in my
> experience, too short in many environments.
>
> The real benefit of your patch set is that we now start all (async)
> checkers in parallel. That means we can wait longer, increasing the
> likelihood that the checkers have actually finished, while spending
> less total time waiting. "Waiting longer" is a bit hand-waving; IMO it
> should be handled further up in the stack rather than down in the
> checker code.
>
> We could achieve the behavior you describe in checkerloop, like this:
>
> - record start_timestamp
> - fire off all handlers
> - calculate end_time as start_timestamp + some_delay
> - use that end_time for waiting for results of all fired-off checkers
>
> Like in your example, we'd really wait for no more than 1 path.
> We could do this for the entire set of paths, or for each multipath
> map separately.

Actually, now that I think about it, the biggest benefit of doing the
wait in checkerloop itself is that the code already handles the case
where we drop the vecs lock while there are paths that have started the
checker but not yet updated the device based on the results. Given
that, we can just do the wait for pending paths without holding the
vecs lock at all.

It's sort of embarrassing that I did all this work to cut down on the
time we spend waiting but didn't realize that, with just a bit more
work, we could avoid sleeping while holding the lock altogether.

I already did some testing before to make sure that things like
reconfigures, adding/removing paths, and manually failing/restoring
paths didn't mess things up if they happened in the window where we
drop the lock in the middle of checking the paths. I'll probably need
to run some more tests and recheck that code if it's going to be a
non-rare occurrence for other threads to run in the middle of checking
paths. But this seems like too big a benefit to pass up.

-Ben

> Regards
> Martin
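
For illustration, here is a stand-alone sketch of the "one deadline for
the whole set of checkers" timing Martin describes above. This is not
multipathd code: the path checker is simulated with a worker thread,
and all names (fake_checker, start_checker, get_state, SOME_DELAY_MS)
are placeholders. The point is only the timing pattern: start all
checkers first, compute a single absolute end_time, and pass that same
deadline to every wait, so the total time spent waiting is bounded by
some_delay no matter how many paths are still pending.

/*
 * Sketch only, not multipathd code.  Simulates async path checkers
 * with threads to show the shared-deadline wait pattern.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NR_PATHS 4
#define SOME_DELAY_MS 20	/* total wait budget for the whole set */

struct fake_checker {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	pthread_t thread;
	bool done;
	int state;		/* 0 = "pending", 1 = "up" (pretend) */
};

static void *checker_thread(void *arg)
{
	struct fake_checker *c = arg;

	usleep(5000);		/* pretend the path check takes 5ms */
	pthread_mutex_lock(&c->lock);
	c->state = 1;
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
	return NULL;
}

static void start_checker(struct fake_checker *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
	c->state = 0;
	pthread_create(&c->thread, NULL, checker_thread, c);
}

/* Wait for this checker's result, but never past *end_time. */
static int get_state(struct fake_checker *c, const struct timespec *end_time)
{
	int state;

	pthread_mutex_lock(&c->lock);
	while (!c->done &&
	       pthread_cond_timedwait(&c->cond, &c->lock, end_time) == 0)
		;	/* loop on spurious wakeups until done or timed out */
	state = c->state;
	pthread_mutex_unlock(&c->lock);
	return state;
}

int main(void)
{
	struct fake_checker checkers[NR_PATHS];
	struct timespec end_time;
	int i;

	/* fire off all checkers first ... */
	for (i = 0; i < NR_PATHS; i++)
		start_checker(&checkers[i]);

	/* ... then compute one absolute deadline for the whole set */
	clock_gettime(CLOCK_REALTIME, &end_time);
	end_time.tv_nsec += SOME_DELAY_MS * 1000000L;
	if (end_time.tv_nsec >= 1000000000L) {
		end_time.tv_sec++;
		end_time.tv_nsec -= 1000000000L;
	}

	/*
	 * Every wait shares the same deadline: if the first pending path
	 * uses up the budget, the remaining calls return immediately.
	 */
	for (i = 0; i < NR_PATHS; i++)
		printf("path %d: %s\n", i,
		       get_state(&checkers[i], &end_time) ? "up" : "pending");

	for (i = 0; i < NR_PATHS; i++)
		pthread_join(checkers[i].thread, NULL);
	return 0;
}

Build with something like "cc -pthread sketch.c". With SOME_DELAY_MS
larger than the simulated 5ms check, all paths report "up"; shrink it
below 5ms and the first wait consumes the budget while the later calls
return "pending" without blocking, which is the behavior being argued
for above.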