On Thu, Sep 05, 2024 at 09:53:52AM +0200, Martin Wilck wrote:
> On Wed, 2024-09-04 at 17:17 -0400, Benjamin Marzinski wrote:
> > On Wed, Sep 04, 2024 at 10:05:30PM +0200, Martin Wilck wrote:
> > > On Wed, 2024-08-28 at 18:17 -0400, Benjamin Marzinski wrote:

<snip>

> > This way, you just call checker_get_state() on all the paths where
> > you ran the path checker, and if at least one of them needs to wait,
> > then the first one that does will wait, and the rest won't. If none
> > of them need to wait, then none of them will.
>
> This makes sense. But the "wait" you're talking about is the 1ms delay,
> after which the paths may well still be in PATH_PENDING. It's actually
> not much different from not waiting at all; we've just decreased the
> likelihood of PATH_PENDING. 1ms is just an estimate, and, in my
> experience, too short in many environments.
>
> The real benefit of your patch set is that we now start all (async)
> checkers in parallel. That means we can wait longer, increasing the
> likelihood that the checkers have actually finished, while spending
> less total time waiting. "Waiting longer" is a bit hand-waving; IMO it
> should be handled further up in the stack rather than down in the
> checker code.
>
> We could achieve the behavior you describe in checkerloop, like this:
>
> - record start_timestamp
> - fire off all handlers
> - calculate end_time as start_timestamp + some_delay
> - use that end_time for waiting for results of all fired-off checkers
>
> Like in your example, we'd really wait for no more than 1 path.
> We could do this for the entire set of paths, or for each multipath
> map separately.

Actually, now that I think about it, the biggest benefit of doing the
wait in checkerloop itself is that the code already handles the case
where we drop the vecs lock while there are paths that have started the
checker but not yet updated the device based on the results. Given
that, we can just do the wait for pending paths without holding the
vecs lock at all.

It's sort of embarrassing that I did all this work to cut down on the
time we spend waiting but didn't realize that, with just a bit more
work, we could avoid sleeping while holding the lock altogether.

I already did some testing before to make sure that things like
reconfigures, adding/removing paths, and manually failing/restoring
paths didn't mess things up if they happened in the window where we
drop the lock in the middle of checking the paths. I'll probably need
to run some more tests and recheck that code if it's going to be a
non-rare occurrence for other threads to run in the middle of checking
paths. But this seems like too big a benefit to pass up.

-Ben

> Regards
> Martin
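
For illustration, here is a stand-alone sketch of the "one deadline for
the whole set of checkers" timing Martin describes above. This is not
multipathd code: the path checker is simulated with a worker thread,
and all names (fake_checker, start_checker, get_state, SOME_DELAY_MS)
are placeholders. The point is only the timing pattern: start all
checkers first, compute a single absolute end_time, and pass that same
deadline to every wait, so the total time spent waiting is bounded by
some_delay no matter how many paths are still pending.

/*
 * Sketch only, not multipathd code.  Simulates async path checkers
 * with threads to show the shared-deadline wait pattern.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NR_PATHS 4
#define SOME_DELAY_MS 20	/* total wait budget for the whole set */

struct fake_checker {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	pthread_t thread;
	bool done;
	int state;		/* 0 = "pending", 1 = "up" (pretend) */
};

static void *checker_thread(void *arg)
{
	struct fake_checker *c = arg;

	usleep(5000);		/* pretend the path check takes 5ms */
	pthread_mutex_lock(&c->lock);
	c->state = 1;
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
	return NULL;
}

static void start_checker(struct fake_checker *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
	c->state = 0;
	pthread_create(&c->thread, NULL, checker_thread, c);
}

/* Wait for this checker's result, but never past *end_time. */
static int get_state(struct fake_checker *c, const struct timespec *end_time)
{
	int state;

	pthread_mutex_lock(&c->lock);
	while (!c->done &&
	       pthread_cond_timedwait(&c->cond, &c->lock, end_time) == 0)
		;	/* loop on spurious wakeups until done or timed out */
	state = c->state;
	pthread_mutex_unlock(&c->lock);
	return state;
}

int main(void)
{
	struct fake_checker checkers[NR_PATHS];
	struct timespec end_time;
	int i;

	/* fire off all checkers first ... */
	for (i = 0; i < NR_PATHS; i++)
		start_checker(&checkers[i]);

	/* ... then compute one absolute deadline for the whole set */
	clock_gettime(CLOCK_REALTIME, &end_time);
	end_time.tv_nsec += SOME_DELAY_MS * 1000000L;
	if (end_time.tv_nsec >= 1000000000L) {
		end_time.tv_sec++;
		end_time.tv_nsec -= 1000000000L;
	}

	/*
	 * Every wait shares the same deadline: if the first pending path
	 * uses up the budget, the remaining calls return immediately.
	 */
	for (i = 0; i < NR_PATHS; i++)
		printf("path %d: %s\n", i,
		       get_state(&checkers[i], &end_time) ? "up" : "pending");

	for (i = 0; i < NR_PATHS; i++)
		pthread_join(checkers[i].thread, NULL);
	return 0;
}

Build with something like "cc -pthread sketch.c". With SOME_DELAY_MS
larger than the simulated 5ms check, all paths report "up"; shrink it
below 5ms and the first wait consumes the budget while the later calls
return "pending" without blocking, which is the behavior being argued
for above.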