On Tue, Jan 14, 2025 at 10:36:06PM +0100, Martin Wilck wrote: > On Thu, 2024-12-19 at 18:04 -0500, Benjamin Marzinski wrote: > > On Wed, Dec 11, 2024 at 11:58:58PM +0100, Martin Wilck wrote: > > > @@ -3032,11 +3026,13 @@ checkerloop (void *ap) > > > > > > start_time.tv_sec); > > > if (checker_state == CHECKER_FINISHED) { > > > vector_foreach_slot(vecs->mpvec, > > > mpp, i) { > > > - if ((update_mpp_prio(mpp) > > > || > > > - (mpp->need_reload && > > > mpp->synced_count > 0)) && > > > > When you call reload_and_sync_map(), it will automatically resync the > > map via setup_multipath() -> refresh_multipath() -> > > update_multipath_strings(). > > > > This means that if for some reason multipathd and the kernel disagree > > about a map, and reloading it doesn't fix the problem, you will > > immediately set mpp->need_reload again. With the old mpp- > > >synced_count > > check, you only reload maps with need_reload() when a path is > > checked. > > Without this check, or a (mpp->checker_count > 0) check to replace > > it, > > you will keep reloading these maps every loop, roughly once a second. > > I > > would rather not do this. > > > > If you want to make sure to immediately handle a need_reload that > > wasn't > > triggered by this call to reload_and_sync_map() which was because of > > an > > earlier need_reload, we could make need_reload have three states, to > > distinguish between a reload we want done immediately, and one we > > would > > like to wait on because we just did a reload and it didn't fix the > > problem. Then we could remember if need_reload was set before calling > > reload_and_sync_map(), and if it was, and if it is still set after, > > we > > could switch it to the delayed version. > > > > Or perhaps I'm just being paranoid here. > > As you probably know and as I recently verified, reloading the kernel > from the checker loop will hardly ever fail except with -ENOMEM [1]. We > can pass non-existing or failed devices to the kernel, it will happily > accept them. > > update_pathvec_from_dm() never adds any devices to the map, it just > removes some. If need_reload is set, it means that it has removed > either a path or an entire pathgroup. When the map is reloaded, it will > only reference (a subset of) the devices that were already mapped. I > see no way how this could fail unless either multipathd or the kernel > are really badly malfunctioning, in which case we don't need to bother > about reloading too frequently. > > But *if* the reload succeeds, the set of devices in the kernel is > guaranteed to be identical to the table that we've just used for > reloading. So only way that another difference between kernel and > multipath state could occur between the reload and > update_pathvec_from_dm() running again is that another device has just > diappeared from the system, in which case a quick reload would be a > reasonable action. (Well I guess another possibility would be a 3rd > party maliciously adding wrong path devices to the maps we maintain, > but that's not something we can do much about). > > If need_reload is indeed set again in this situation, I would indeed > prefer to double-check this map quickly. As argued above, I strongly > believe that such a situation will not persist. IMO a detected > inconsistency between the kernel and multipathd is a very bad thing > that we should try to fix rather sooner than later. It's at least as > bad as a failed path, which we'll check every second, too. > > Bottom line: I think re-checking this quickly is actually the right > thing to do. Would you accept this if I add a warning in the > "inconsistent" case, so that in the event that we actually run into a > persistent discrepancy situation, we will notice? > Sure. This is probably just my paranoia here, since I can't actually come up with a concrete case where there would be a persistent discrepancy. If it ever happens, it's almost definitely a bug in the multipath code. -Ben > Regards, > Martin > > [1] > https://lore.kernel.org/dm-devel/ee6fcbda31fd1f13969653782417fbed748f5bc7.camel@xxxxxxxx/ > > > -Ben > > > > > - > > > reload_and_sync_map(mpp, vecs) == 2) > > > + sync_mpp(vecs, mpp, > > > ticks); > > > + if ((update_mpp_prio(mpp) > > > || mpp->need_reload) && > > > + > > > reload_and_sync_map(mpp, vecs) == 2) { > > > /* multipath > > > device deleted */ > > > i--; > > > + continue; > > > + } > > > } > > > } > > > lock_cleanup_pop(vecs->lock); > > > -- > > > 2.47.0 > >