On Tue, Jan 25, 2022 at 10:23:13AM +0100, Michal Hocko wrote:
> On Mon 24-01-22 14:22:03, Minchan Kim wrote:
> [...]
> >        CPU 0                                  CPU 1
> >
> >    lru_cache_disable                      lru_cache_disable
> >      ret = atomic_inc_return;(ret = 1)
> >
> >                                             ret = atomic_inc_return;(ret = 2)
> >
> >      lru_add_drain_all(true);
> >                                             lru_add_drain_all(false)
> >                                             mutex_lock() is holding
> >      mutex_lock() is waiting
> >
> >                                             IPI with !force_all_cpus
> >                                             ...
> >                                             ...
> >                                             IPI done but it skipped some CPUs
> >
> >      ..
> >      ..
> >
> >
> > Thus, lru_cache_disable on CPU 1 doesn't run with every CPUs so it
> > introduces race of lru_disable_count so some pages on cores
> > which didn't run the IPI could accept upcoming pages into per-cpu
> > cache.
>
> Yes, that is certainly possible but the question is whether it really
> matters all that much. The race would require also another racer to be
> adding a page to an _empty_ pcp list at the same time.
>
> pagevec_add_and_need_flush
>   1) pagevec_add # add to pcp list
>   2) lru_cache_disabled
>        atomic_read(lru_disable_count) = 0
>   # no flush but the page is on pcp
>
> There is no strong memory ordering between 1 and 2 and that is why we
> need an IPI to enforce it in general IIRC.

Correct.

> But lru_cache_disable is not a strong synchronization primitive. It aims
> at providing a best effort means to reduce false positives, right? IMHO

Nope. See d479960e44f27 ("mm: disable LRU pagevec during the migration
temporarily"). Originally, it was designed to close the race fundamentally.

> it doesn't make much sense to aim for perfection because all users of
> this interface already have to live with temporary failures and pcp
> caches is not the only reason to fail - e.g. short lived page pins.

Short-lived page pins are a real failure source, true, but that's no
reason not to make the allocation faster. As I mentioned, the IPI easily
takes up to hundreds of milliseconds once the CPUs are fully booked. If
we reduce that cost, we can spend the time on more productive work. I am
working on making CMA more deterministic and this patch is one part of
that.

> That being said, I would rather live with a best effort and simpler
> implementation approach rather than aim for perfection in this case.
> The scheme is already quite complex and another lock in the mix doesn't

lru_add_drain_all already hides the whole complexity internally, and
lru_cache_disable adds a simple synchronization on top of it to keep the
ordering. That's a natural SW stack and I don't see much complication
here.

> make it any easier to follow. If others believe that another lock makes

Disagree. lru_cache_disable is designed to close exactly the race you
are opening again, so that the rest of the allocator code doesn't need
to consider the race at all once the per-cpu caches are disabled. It's
simpler and more deterministic, and we could build other things on top
of it (e.g., zone->pcp).
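To make the ordering I mean concrete, here is a rough sketch (not the
actual patch; lru_disable_lock is a made-up name) where every
lru_cache_disable() caller returns only after its own full forced drain
has finished:

static DEFINE_MUTEX(lru_disable_lock);

void lru_cache_disable(void)
{
	/*
	 * Sketch only: serialize disablers so each one bumps the counter
	 * and then runs a full force_all_cpus drain before returning.
	 * A later caller can no longer piggyback on an earlier, weaker
	 * (!force_all_cpus) drain that skipped some CPUs.
	 */
	mutex_lock(&lru_disable_lock);
	atomic_inc(&lru_disable_count);
#ifdef CONFIG_SMP
	__lru_add_drain_all(true);
#else
	lru_add_and_bh_lrus_drain();
#endif
	mutex_unlock(&lru_disable_lock);
}

The point is that each caller's drain always runs in force_all_cpus
mode, so no caller depends on another caller's weaker drain. That is
the determinism CMA wants.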
> the implementation more straightforward I will not object but I would go
> with the following.
>
> diff --git a/mm/swap.c b/mm/swap.c
> index ae8d56848602..c140c3743b9e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>   */
>  void lru_cache_disable(void)
>  {
> -	atomic_inc(&lru_disable_count);
> +	int count = atomic_inc_return(&lru_disable_count);
> +
>  #ifdef CONFIG_SMP
>  	/*
>  	 * lru_add_drain_all in the force mode will schedule draining on
> @@ -931,8 +932,28 @@ void lru_cache_disable(void)
>  	 * The atomic operation doesn't need to have stronger ordering
>  	 * requirements because that is enforeced by the scheduling
>  	 * guarantees.
> +	 * Please note that there is a potential for a race condition:
> +	 * CPU0                         CPU1                  CPU2
> +	 * pagevec_add_and_need_flush
> +	 *   pagevec_add # to the empty list
> +	 *   lru_cache_disabled
> +	 *     atomic_read # 0
> +	 *                              lru_cache_disable     lru_cache_disable
> +	 *                                atomic_inc_return (1)
> +	 *                                                      atomic_inc_return (2)
> +	 *                                __lru_add_drain_all(true)
> +	 *                                                      __lru_add_drain_all(false)
> +	 *                                                      mutex_lock
> +	 *                                mutex_lock
> +	 *                                                      # skip cpu0 (pagevec_add not visible yet)
> +	 *                                                      mutex_unlock
> +	 *                                                      # fail because of pcp(0) pin
> +	 *                                queue_work_on(0)
> +	 *
> +	 * but the scheme is a best effort and the above race quite unlikely
> +	 * to matter in real life.
>  	 */
> -	__lru_add_drain_all(true);
> +	__lru_add_drain_all(count == 1);
>  #else
>  	lru_add_and_bh_lrus_drain();
>  #endif
> --
> Michal Hocko
> SUSE Labs