On Mon 24-01-22 14:22:03, Minchan Kim wrote: [...] > CPU 0 CPU 1 > > lru_cache_disable lru_cache_disable > ret = atomic_inc_return;(ret = 1) > > ret = atomic_inc_return;(ret = 2) > > lru_add_drain_all(true); > lru_add_drain_all(false) > mutex_lock() is holding > mutex_lock() is waiting > > IPI with !force_all_cpus > ... > ... > IPI done but it skipped some CPUs > > .. > .. > > > Thus, lru_cache_disable on CPU 1 doesn't run with every CPUs so it > introduces race of lru_disable_count so some pages on cores > which didn't run the IPI could accept upcoming pages into per-cpu > cache. Yes, that is certainly possible but the question is whether it really matters all that much. The race would require also another racer to be adding a page to an _empty_ pcp list at the same time. pagevec_add_and_need_flush 1) pagevec_add # add to pcp list 2) lru_cache_disabled atomic_read(lru_disable_count) = 0 # no flush but the page is on pcp There is no strong memory ordering between 1 and 2 and that is why we need an IPI to enforce it in general IIRC. But lru_cache_disable is not a strong synchronization primitive. It aims at providing a best effort means to reduce false positives, right? IMHO it doesn't make much sense to aim for perfection because all users of this interface already have to live with temporary failures and pcp caches is not the only reason to fail - e.g. short lived page pins. That being said, I would rather live with a best effort and simpler implementation approach rather than aim for perfection in this case. The scheme is already quite complex and another lock in the mix doesn't make it any easier to follow. If others believe that another lock makes the implementation more straightforward I will not object but I would go with the following. diff --git a/mm/swap.c b/mm/swap.c index ae8d56848602..c140c3743b9e 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0); */ void lru_cache_disable(void) { - atomic_inc(&lru_disable_count); + int count = atomic_inc_return(&lru_disable_count); + #ifdef CONFIG_SMP /* * lru_add_drain_all in the force mode will schedule draining on @@ -931,8 +932,28 @@ void lru_cache_disable(void) * The atomic operation doesn't need to have stronger ordering * requirements because that is enforeced by the scheduling * guarantees. + * Please note that there is a potential for a race condition: + * CPU0 CPU1 CPU2 + * pagevec_add_and_need_flush + * pagevec_add # to the empty list + * lru_cache_disabled + * atomic_read # 0 + * lru_cache_disable lru_cache_disable + * atomic_inc_return (1) + * atomic_inc_return (2) + * __lru_add_drain_all(true) + * __lru_add_drain_all(false) + * mutex_lock + * mutex_lock + * # skip cpu0 (pagevec_add not visible yet) + * mutex_unlock + * # fail because of pcp(0) pin + * queue_work_on(0) + * + * but the scheme is a best effort and the above race quite unlikely + * to matter in real life. */ - __lru_add_drain_all(true); + __lru_add_drain_all(count == 1); #else lru_add_and_bh_lrus_drain(); #endif -- Michal Hocko SUSE Labs