Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable

Michal Hocko <mhocko@xxxxxxxx> · Tue, 25 Jan 2022 10:23:13 +0100

On Mon 24-01-22 14:22:03, Minchan Kim wrote:
[...]
>  CPU 0                               CPU 1
> 
>  lru_cache_disable                  lru_cache_disable
>    ret = atomic_inc_return;(ret = 1)
>                                      
>                                     ret = atomic_inc_return;(ret = 2)
>                                     
>    lru_add_drain_all(true);         
>                                     lru_add_drain_all(false)
>                                     mutex_lock() is holding
>    mutex_lock() is waiting
> 
>                                     IPI with !force_all_cpus
>                                     ...
>                                     ...
>                                     IPI done but it skipped some CPUs
>                
>      ..
>      ..
>  
> 
> Thus, lru_cache_disable on CPU 1 doesn't run with every CPUs so it
> introduces race of lru_disable_count so some pages on cores
> which didn't run the IPI could accept upcoming pages into per-cpu
> cache.

Yes, that is certainly possible but the question is whether it really
matters all that much. The race would require also another racer to be
adding a page to an _empty_ pcp list at the same time.

pagevec_add_and_need_flush
  1) pagevec_add # add to pcp list
  2) lru_cache_disabled
    atomic_read(lru_disable_count) = 0
  # no flush but the page is on pcp

There is no strong memory ordering between 1 and 2 and that is why we
need an IPI to enforce it in general IIRC.

But lru_cache_disable is not a strong synchronization primitive. It aims
at providing a best effort means to reduce false positives, right? IMHO
it doesn't make much sense to aim for perfection because all users of
this interface already have to live with temporary failures and pcp
caches is not the only reason to fail - e.g. short lived page pins.

That being said, I would rather live with a best effort and simpler
implementation approach rather than aim for perfection in this case.
The scheme is already quite complex and another lock in the mix doesn't
make it any easier to follow. If others believe that another lock makes
the implementation more straightforward I will not object but I would go
with the following.

diff --git a/mm/swap.c b/mm/swap.c
index ae8d56848602..c140c3743b9e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
  */
 void lru_cache_disable(void)
 {
-	atomic_inc(&lru_disable_count);
+	int count = atomic_inc_return(&lru_disable_count);
+
 #ifdef CONFIG_SMP
 	/*
 	 * lru_add_drain_all in the force mode will schedule draining on
@@ -931,8 +932,28 @@ void lru_cache_disable(void)
 	 * The atomic operation doesn't need to have stronger ordering
 	 * requirements because that is enforeced by the scheduling
 	 * guarantees.
+	 * Please note that there is a potential for a race condition:
+	 * CPU0				CPU1			CPU2
+	 * pagevec_add_and_need_flush
+	 *   pagevec_add # to the empty list
+	 *   lru_cache_disabled
+	 *     atomic_read # 0
+	 *   				lru_cache_disable	lru_cache_disable
+	 *				  atomic_inc_return (1)
+	 *				  			  atomic_inc_return (2)
+	 *				  __lru_add_drain_all(true)
+	 *				  			  __lru_add_drain_all(false)
+	 *				  			    mutex_lock
+	 *				    mutex_lock
+	 *				    			    # skip cpu0 (pagevec_add not visible yet)
+	 *				    			    mutex_unlock
+	 *				    			 # fail because of pcp(0) pin
+	 *				    queue_work_on(0)
+	 *
+	 * but the scheme is a best effort and the above race quite unlikely
+	 * to matter in real life.
 	 */
-	__lru_add_drain_all(true);
+	__lru_add_drain_all(count == 1);
 #else
 	lru_add_and_bh_lrus_drain();
 #endif
-- 
Michal Hocko
SUSE Labs