This is a bug we found in mm/vmscan.c at kernel version 4.9.82.

Summary: TASK A (normal priority) is in __remove_mapping() when it is preempted by TASK B (RT priority), which is in __read_swap_cache_async(). TASK A is preempted before it reaches swapcache_free(), leaving the SWAP_HAS_CACHE flag set in the swap map. TASK B then never succeeds at swapcache_prepare(entry), because the swap cache entry still exists, and since it is an RT thread it loops forever.

The spinlock is released before swapcache_free() is called, so the fix is to keep preemption disabled until swapcache_free() has executed. The loop TASK B gets stuck in is this one:

struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr,
			bool *new_page_allocated)
{
	struct page *found_page, *new_page = NULL;
	struct address_space *swapper_space = swap_address_space(entry);
	int err;
	*new_page_allocated = false;

	do {
		/*
		 * First check the swap cache.  Since this is normally
		 * called after lookup_swap_cache() failed, re-calling
		 * that would confuse statistics.
		 */
		found_page = find_get_page(swapper_space, swp_offset(entry));
		if (found_page)
			break;

		/*
		 * Get a new page to read into from swap.
		 */
		if (!new_page) {
			new_page = alloc_page_vma(gfp_mask, vma, addr);
			if (!new_page)
				break;		/* Out of memory */
		}

		/*
		 * call radix_tree_preload() while we can wait.
		 */
		err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
		if (err)
			break;

		/*
		 * Swap entry may have been freed since our caller observed it.
		 */
		err = swapcache_prepare(entry);
		if (err == -EEXIST) {
			radix_tree_preload_end();
			/*
			 * We might race against get_swap_page() and stumble
			 * across a SWAP_HAS_CACHE swap_map entry whose page
			 * has not been brought into the swapcache yet, while
			 * the other end is scheduled away waiting on discard
			 * I/O completion at scan_swap_map().
			 *
			 * In order to avoid turning this transitory state
			 * into a permanent loop around this -EEXIST case
			 * if !CONFIG_PREEMPT and the I/O completion happens
			 * to be waiting on the CPU waitqueue where we are now
			 * busy looping, we just conditionally invoke the
			 * scheduler here, if there are some more important
			 * tasks to run.
			 */
			cond_resched();
			continue;	/* an RT task will loop here forever */
		}
		[...]

zhaowuyun@xxxxxxxxxxxx

From: Michal Hocko
Date: 2018-07-25 15:40
To: zhaowuyun@xxxxxxxxxxxx
CC: mgorman; akpm; minchan; vinmenon; hannes; hillf.zj; linux-mm; linux-kernel
Subject: Re: [PATCH] [PATCH] mm: disable preemption before swapcache_free

On Wed 25-07-18 14:37:58, zhaowuyun@xxxxxxxxxxxx wrote:
[...]
> Change-Id: I36d9df7ccff77c589b7157225410269c675a8504

What is this?

> Signed-off-by: zhaowuyun <zhaowuyun@xxxxxxxxxxxx>
> ---
>  mm/vmscan.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2740973..acede002 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -674,6 +674,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(mapping != page_mapping(page));
>
> +	/*
> +	 * Preemption must be disabled so that the current task cannot be
> +	 * preempted, before swapcache_free(swap) runs, by a task doing
> +	 * __read_swap_cache_async() on the same page.
> +	 */
> +	preempt_disable();
>  	spin_lock_irqsave(&mapping->tree_lock, flags);

Hmm, but spin_lock_irqsave already implies the disabled preemption.
--
Michal Hocko
SUSE Labs
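
For reference, the __remove_mapping() path that the patch touches looks roughly like this in v4.9. This is a paraphrased sketch, not a verbatim copy of the source: the busy/dirty checks, the page-cache branch and the error path are omitted. The window the report describes opens between spin_unlock_irqrestore(), which re-enables preemption, and swapcache_free(), which clears SWAP_HAS_CACHE:

static int __remove_mapping(struct address_space *mapping, struct page *page,
			    bool reclaimed)
{
	unsigned long flags;

	BUG_ON(!PageLocked(page));
	BUG_ON(mapping != page_mapping(page));

	/* spin_lock_irqsave() already disables preemption while the lock is
	 * held, which is Michal's point above. */
	spin_lock_irqsave(&mapping->tree_lock, flags);
	/* ... busy/dirty checks omitted ... */

	if (PageSwapCache(page)) {
		swp_entry_t swap = { .val = page_private(page) };

		mem_cgroup_swapout(page, swap);
		__delete_from_swap_cache(page);
		/* Preemption becomes possible again the moment the lock is
		 * dropped ... */
		spin_unlock_irqrestore(&mapping->tree_lock, flags);
		/* ... but SWAP_HAS_CACHE is only cleared here.  If the task
		 * is preempted in between by an RT task spinning in
		 * __read_swap_cache_async() on the same entry, that task
		 * keeps getting -EEXIST back from swapcache_prepare(). */
		swapcache_free(swap);
	} else {
		/* ... page-cache case omitted ... */
	}

	return 1;
}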