On Thu, May 21, 2020 at 10:56:20PM -0700, Hugh Dickins wrote:
> I've only seen this livelock on one machine (repeatably, but not to
> order), and not fully analyzed it - two processes seen looping around
> getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> priority? but wanting the same cpu as one of the loopers? preemption
> or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> then went off into direct reclaim, scheduled away, and somehow could
> not get back to add the page to swap cache and let them all complete.
> 
> Restore the page allocation in __read_swap_cache_async() to before
> the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> on instantiation" moved it outside the loop, which indeed looks much
> nicer, but exposed this weakness. We used to allocate new_page once
> and then keep it across all iterations of the loop: but I think that
> just optimizes for a rare case, and complicates the flow, so go with
> the new simpler structure, with allocate+free each time around (which
> is more considerate use of the memory too).
> 
> Fix the comment on the looping case, which has long been inaccurate:
> it's not a racing get_swap_page() that's the problem here.
> 
> Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> done before; but delete_from_swap_cache() already includes it.
> 
> And one more nit: I don't think it makes any difference in practice,
> but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> add_to_swap_cache() needs that, to convert gfp_mask from user and page
> cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> we don't need or usually apply that mask when charging mem_cgroup.
> 
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> ---
> Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> made a further change here (took an arg off the mem_cgroup_charge call):
> as is, this patch is diffed to go on top of both of them, and better
> that I get it out now for Johannes to look at; but could be rediffed for
> folding into blah-instantiation.patch later.
> 
> Earlier in the day I promised two patches to __read_swap_cache_async(),
> but find now that I cannot quite justify the second patch: it makes a
> slight adjustment in swapcache_prepare(), then removes the redundant
> __swp_swapcount() && swap_slot_cache_enabled business from blah_async().
> 
> I'd still like to do that, but this patch here brings back the
> alloc_page_vma() in between them, and I don't have any evidence to
> reassure us that I'm not then pessimizing a readahead case by doing
> unnecessary allocation and free. Leave it for some other time perhaps.
> 
>  mm/swap_state.c |   52 +++++++++++++++++++++++++---------------------
>  1 file changed, 29 insertions(+), 23 deletions(-)
> 
> --- 5.7-rc6-mm1/mm/swap_state.c	2020-05-20 12:21:56.149694170 -0700
> +++ linux/mm/swap_state.c	2020-05-21 20:17:50.188773901 -0700
> @@ -392,56 +392,62 @@ struct page *__read_swap_cache_async(swp
>  			return NULL;
>  
>  		/*
> +		 * Get a new page to read into from swap. Allocate it now,
> +		 * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
> +		 * cause any racers to loop around until we add it to cache.
> +		 */
> +		page = alloc_page_vma(gfp_mask, vma, addr);
> +		if (!page)
> +			return NULL;
> +
> +		/*
>  		 * Swap entry may have been freed since our caller observed it.
>  		 */
>  		err = swapcache_prepare(entry);
>  		if (!err)
>  			break;
>  
> -		if (err == -EEXIST) {
> -			/*
> -			 * We might race against get_swap_page() and stumble
> -			 * across a SWAP_HAS_CACHE swap_map entry whose page
> -			 * has not been brought into the swapcache yet.
> -			 */
> -			cond_resched();
> -			continue;
> -		}
> +		put_page(page);
> +		if (err != -EEXIST)
> +			return NULL;
>  
> -		return NULL;
> +		/*
> +		 * We might race against __delete_from_swap_cache(), and
> +		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
> +		 * has not yet been cleared. Or race against another
> +		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> +		 * in swap_map, but not yet added its page to swap cache.
> +		 */
> +		cond_resched();
>  	}
>  
>  	/*
> -	 * The swap entry is ours to swap in. Prepare a new page.
> +	 * The swap entry is ours to swap in. Prepare the new page.
>  	 */
>  
> -	page = alloc_page_vma(gfp_mask, vma, addr);
> -	if (!page)
> -		goto fail_free;
> -
>  	__SetPageLocked(page);
>  	__SetPageSwapBacked(page);
>  
>  	/* May fail (-ENOMEM) if XArray node allocation failed. */
> -	if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> +	if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
> +		put_swap_page(page, entry);
>  		goto fail_unlock;
> +	}
>  
> -	if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
> -		goto fail_delete;
> +	if (mem_cgroup_charge(page, NULL, gfp_mask)) {
> +		delete_from_swap_cache(page);
> +		goto fail_unlock;
> +	}
>  
> -	/* Initiate read into locked page */
> +	/* Caller will initiate read into locked page */
>  	SetPageWorkingset(page);
>  	lru_cache_add_anon(page);
>  	*new_page_allocated = true;
>  	return page;
>  
> -fail_delete:
> -	delete_from_swap_cache(page);
>  fail_unlock:
>  	unlock_page(page);
>  	put_page(page);
> -fail_free:
> -	swap_free(entry);
>  	return NULL;
>  }
> 

Acked-by: Rafael Aquini <aquini@xxxxxxxxxx>
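
For anyone reading along without the -mm tree handy, here is a rough sketch of
how the loop in __read_swap_cache_async() reads with the patch applied; the
swap cache lookup and unused-slot checks at the top of the loop are elided and
the comments condensed, so treat it as an illustration of the control flow
rather than the exact file contents:

	for (;;) {
		int err;

		/* ... swap cache lookup and unused-slot checks elided ... */

		/*
		 * Allocate before taking SWAP_HAS_CACHE, so that racers who
		 * see -EEXIST only have to wait for add_to_swap_cache()
		 * below, never for an allocation that may enter reclaim.
		 */
		page = alloc_page_vma(gfp_mask, vma, addr);
		if (!page)
			return NULL;

		err = swapcache_prepare(entry);
		if (!err)
			break;			/* SWAP_HAS_CACHE is now ours */

		put_page(page);			/* free the page, then retry or fail */
		if (err != -EEXIST)
			return NULL;
		cond_resched();			/* let the SWAP_HAS_CACHE holder finish */
	}

	__SetPageLocked(page);
	__SetPageSwapBacked(page);

	if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
		put_swap_page(page, entry);	/* undo SWAP_HAS_CACHE */
		goto fail_unlock;
	}

	if (mem_cgroup_charge(page, NULL, gfp_mask)) {
		delete_from_swap_cache(page);	/* already includes put_swap_page() */
		goto fail_unlock;
	}

	/* Caller will initiate read into locked page */
	SetPageWorkingset(page);
	lru_cache_add_anon(page);
	*new_page_allocated = true;
	return page;

fail_unlock:
	unlock_page(page);
	put_page(page);
	return NULL;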