>Andrew's msleep(1) may be a good enough bandaid for you. And I share
>Michal's doubt about your design, in which an RT thread meets swap:
>this may not be the last problem you have with that.
>
>But this is a real bug when CONFIG_PREEMPT=y, RT threads or not: we
>just didn't notice, because it's usually hidden by the cond_resched().
>(I think that was added in 3.10, because in 2.6.29 I had been guilty of
>inserting a discard, and wait for completion, in between allocating swap
>and adding to swap cache; but Shaohua fixed my discard in 3.12.) Thanks
>a lot for making us aware of this bug.
>
>After reminding myself of the issues here, I disagree with much of what
>has been said: we shall "always" want the loop in __read_swap_cache_async()
>(though some of its error handling is probably superfluous now, agreed);
>and your disabling of preemption is not just a bandaid, it's exactly the
>right approach.
>
>We could do something heavier, perhaps rearranging the swapcache tree work
>to be done under swap_lock as well as tree_lock (I'm talking 4.9-language),
>but that's very unlikely to be an improvement. Disabling preemption yokes
>the two spinlocks together in an efficient way, without overhead on other
>paths; on rare occasions we spin around __read_swap_cache_async() instead
>of spinning around to acquire a spinlock.
>
>But your patch is incomplete. The same needs to be done in delete_from_
>swap_cache(), and we would also need to protect against preemption between
>the get_swap_page() and the add_to_swap_cache(), in add_to_swap() and in
>shmem_writepage(). The latter gets messy, but 4.11 (where Tim Chen uses
>SWAP_HAS_CACHE more widely) gives a good hint: __read_swap_cache_async()
>callers are only interested in swap entries that are already in use and
>still in use. (Though swapoff has to be more careful, partly because one
>of its jobs is to clear out swap-cache-only entries, partly because the
>current interface would mistake a NULL for no-entry as out-of-memory.)
>
>Below is the patch I would do for 4.9 (diffed against 4.9.117), and I'm
>showing that because it's the simplest case. Although the principles stay
>the same, the codebase here has gone through several shifts, and 4.19 will
>probably be different again. So I'll test and post a patch against 4.19-rc
>in a few weeks time, and that can then be backported to stable: but will
>need several manual backports because of the intervening changes.
>
>I did wonder whether just to extend the irq-disabled section in
>delete_from_swap_cache() etc: that's easy, and works, and is even better
>protection against spinning too long; but it's not absolutely necessary,
>so all in all, probably better avoided. I did wonder whether to remove
>the cond_resched(), but it's not a bad place for one, so I've left it in.
>
>When checking worst cases of looping around __read_swap_cache_async(),
>after the patch, I was worried for a while. I had naively imagined that
>going more than twice around the loop should be vanishingly rare, but
>that is not so at all. But the bad cases I looked into were all the same:
>after forking, two processes, on HT siblings, each serving do_swap_page(),
>trying to bring the same swap into its mm, with a sparse swapper_space
>tree: one process gets to do all the work of allocating new radix-tree
>nodes and bringing them into D-cache, while the other just spins around
>__read_swap_cache_async() seeing SWAP_HAS_CACHE but not yet the page in
>the radix-tree. That's okay, that makes sense.
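
To make the patch below easier to follow, here is roughly the 4.9.117 loop
in question, abridged by hand, so a sketch of its shape rather than the
exact source; the -EEXIST branch is where a task spins while another task
holds SWAP_HAS_CACHE but the page is not, or not yet, in the radix-tree:

struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr,
			bool *new_page_allocated)
{
	struct page *found_page, *new_page = NULL;
	struct address_space *swapper_space = swap_address_space(entry);
	int err;

	*new_page_allocated = false;
	do {
		/* Done if the page is already in the swap cache. */
		found_page = find_get_page(swapper_space, swp_offset(entry));
		if (found_page)
			break;

		/* Allocate a page to read into (kept across retries). */
		if (!new_page) {
			new_page = alloc_page_vma(gfp_mask, vma, addr);
			if (!new_page)
				break;		/* Out of memory */
		}

		err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
		if (err)
			break;

		/* Claim the entry by setting SWAP_HAS_CACHE in swap_map. */
		err = swapcache_prepare(entry);
		if (err == -EEXIST) {
			radix_tree_preload_end();
			/*
			 * Someone else holds SWAP_HAS_CACHE, but the page is
			 * not (or not yet, or no longer) in the radix-tree:
			 * go round again until that state settles.
			 */
			cond_resched();
			continue;
		}
		if (err) {			/* swap entry is obsolete */
			radix_tree_preload_end();
			break;
		}

		/* Won the race: insert the page and return it locked. */
		__SetPageLocked(new_page);
		__SetPageSwapBacked(new_page);
		err = __add_to_swap_cache(new_page, entry);
		if (likely(!err)) {
			radix_tree_preload_end();
			lru_cache_add_anon(new_page);
			*new_page_allocated = true;
			return new_page;
		}
		/* Insertion failed: back out and drop SWAP_HAS_CACHE. */
		radix_tree_preload_end();
		__ClearPageLocked(new_page);
		swapcache_free(entry);
	} while (err != -ENOMEM);

	if (new_page)
		put_page(new_page);
	return found_page;
}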
>
>Hugh
>---
>
> mm/swap_state.c |   26 +++++++++++++-------------
> mm/swapfile.c   |    8 +++++++-
> mm/vmscan.c     |    3 +++
> 3 files changed, 23 insertions(+), 14 deletions(-)
>
>--- 4.9.117/mm/swap_state.c	2016-12-11 11:17:54.000000000 -0800
>+++ linux/mm/swap_state.c	2018-08-04 11:50:46.577788766 -0700
>@@ -225,9 +225,11 @@ void delete_from_swap_cache(struct page
> 	address_space = swap_address_space(entry);
> 	spin_lock_irq(&address_space->tree_lock);
> 	__delete_from_swap_cache(page);
>+	/* Expedite swapcache_free() to help __read_swap_cache_async() */
>+	preempt_disable();
> 	spin_unlock_irq(&address_space->tree_lock);
>-
> 	swapcache_free(entry);
>+	preempt_enable();
> 	put_page(page);
> }
>
>@@ -337,19 +339,17 @@ struct page *__read_swap_cache_async(swp
> 		if (err == -EEXIST) {
> 			radix_tree_preload_end();
> 			/*
>-			 * We might race against get_swap_page() and stumble
>-			 * across a SWAP_HAS_CACHE swap_map entry whose page
>-			 * has not been brought into the swapcache yet, while
>-			 * the other end is scheduled away waiting on discard
>-			 * I/O completion at scan_swap_map().
>+			 * We might race against __delete_from_swap_cache() and
>+			 * stumble across a swap_map entry whose SWAP_HAS_CACHE
>+			 * has not yet been cleared: hence preempt_disable()
>+			 * in __remove_mapping() and delete_from_swap_cache(),
>+			 * so they cannot schedule away before clearing it.
> 			 *
>-			 * In order to avoid turning this transitory state
>-			 * into a permanent loop around this -EEXIST case
>-			 * if !CONFIG_PREEMPT and the I/O completion happens
>-			 * to be waiting on the CPU waitqueue where we are now
>-			 * busy looping, we just conditionally invoke the
>-			 * scheduler here, if there are some more important
>-			 * tasks to run.
>+			 * We need similar protection against racing calls to
>+			 * __read_swap_cache_async(): preempt_disable() before
>+			 * swapcache_prepare() above, preempt_enable() after
>+			 * __add_to_swap_cache() below: which are already in
>+			 * radix_tree_maybe_preload(), radix_tree_preload_end().
> 			 */
> 			cond_resched();
> 			continue;
>--- 4.9.117/mm/swapfile.c	2018-08-04 11:40:08.463504848 -0700
>+++ linux/mm/swapfile.c	2018-08-04 11:50:46.577788766 -0700
>@@ -2670,7 +2670,13 @@ static int __swap_duplicate(swp_entry_t
> 		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
> 		if (!has_cache && count)
> 			has_cache = SWAP_HAS_CACHE;
>-		else if (has_cache)		/* someone else added cache */
>+		/*
>+		 * __read_swap_cache_async() can usually skip entries without
>+		 * real usage (including those in between being allocated and
>+		 * added to swap cache); but swapoff (!SWP_WRITEOK) must not.
>+		 */
>+		else if (has_cache &&
>+			 (count || !(p->flags & SWP_WRITEOK)))
> 			err = -EEXIST;
> 		else				/* no users remaining */
> 			err = -ENOENT;
>--- 4.9.117/mm/vmscan.c	2018-08-04 11:40:08.471504902 -0700
>+++ linux/mm/vmscan.c	2018-08-04 11:50:46.577788766 -0700
>@@ -709,8 +709,11 @@ static int __remove_mapping(struct addre
> 		swp_entry_t swap = { .val = page_private(page) };
> 		mem_cgroup_swapout(page, swap);
> 		__delete_from_swap_cache(page);
>+		/* Expedite swapcache_free() for __read_swap_cache_async() */
>+		preempt_disable();
> 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> 		swapcache_free(swap);
>+		preempt_enable();
> 	} else {
> 		void (*freepage)(struct page *);
> 		void *shadow = NULL;

Hi Hugh,

Thanks for affirming the approach of disabling preemption, and for pointing
out the incompleteness: delete_from_swap_cache() needs the same protection.

I'm curious, though: why not put swapcache_free(swap) under the protection
of mapping->tree_lock?
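
Something like this, say (an untested sketch against 4.9's __remove_mapping(),
only to show what I mean, not a patch I have tried):

	if (PageSwapCache(page)) {
		swp_entry_t swap = { .val = page_private(page) };
		mem_cgroup_swapout(page, swap);
		__delete_from_swap_cache(page);
		/*
		 * Free the swap slot before dropping tree_lock, so that this
		 * task cannot be preempted while still holding SWAP_HAS_CACHE,
		 * instead of disabling preemption across the unlock.
		 */
		swapcache_free(swap);
		spin_unlock_irqrestore(&mapping->tree_lock, flags);
	} else {
		/* ... unchanged ... */
	}

That would close the window between __delete_from_swap_cache() and
swapcache_free() without touching preemption, at the cost of holding
tree_lock, with interrupts disabled, a little longer.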