>Andrew's msleep(1) may be a good enough bandaid for you. And I share
>Michal's doubt about your design, in which an RT thread meets swap:
>this may not be the last problem you have with that.
>
>But this is a real bug when CONFIG_PREEMPT=y, RT threads or not: we
>just didn't notice, because it's usually hidden by the cond_resched().
>(I think that was added in 3.10, because in 2.6.29 I had been guilty of
>inserting a discard, and wait for completion, in between allocating swap
>and adding to swap cache; but Shaohua fixed my discard in 3.12.) Thanks
>a lot for making us aware of this bug.
>
>After reminding myself of the issues here, I disagree with much of what
>has been said: we shall "always" want the loop in __read_swap_cache_async()
>(though some of its error handling is probably superfluous now, agreed);
>and your disabling of preemption is not just a bandaid, it's exactly the
>right approach.
>
>We could do something heavier, perhaps rearranging the swapcache tree work
>to be done under swap_lock as well as tree_lock (I'm talking 4.9-language),
>but that's very unlikely to be an improvement. Disabling preemption yokes
>the two spinlocks together in an efficient way, without overhead on other
>paths; on rare occasions we spin around __read_swap_cache_async() instead
>of spinning around to acquire a spinlock.
>
>But your patch is incomplete. The same needs to be done in delete_from_
>swap_cache(), and we would also need to protect against preemption between
>the get_swap_page() and the add_to_swap_cache(), in add_to_swap() and in
>shmem_writepage(). The latter gets messy, but 4.11 (where Tim Chen uses
>SWAP_HAS_CACHE more widely) gives a good hint: __read_swap_cache_async()
>callers are only interested in swap entries that are already in use and
>still in use. (Though swapoff has to be more careful, partly because one
>of its jobs is to clear out swap-cache-only entries, partly because the
>current interface would mistake a NULL for no-entry as out-of-memory.)
>
>Below is the patch I would do for 4.9 (diffed against 4.9.117), and I'm
>showing that because it's the simplest case. Although the principles stay
>the same, the codebase here has gone through several shifts, and 4.19 will
>probably be different again. So I'll test and post a patch against 4.19-rc
>in a few weeks time, and that can then be backported to stable: but will
>need several manual backports because of the intervening changes.
>
>I did wonder whether just to extend the irq-disabled section in
>delete_from_swap_cache() etc: that's easy, and works, and is even better
>protection against spinning too long; but it's not absolutely necessary,
>so all in all, probably better avoided. I did wonder whether to remove
>the cond_resched(), but it's not a bad place for one, so I've left it in.
>
>When checking worst cases of looping around __read_swap_cache_async(),
>after the patch, I was worried for a while. I had naively imagined that
>going more than twice around the loop should be vanishingly rare, but
>that is not so at all. But the bad cases I looked into were all the same:
>after forking, two processes, on HT siblings, each serving do_swap_page(),
>trying to bring the same swap into its mm, with a sparse swapper_space
>tree: one process gets to do all the work of allocating new radix-tree
>nodes and bringing them into D-cache, while the other just spins around
>__read_swap_cache_async() seeing SWAP_HAS_CACHE but not yet the page in
>the radix-tree. That's okay, that makes sense.
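
To make the patch below easier to follow, here is roughly the 4.9.117 loop
in question, abridged by hand, so a sketch of its shape rather than the
exact source; the -EEXIST branch is where a task spins while another task
holds SWAP_HAS_CACHE but the page is not, or not yet, in the radix-tree:

struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr,
			bool *new_page_allocated)
{
	struct page *found_page, *new_page = NULL;
	struct address_space *swapper_space = swap_address_space(entry);
	int err;

	*new_page_allocated = false;
	do {
		/* Done if the page is already in the swap cache. */
		found_page = find_get_page(swapper_space, swp_offset(entry));
		if (found_page)
			break;

		/* Allocate a page to read into (kept across retries). */
		if (!new_page) {
			new_page = alloc_page_vma(gfp_mask, vma, addr);
			if (!new_page)
				break;		/* Out of memory */
		}

		err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
		if (err)
			break;

		/* Claim the entry by setting SWAP_HAS_CACHE in swap_map. */
		err = swapcache_prepare(entry);
		if (err == -EEXIST) {
			radix_tree_preload_end();
			/*
			 * Someone else holds SWAP_HAS_CACHE, but the page is
			 * not (or not yet, or no longer) in the radix-tree:
			 * go round again until that state settles.
			 */
			cond_resched();
			continue;
		}
		if (err) {			/* swap entry is obsolete */
			radix_tree_preload_end();
			break;
		}

		/* Won the race: insert the page and return it locked. */
		__SetPageLocked(new_page);
		__SetPageSwapBacked(new_page);
		err = __add_to_swap_cache(new_page, entry);
		if (likely(!err)) {
			radix_tree_preload_end();
			lru_cache_add_anon(new_page);
			*new_page_allocated = true;
			return new_page;
		}
		/* Insertion failed: back out and drop SWAP_HAS_CACHE. */
		radix_tree_preload_end();
		__ClearPageLocked(new_page);
		swapcache_free(entry);
	} while (err != -ENOMEM);

	if (new_page)
		put_page(new_page);
	return found_page;
}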
>
>Hugh
>---
>
> mm/swap_state.c |   26 +++++++++++++-------------
> mm/swapfile.c   |    8 +++++++-
> mm/vmscan.c     |    3 +++
> 3 files changed, 23 insertions(+), 14 deletions(-)
>
>--- 4.9.117/mm/swap_state.c	2016-12-11 11:17:54.000000000 -0800
>+++ linux/mm/swap_state.c	2018-08-04 11:50:46.577788766 -0700
>@@ -225,9 +225,11 @@ void delete_from_swap_cache(struct page
> 	address_space = swap_address_space(entry);
> 	spin_lock_irq(&address_space->tree_lock);
> 	__delete_from_swap_cache(page);
>+	/* Expedite swapcache_free() to help __read_swap_cache_async() */
>+	preempt_disable();
> 	spin_unlock_irq(&address_space->tree_lock);
>-
> 	swapcache_free(entry);
>+	preempt_enable();
> 	put_page(page);
> }
>
>@@ -337,19 +339,17 @@ struct page *__read_swap_cache_async(swp
> 		if (err == -EEXIST) {
> 			radix_tree_preload_end();
> 			/*
>-			 * We might race against get_swap_page() and stumble
>-			 * across a SWAP_HAS_CACHE swap_map entry whose page
>-			 * has not been brought into the swapcache yet, while
>-			 * the other end is scheduled away waiting on discard
>-			 * I/O completion at scan_swap_map().
>+			 * We might race against __delete_from_swap_cache() and
>+			 * stumble across a swap_map entry whose SWAP_HAS_CACHE
>+			 * has not yet been cleared: hence preempt_disable()
>+			 * in __remove_mapping() and delete_from_swap_cache(),
>+			 * so they cannot schedule away before clearing it.
> 			 *
>-			 * In order to avoid turning this transitory state
>-			 * into a permanent loop around this -EEXIST case
>-			 * if !CONFIG_PREEMPT and the I/O completion happens
>-			 * to be waiting on the CPU waitqueue where we are now
>-			 * busy looping, we just conditionally invoke the
>-			 * scheduler here, if there are some more important
>-			 * tasks to run.
>+			 * We need similar protection against racing calls to
>+			 * __read_swap_cache_async(): preempt_disable() before
>+			 * swapcache_prepare() above, preempt_enable() after
>+			 * __add_to_swap_cache() below: which are already in
>+			 * radix_tree_maybe_preload(), radix_tree_preload_end().
> 			 */
> 			cond_resched();
> 			continue;
>--- 4.9.117/mm/swapfile.c	2018-08-04 11:40:08.463504848 -0700
>+++ linux/mm/swapfile.c	2018-08-04 11:50:46.577788766 -0700
>@@ -2670,7 +2670,13 @@ static int __swap_duplicate(swp_entry_t
> 		/* set SWAP_HAS_CACHE if there is no cache and entry is used */
> 		if (!has_cache && count)
> 			has_cache = SWAP_HAS_CACHE;
>-		else if (has_cache)		/* someone else added cache */
>+		/*
>+		 * __read_swap_cache_async() can usually skip entries without
>+		 * real usage (including those in between being allocated and
>+		 * added to swap cache); but swapoff (!SWP_WRITEOK) must not.
>+		 */
>+		else if (has_cache &&
>+			 (count || !(p->flags & SWP_WRITEOK)))
> 			err = -EEXIST;
> 		else				/* no users remaining */
> 			err = -ENOENT;
>--- 4.9.117/mm/vmscan.c	2018-08-04 11:40:08.471504902 -0700
>+++ linux/mm/vmscan.c	2018-08-04 11:50:46.577788766 -0700
>@@ -709,8 +709,11 @@ static int __remove_mapping(struct addre
> 		swp_entry_t swap = { .val = page_private(page) };
> 		mem_cgroup_swapout(page, swap);
> 		__delete_from_swap_cache(page);
>+		/* Expedite swapcache_free() for __read_swap_cache_async() */
>+		preempt_disable();
> 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> 		swapcache_free(swap);
>+		preempt_enable();
> 	} else {
> 		void (*freepage)(struct page *);
> 		void *shadow = NULL;

Hi Hugh,

Thanks for affirming the approach of disabling preemption, and for pointing
out the incompleteness: delete_from_swap_cache() needs the same protection.

I'm curious, though: why not put swapcache_free(swap) under the protection
of mapping->tree_lock?
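
Something like this, say (an untested sketch against 4.9's __remove_mapping(),
only to show what I mean, not a patch I have tried):

	if (PageSwapCache(page)) {
		swp_entry_t swap = { .val = page_private(page) };
		mem_cgroup_swapout(page, swap);
		__delete_from_swap_cache(page);
		/*
		 * Free the swap slot before dropping tree_lock, so that this
		 * task cannot be preempted while still holding SWAP_HAS_CACHE,
		 * instead of disabling preemption across the unlock.
		 */
		swapcache_free(swap);
		spin_unlock_irqrestore(&mapping->tree_lock, flags);
	} else {
		/* ... unchanged ... */
	}

That would close the window between __delete_from_swap_cache() and
swapcache_free() without touching preemption, at the cost of holding
tree_lock, with interrupts disabled, a little longer.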