Ryan Roberts <ryan.roberts@xxxxxxx> writes: > There was previously a theoretical window where swapoff() could run and > teardown a swap_info_struct while a call to free_swap_and_cache() was > running in another thread. This could cause, amongst other bad > possibilities, swap_page_trans_huge_swapped() (called by > free_swap_and_cache()) to access the freed memory for swap_map. > > This is a theoretical problem and I haven't been able to provoke it from > a test case. But there has been agreement based on code review that this > is possible (see link below). > > Fix it by using get_swap_device()/put_swap_device(), which will stall > swapoff(). There was an extra check in _swap_info_get() to confirm that > the swap entry was valid. This wasn't present in get_swap_device() so > I've added it. I couldn't find any existing get_swap_device() call sites > where this extra check would cause any false alarms. > > Details of how to provoke one possible issue (thanks to David Hilenbrand > for deriving this): > > --8<----- > > __swap_entry_free() might be the last user and result in > "count == SWAP_HAS_CACHE". > > swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0. > > So the question is: could someone reclaim the folio and turn > si->inuse_pages==0, before we completed swap_page_trans_huge_swapped(). > > Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are > still references by swap entries. > > Process 1 still references subpage 0 via swap entry. > Process 2 still references subpage 1 via swap entry. > > Process 1 quits. Calls free_swap_and_cache(). > -> count == SWAP_HAS_CACHE > [then, preempted in the hypervisor etc.] > > Process 2 quits. Calls free_swap_and_cache(). > -> count == SWAP_HAS_CACHE > > Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls > __try_to_reclaim_swap(). > > __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> > put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> > swap_entry_free()->swap_range_free()-> > ... > WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); > > What stops swapoff to succeed after process 2 reclaimed the swap cache > but before process1 finished its call to swap_page_trans_huge_swapped()? > > --8<----- I think that this can be simplified. Even for a 4K folio, this could happen. CPU0 CPU1 ---- ---- zap_pte_range free_swap_and_cache __swap_entry_free /* swap count become 0 */ swapoff try_to_unuse filemap_get_folio folio_free_swap /* remove swap cache */ /* free si->swap_map[] */ swap_page_trans_huge_swapped <-- access freed si->swap_map !!! > Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch") > Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@xxxxxxxxxx/ > Cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx> > --- > > Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4). > > Thanks, > Ryan > > mm/swapfile.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2b3a2d85e350..f580e6abc674 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry) > smp_rmb(); > offset = swp_offset(entry); > if (offset >= si->max) > - goto put_out; > + goto bad_offset; > + if (data_race(!si->swap_map[swp_offset(entry)])) > + goto bad_free; > > return si; > bad_nofile: > @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry) > out: > return NULL; > put_out: > - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); > percpu_ref_put(&si->users); > return NULL; > +bad_offset: > + pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); > + goto put_out; > +bad_free: > + pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); > + goto put_out; > } I don't think that it's a good idea to warn for bad free entries. get_swap_device() could be called without enough lock to prevent parallel swap operations on entry. So, false positive is possible there. I think that it's good to add more checks in general, for example, in free_swap_and_cache(), we can check more because we are sure the swap entry will not be freed by parallel swap operations. Just don't put the check in general get_swap_device(). We can add another helper to check that. I found that there are other checks in get_swap_device() already. I think that we may need to revert, commit 23b230ba8ac3 ("mm/swap: print bad swap offset entry in get_swap_device") which introduces it. And add check in appropriate places. -- Best Regards, Huang, Ying > static unsigned char __swap_entry_free(struct swap_info_struct *p, > @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry) > if (non_swap_entry(entry)) > return 1; > > - p = _swap_info_get(entry); > + p = get_swap_device(entry); > if (p) { > count = __swap_entry_free(p, entry); > if (count == SWAP_HAS_CACHE && > !swap_page_trans_huge_swapped(p, entry)) > __try_to_reclaim_swap(p, swp_offset(entry), > TTRS_UNMAPPED | TTRS_FULL); > + put_swap_device(p); > } > return p != NULL; > } > -- > 2.25.1