On 05/03/2024 15:50, David Hildenbrand wrote: > On 05.03.24 16:13, Ryan Roberts wrote: >> There was previously a theoretical window where swapoff() could run and >> teardown a swap_info_struct while a call to free_swap_and_cache() was >> running in another thread. This could cause, amongst other bad >> possibilities, swap_page_trans_huge_swapped() (called by >> free_swap_and_cache()) to access the freed memory for swap_map. >> >> This is a theoretical problem and I haven't been able to provoke it from >> a test case. But there has been agreement based on code review that this >> is possible (see link below). >> >> Fix it by using get_swap_device()/put_swap_device(), which will stall >> swapoff(). There was an extra check in _swap_info_get() to confirm that >> the swap entry was valid. This wasn't present in get_swap_device() so >> I've added it. I couldn't find any existing get_swap_device() call sites >> where this extra check would cause any false alarms. >> >> Details of how to provoke one possible issue (thanks to David Hilenbrand >> for deriving this): > > Almost > > "s/Hilenbrand/Hildenbrand/" :) Ahh sorry... I even explicitly checked it against your name on emails... fat fingers... > >> >> --8<----- >> >> __swap_entry_free() might be the last user and result in >> "count == SWAP_HAS_CACHE". >> >> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0. >> >> So the question is: could someone reclaim the folio and turn >> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped(). >> >> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are >> still references by swap entries. >> >> Process 1 still references subpage 0 via swap entry. >> Process 2 still references subpage 1 via swap entry. >> >> Process 1 quits. Calls free_swap_and_cache(). >> -> count == SWAP_HAS_CACHE >> [then, preempted in the hypervisor etc.] >> >> Process 2 quits. Calls free_swap_and_cache(). >> -> count == SWAP_HAS_CACHE >> >> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls >> __try_to_reclaim_swap(). >> >> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> >> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> >> swap_entry_free()->swap_range_free()-> >> ... >> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); >> >> What stops swapoff to succeed after process 2 reclaimed the swap cache >> but before process1 finished its call to swap_page_trans_huge_swapped()? >> >> --8<----- >> >> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch") >> Closes: >> https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@xxxxxxxxxx/ >> Cc: stable@xxxxxxxxxxxxxxx >> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx> >> --- >> >> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4). >> >> Thanks, >> Ryan >> >> mm/swapfile.c | 14 +++++++++++--- >> 1 file changed, 11 insertions(+), 3 deletions(-) >> >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index 2b3a2d85e350..f580e6abc674 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry) >> smp_rmb(); >> offset = swp_offset(entry); >> if (offset >= si->max) >> - goto put_out; >> + goto bad_offset; >> + if (data_race(!si->swap_map[swp_offset(entry)])) >> + goto bad_free; >> >> return si; >> bad_nofile: >> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t >> entry) >> out: >> return NULL; >> put_out: >> - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); >> percpu_ref_put(&si->users); >> return NULL; >> +bad_offset: >> + pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); >> + goto put_out; >> +bad_free: >> + pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); >> + goto put_out; >> } >> >> static unsigned char __swap_entry_free(struct swap_info_struct *p, >> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry) >> if (non_swap_entry(entry)) >> return 1; >> >> - p = _swap_info_get(entry); >> + p = get_swap_device(entry); >> if (p) { >> count = __swap_entry_free(p, entry); >> if (count == SWAP_HAS_CACHE && >> !swap_page_trans_huge_swapped(p, entry)) >> __try_to_reclaim_swap(p, swp_offset(entry), >> TTRS_UNMAPPED | TTRS_FULL); >> + put_swap_device(p); >> } >> return p != NULL; >> } >> -- >> 2.25.1 >> > > LGTM > > Are you planning on sending a doc extension for get_swap_device()? I saw your comment: We should likely update the documentation of get_swap_device(), that after decrementing the refcount, the SI might become stale and should not be touched without a prior get_swap_device(). But when I went to make the changes, I saw the documentation already said: ...we need to enclose all swap related functions with get_swap_device() and put_swap_device()... Notice that swapoff ... can still happen before the percpu_ref_tryget_live() in get_swap_device() or after the percpu_ref_put() in put_swap_device()... The caller must be prepared for that. I thought that already covered it? I'm sure as usual, I've misunderstood your point. Happy to respin if you have something in mind?