+ Hugh On 04/03/2024 22:02, David Hildenbrand wrote: > On 04.03.24 22:55, Ryan Roberts wrote: >> On 04/03/2024 20:50, David Hildenbrand wrote: >>>>>> >>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would >>>>>> break >>>>>> if this could race with swapoff(), and __swap_entry_free() looks up the >>>>>> cluster >>>>>> from an array, which would also be freed by swapoff if racing: >>>>>> >>>>>> int free_swap_and_cache(swp_entry_t entry) >>>>>> { >>>>>> struct swap_info_struct *p; >>>>>> unsigned char count; >>>>>> >>>>>> if (non_swap_entry(entry)) >>>>>> return 1; >>>>>> >>>>>> p = _swap_info_get(entry); >>>>>> if (p) { >>>>>> count = __swap_entry_free(p, entry); >>>>> >>>>> If count dropped to 0 and >>>>> >>>>>> if (count == SWAP_HAS_CACHE) >>>>> >>>>> >>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We >>>>> removed >>>>> it. That one would have to be reclaimed asynchronously. >>>>> >>>>> The existing code we would call swap_page_trans_huge_swapped() with the SI it >>>>> obtained via _swap_info_get(). >>>>> >>>>> I also don't see what should be left protecting the SI. It's not locked >>>>> anymore, >>>>> the swapcounts are at 0. We don't hold the folio lock. >>>>> >>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ... >>>> >>>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I >>>> think this all works out ok? While free_swap_and_cache() is running, >>>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then >>>> free_swap_and_cache() will never be called because the swap entry will have >>>> been >>>> removed from the PTE? >>> >>> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother about >>> scanning any further page tables? >>> >>> But my head hurts from digging through that code. >> >> Yep, glad I'm not the only one that gets headaches from swapfile.c. >> >>> >>> Let me try again: >>> >>> __swap_entry_free() might be the last user and result in "count == >>> SWAP_HAS_CACHE". >>> >>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0. >>> >>> >>> So the question is: could someone reclaim the folio and turn si->inuse_pages==0, >>> before we completed swap_page_trans_huge_swapped(). >>> >>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still >>> references by swap entries. >>> >>> Process 1 still references subpage 0 via swap entry. >>> Process 2 still references subpage 1 via swap entry. >>> >>> Process 1 quits. Calls free_swap_and_cache(). >>> -> count == SWAP_HAS_CACHE >>> [then, preempted in the hypervisor etc.] >>> >>> Process 2 quits. Calls free_swap_and_cache(). >>> -> count == SWAP_HAS_CACHE >>> >>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls >>> __try_to_reclaim_swap(). >>> >>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()-> >>> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()-> >>> ... >>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); >>> >>> >>> What stops swapoff to succeed after process 2 reclaimed the swap cache but >>> before process 1 finished its call to swap_page_trans_huge_swapped()? >> >> Assuming you are talking about anonymous memory, process 1 has the PTL while >> it's executing free_swap_and_cache(). try_to_unuse() iterates over every vma in >> every mm, and it swaps-in a page for every PTE that holds a swap entry for the >> device being swapoff'ed. It takes the PTL while converting the swap entry to >> present PTE - see unuse_pte(). Process 1 must have beaten try_to_unuse() to the >> particular pte, because if try_to_unuse() got there first, it would have >> converted it from a swap entry to present pte and process 1 would never even >> have called free_swap_and_cache(). So try_to_unuse() will eventually wait on the >> PTL until process 1 has released it after free_swap_and_cache() completes. Am I >> missing something? Because that part feels pretty clear to me. > > Why should try_to_unuse() do *anything* if it already finds > si->inuse_pages == 0 because we (p1 } p2) just freed the swapentries and process > 2 managed to free the last remaining swapcache entry? Yeah ok. For some reason I thought unuse_mm() was iterating over all mms and so the `while (READ_ONCE(si->inuse_pages))` was only evaluated after iterating over every mm. Oops. So yes, I agree with you; I think this is broken. And I'm a bit worried this could be a can of worms; By the same logic, I think folio_free_swap(), swp_swapcount() and probably others are broken in the same way. I wonder if we are missing something here? I've added Hugh - I see he has a lot of commits in this area, perhaps he has some advice? Thanks, Ryan > > I'm probably missing something important :) > > try_to_unuse() really starts with > > if (!READ_ONCE(si->inuse_pages)) > goto success; >