On 05/03/2024 08:35, David Hildenbrand wrote:
> On 05.03.24 07:11, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>
>>> + Hugh
>>>
>>> On 04/03/2024 22:02, David Hildenbrand wrote:
>>>> On 04.03.24 22:55, Ryan Roberts wrote:
>>>>> On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get()
>>>>>>>>> would break if this could race with swapoff(), and __swap_entry_free()
>>>>>>>>> looks up the cluster from an array, which would also be freed by
>>>>>>>>> swapoff if racing:
>>>>>>>>>
>>>>>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>>>>>> {
>>>>>>>>>         struct swap_info_struct *p;
>>>>>>>>>         unsigned char count;
>>>>>>>>>
>>>>>>>>>         if (non_swap_entry(entry))
>>>>>>>>>                 return 1;
>>>>>>>>>
>>>>>>>>>         p = _swap_info_get(entry);
>>>>>>>>>         if (p) {
>>>>>>>>>                 count = __swap_entry_free(p, entry);
>>>>>>>>
>>>>>>>> If count dropped to 0 and
>>>>>>>>
>>>>>>>>>                 if (count == SWAP_HAS_CACHE)
>>>>>>>>
>>>>>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore.
>>>>>>>> We removed it. That one would have to be reclaimed asynchronously.
>>>>>>>>
>>>>>>>> The existing code would call swap_page_trans_huge_swapped() with the
>>>>>>>> SI it obtained via _swap_info_get().
>>>>>>>>
>>>>>>>> I also don't see what should be left protecting the SI. It's not
>>>>>>>> locked anymore, the swapcounts are at 0. We don't hold the folio lock.
>>>>>>>>
>>>>>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>>>>>>
>>>>>>> But, assuming the caller of free_swap_and_cache() acquires the PTL
>>>>>>> first, I think this all works out ok? While free_swap_and_cache() is
>>>>>>> running, try_to_unuse() will wait for the PTL. Or if try_to_unuse()
>>>>>>> runs first, then free_swap_and_cache() will never be called because
>>>>>>> the swap entry will have been removed from the PTE?
>>>>>>
>>>>>> But can't try_to_unuse() run, detect !si->inuse_pages and not even
>>>>>> bother about scanning any further page tables?
>>>>>>
>>>>>> But my head hurts from digging through that code.
>>>>>
>>>>> Yep, glad I'm not the only one that gets headaches from swapfile.c.
>>>>>
>>>>>> Let me try again:
>>>>>>
>>>>>> __swap_entry_free() might be the last user and result in
>>>>>> "count == SWAP_HAS_CACHE".
>>>>>>
>>>>>> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>>>>>>
>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>> si->inuse_pages==0 before we completed swap_page_trans_huge_swapped()?
>>>>>>
>>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages
>>>>>> are still referenced by swap entries.
>>>>>>
>>>>>> Process 1 still references subpage 0 via swap entry.
>>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>>
>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>
>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>
>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>> __try_to_reclaim_swap().
>>>>>>
>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>>>> swap_entry_free()->swap_range_free()-> ...
>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>
>>>>>> What stops swapoff from succeeding after process 2 reclaimed the swap
>>>>>> cache but before process 1 finished its call to
>>>>>> swap_page_trans_huge_swapped()?
>>>>>
>>>>> Assuming you are talking about anonymous memory, process 1 has the PTL
>>>>> while it's executing free_swap_and_cache(). try_to_unuse() iterates over
>>>>> every vma in every mm, and it swaps-in a page for every PTE that holds a
>>>>> swap entry for the device being swapoff'ed. It takes the PTL while
>>>>> converting the swap entry to a present PTE - see unuse_pte(). Process 1
>>>>> must have beaten try_to_unuse() to the particular pte, because if
>>>>> try_to_unuse() got there first, it would have converted it from a swap
>>>>> entry to a present pte and process 1 would never even have called
>>>>> free_swap_and_cache(). So try_to_unuse() will eventually wait on the PTL
>>>>> until process 1 has released it after free_swap_and_cache() completes.
>>>>> Am I missing something? Because that part feels pretty clear to me.
>>>>
>>>> Why should try_to_unuse() do *anything* if it already finds
>>>> si->inuse_pages == 0 because we (p1 + p2) just freed the swap entries and
>>>> process 2 managed to free the last remaining swapcache entry?
>>>
>>> Yeah ok. For some reason I thought unuse_mm() was iterating over all mms
>>> and so the `while (READ_ONCE(si->inuse_pages))` was only evaluated after
>>> iterating over every mm. Oops.
>>>
>>> So yes, I agree with you; I think this is broken. And I'm a bit worried
>>> this could be a can of worms; by the same logic, I think
>>> folio_free_swap(), swp_swapcount() and probably others are broken in the
>>> same way.
>>
>> Don't worry too much :-), we have get_swap_device() at least. We can
>> insert it anywhere we want because it's quite lightweight. And, because
>> swapoff() is so rare, the race is theoretical only.

Thanks for the response!

>> For this specific case, I had thought that the PTL is enough. But after
>> looking at this more, I found a race here too. Until __swap_entry_free()
>> returns, we are OK; nobody can reduce the swap count because we hold the
>> PTL.

Even that is not true for the shmem case: as far as I can see, shmem doesn't
hold the PTL or any other synchronizing lock when it calls
free_swap_and_cache(). I don't think that changes anything solution-wise,
though.

>> But, after that, even if its return value is SWAP_HAS_CACHE (that is, in
>> swap cache), a parallel swap_unuse() or __try_to_reclaim_swap() may remove
>> the folio from the swap cache and so free the swap entry. So, swapoff()
>> can proceed to free the data structures in parallel.
>>
>> To fix the race, we can add get/put_swap_device() in
>> free_swap_and_cache().
>>
>> For other places, we can check whether get/put_swap_device() has been
>> called in callers, and the swap reference we held has been decreased
>> (e.g., swap count protected by PTL, SWAP_HAS_CACHE protected by folio
>> lock).
>
> Yes, sounds reasonable. We should likely update the documentation of
> get_swap_device(), that after decrementing the refcount, the SI might
> become stale and should not be touched without a prior get_swap_device().

Yep, agreed. If nobody else is planning to do it, I'll try to create a test
case that provokes the problem, then put a patch together to fix it.
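For the record, a rough, untested sketch of the shape I think that patch would
take, following Ying's suggestion of pinning the swap_info_struct with
get/put_swap_device() for the duration of the call. (Note that IIRC
get_swap_device() doesn't do the !swap_map[offset] sanity check that
_swap_info_get() does, so a real patch may need a bit more care than this.)

int free_swap_and_cache(swp_entry_t entry)
{
        struct swap_info_struct *p;
        unsigned char count;

        if (non_swap_entry(entry))
                return 1;

        /*
         * Pin the swap device; swapoff() cannot free the swap_info_struct
         * (or its cluster array) until the reference is dropped below, so
         * it remains safe to dereference p even after __swap_entry_free()
         * has dropped the swap count to zero.
         */
        p = get_swap_device(entry);
        if (p) {
                count = __swap_entry_free(p, entry);
                if (count == SWAP_HAS_CACHE &&
                    !swap_page_trans_huge_swapped(p, entry))
                        __try_to_reclaim_swap(p, swp_offset(entry),
                                              TTRS_UNMAPPED | TTRS_FULL);
                put_swap_device(p);
        }
        return p != NULL;
}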