Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:

> On 2021/4/13 9:27, Huang, Ying wrote:
>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>>
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1                                CPU 2
>>> -----                                -----
>>> do_swap_page
>>>   synchronous swap_readpage
>>>     alloc_page_vma
>>>                                      swapoff
>>>                                        release swap_file, bdev, or ...
>>>   swap_readpage
>>>     check sis->flags is ok
>>>       access swap_file, bdev...[oops!]
>>>                                      si->flags = 0
>>>
>>> Using the current get/put_swap_device() to guard against concurrent
>>> swapoff for swap_readpage() looks terrible because swap_readpage() may
>>> take a really long time. And this race may not be really pernicious
>>> because swapoff is usually done only at system shutdown. To reduce the
>>> performance overhead on the hot path as much as possible, it appears we
>>> can use percpu_ref to close this race window (as suggested by Huang,
>>> Ying).
>>>
>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>
>> This isn't the commit that introduces the race. You can use `git blame`
>> to find out the correct commit. For this it's commit 0bcac06f27d7 ("mm,
>> swap: skip swapcache for swapin of synchronous device").
>>
>
> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix
> race between swapoff and some swap operations"). And I think this commit
> does not fix the race condition completely, so I reuse the Fixes tag
> from it.
>
>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>> picture.
>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
>>> ---
>>>  include/linux/swap.h |  2 +-
>>>  mm/memory.c          | 10 ++++++++++
>>>  mm/swapfile.c        | 28 +++++++++++-----------------
>>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 849ba5265c11..9066addb57fd 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>
>>>  static inline void put_swap_device(struct swap_info_struct *si)
>>>  {
>>> -	rcu_read_unlock();
>>> +	percpu_ref_put(&si->users);
>>>  }
>>>
>>>  #else /* CONFIG_SWAP */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index cc71a445c76c..8543c47b955c 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  {
>>>  	struct vm_area_struct *vma = vmf->vma;
>>>  	struct page *page = NULL, *swapcache;
>>> +	struct swap_info_struct *si = NULL;
>>>  	swp_entry_t entry;
>>>  	pte_t pte;
>>>  	int locked;
>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  	}
>>>
>>
>> I suggest adding a comment here as follows (words copied from Matthew
>> Wilcox):
>>
>> /* Prevent swapoff from happening to us */
>
> Ok.
>
>>
>>> +	si = get_swap_device(entry);
>>> +	/* In case we raced with swapoff. */
>>> +	if (unlikely(!si))
>>> +		goto out;
>>> +
>>
>> Because we wrap the whole of do_swap_page() with get/put_swap_device()
>> now, we can remove several get/put_swap_device() calls from functions
>> called by do_swap_page(). That can be another optimization patch.
>
> I tried to remove several get/put_swap_device() calls from functions
> called by do_swap_page() before I sent this series. But it seems they
> have other callers without proper get/put_swap_device().

Then we need to revise those callers instead. Anyway, that can be another
series.

Best Regards,
Huang, Ying