Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

Miaohe Lin <linmiaohe@xxxxxxxxxx> · Wed, 14 Apr 2021 11:27:21 +0800

On 2021/4/14 11:07, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
> 
>> On 2021/4/13 9:27, Huang, Ying wrote:
>>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>>>
>>>> When I was investigating the swap code, I found the below possible race
>>>> window:
>>>>
>>>> CPU 1					CPU 2
>>>> -----					-----
>>>> do_swap_page
>>>>   synchronous swap_readpage
>>>>     alloc_page_vma
>>>> 					swapoff
>>>> 					  release swap_file, bdev, or ...
>>>>       swap_readpage
>>>> 	check sis->flags is ok
>>>> 	  access swap_file, bdev...[oops!]
>>>> 					    si->flags = 0
>>>>
>>>> Using current get/put_swap_device() to guard against concurrent swapoff for
>>>> swap_readpage() looks terrible because swap_readpage() may take really long
>>>> time. And this race may not be really pernicious because swapoff is usually
>>>> done when system shutdown only. To reduce the performance overhead on the
>>>> hot-path as much as possible, it appears we can use the percpu_ref to close
>>>> this race window(as suggested by Huang, Ying).
>>>>
>>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>>
>>> This isn't the commit that introduces the race.  You can use `git blame`
>>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>
>> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race between
>> swapoff and some swap operations"). And I think this commit does not fix the race
>> condition completely, so I reuse the Fixes tag inside it.
>>
>>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
>>>> ---
>>>>  include/linux/swap.h |  2 +-
>>>>  mm/memory.c          | 10 ++++++++++
>>>>  mm/swapfile.c        | 28 +++++++++++-----------------
>>>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 849ba5265c11..9066addb57fd 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>>  
>>>>  static inline void put_swap_device(struct swap_info_struct *si)
>>>>  {
>>>> -	rcu_read_unlock();
>>>> +	percpu_ref_put(&si->users);
>>>>  }
>>>>  
>>>>  #else /* CONFIG_SWAP */
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index cc71a445c76c..8543c47b955c 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  {
>>>>  	struct vm_area_struct *vma = vmf->vma;
>>>>  	struct page *page = NULL, *swapcache;
>>>> +	struct swap_info_struct *si = NULL;
>>>>  	swp_entry_t entry;
>>>>  	pte_t pte;
>>>>  	int locked;
>>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	}
>>>>  
>>>>
>>>
>>> I suggest to add comments here as follows (words copy from Matthew Wilcox)
>>>
>>> 	/* Prevent swapoff from happening to us */
>>
>> Ok.
>>
>>>
>>>> +	si = get_swap_device(entry);
>>>> +	/* In case we raced with swapoff. */
>>>> +	if (unlikely(!si))
>>>> +		goto out;
>>>> +
>>>
>>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>>> now.  We can remove several get/put_swap_device() for function called by
>>> do_swap_page().  That can be another optimization patch.
>>
>> I tried to remove several get/put_swap_device() for function called
>> by do_swap_page() only before I send this series. But it seems they have
>> other callers without proper get/put_swap_device().
> 
> Then we need to revise these callers instead.  Anyway, can be another
> series.

Yes. can be another series.
Thanks.

> 
> Best Regards,
> Huang, Ying
> 
> .
>