On 2021/4/14 11:07, Huang, Ying wrote: > Miaohe Lin <linmiaohe@xxxxxxxxxx> writes: > >> On 2021/4/13 9:27, Huang, Ying wrote: >>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes: >>> >>>> When I was investigating the swap code, I found the below possible race >>>> window: >>>> >>>> CPU 1 CPU 2 >>>> ----- ----- >>>> do_swap_page >>>> synchronous swap_readpage >>>> alloc_page_vma >>>> swapoff >>>> release swap_file, bdev, or ... >>>> swap_readpage >>>> check sis->flags is ok >>>> access swap_file, bdev...[oops!] >>>> si->flags = 0 >>>> >>>> Using current get/put_swap_device() to guard against concurrent swapoff for >>>> swap_readpage() looks terrible because swap_readpage() may take really long >>>> time. And this race may not be really pernicious because swapoff is usually >>>> done when system shutdown only. To reduce the performance overhead on the >>>> hot-path as much as possible, it appears we can use the percpu_ref to close >>>> this race window(as suggested by Huang, Ying). >>>> >>>> Fixes: 235b62176712 ("mm/swap: add cluster lock") >>> >>> This isn't the commit that introduces the race. You can use `git blame` >>> find out the correct commit. For this it's commit 0bcac06f27d7 "mm, >>> swap: skip swapcache for swapin of synchronous device". >>> >> >> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race between >> swapoff and some swap operations"). And I think this commit does not fix the race >> condition completely, so I reuse the Fixes tag inside it. >> >>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full >>> picture. >>> >>>> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx> >>>> --- >>>> include/linux/swap.h | 2 +- >>>> mm/memory.c | 10 ++++++++++ >>>> mm/swapfile.c | 28 +++++++++++----------------- >>>> 3 files changed, 22 insertions(+), 18 deletions(-) >>>> >>>> diff --git a/include/linux/swap.h b/include/linux/swap.h >>>> index 849ba5265c11..9066addb57fd 100644 >>>> --- a/include/linux/swap.h >>>> +++ b/include/linux/swap.h >>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page); >>>> >>>> static inline void put_swap_device(struct swap_info_struct *si) >>>> { >>>> - rcu_read_unlock(); >>>> + percpu_ref_put(&si->users); >>>> } >>>> >>>> #else /* CONFIG_SWAP */ >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index cc71a445c76c..8543c47b955c 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> { >>>> struct vm_area_struct *vma = vmf->vma; >>>> struct page *page = NULL, *swapcache; >>>> + struct swap_info_struct *si = NULL; >>>> swp_entry_t entry; >>>> pte_t pte; >>>> int locked; >>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> } >>>> >>>> >>> >>> I suggest to add comments here as follows (words copy from Matthew Wilcox) >>> >>> /* Prevent swapoff from happening to us */ >> >> Ok. >> >>> >>>> + si = get_swap_device(entry); >>>> + /* In case we raced with swapoff. */ >>>> + if (unlikely(!si)) >>>> + goto out; >>>> + >>> >>> Because we wrap the whole do_swap_page() with get/put_swap_device() >>> now. We can remove several get/put_swap_device() for function called by >>> do_swap_page(). That can be another optimization patch. >> >> I tried to remove several get/put_swap_device() for function called >> by do_swap_page() only before I send this series. But it seems they have >> other callers without proper get/put_swap_device(). > > Then we need to revise these callers instead. Anyway, can be another > series. Yes. can be another series. Thanks. > > Best Regards, > Huang, Ying > > . >