Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:

> On 2021/4/13 9:27, Huang, Ying wrote:
>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>>
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1                                CPU 2
>>> -----                                -----
>>> do_swap_page
>>>   synchronous swap_readpage
>>>     alloc_page_vma
>>>                                      swapoff
>>>                                        release swap_file, bdev, or ...
>>>   swap_readpage
>>>     check sis->flags is ok
>>>       access swap_file, bdev...[oops!]
>>>                                      si->flags = 0
>>>
>>> Using the current get/put_swap_device() to guard against concurrent
>>> swapoff for swap_readpage() looks terrible because swap_readpage() may
>>> take a really long time. And this race may not be really pernicious
>>> because swapoff is usually done only at system shutdown. To reduce the
>>> performance overhead on the hot path as much as possible, it appears we
>>> can use percpu_ref to close this race window (as suggested by Huang,
>>> Ying).
>>>
>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>
>> This isn't the commit that introduces the race. You can use `git blame`
>> to find out the correct commit. For this it's commit 0bcac06f27d7 ("mm,
>> swap: skip swapcache for swapin of synchronous device").
>>
>
> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix
> race between swapoff and some swap operations"). And I think this commit
> does not fix the race condition completely, so I reuse the Fixes tag
> from it.
>
>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>> picture.
>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
>>> ---
>>>  include/linux/swap.h |  2 +-
>>>  mm/memory.c          | 10 ++++++++++
>>>  mm/swapfile.c        | 28 +++++++++++-----------------
>>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 849ba5265c11..9066addb57fd 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>
>>>  static inline void put_swap_device(struct swap_info_struct *si)
>>>  {
>>> -	rcu_read_unlock();
>>> +	percpu_ref_put(&si->users);
>>>  }
>>>
>>>  #else /* CONFIG_SWAP */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index cc71a445c76c..8543c47b955c 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  {
>>>  	struct vm_area_struct *vma = vmf->vma;
>>>  	struct page *page = NULL, *swapcache;
>>> +	struct swap_info_struct *si = NULL;
>>>  	swp_entry_t entry;
>>>  	pte_t pte;
>>>  	int locked;
>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  	}
>>>
>>
>> I suggest adding a comment here as follows (words copied from Matthew
>> Wilcox):
>>
>> /* Prevent swapoff from happening to us */
>
> Ok.
>
>>
>>> +	si = get_swap_device(entry);
>>> +	/* In case we raced with swapoff. */
>>> +	if (unlikely(!si))
>>> +		goto out;
>>> +
>>
>> Because we wrap the whole of do_swap_page() with get/put_swap_device()
>> now, we can remove several get/put_swap_device() calls from functions
>> called by do_swap_page(). That can be another optimization patch.
>
> I tried to remove several get/put_swap_device() calls from functions
> called by do_swap_page() before I sent this series. But it seems they
> have other callers without proper get/put_swap_device().

Then we need to revise those callers instead. Anyway, that can be another
series.

Best Regards,
Huang, Ying