On 4/8/21 6:08 AM, Miaohe Lin wrote:
> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1                           CPU 2
> -----                           -----
> do_swap_page
>   synchronous swap_readpage
>     alloc_page_vma
>                                 swapoff
>                                   release swap_file, bdev, or ...

Perhaps I'm missing something. The release of swap_file, bdev, etc. happens
after we have cleared the SWP_VALID bit in si->flags in destroy_swap_extents(),
if I read the swapoff code correctly.

>     swap_readpage
>       check sis->flags is ok
>         access swap_file, bdev...[oops!]
>                                   si->flags = 0

This happens after we clear si->flags:
                                  synchronize_rcu()
                                  release swap_file, bdev, in destroy_swap_extents()

So I think if we have get_swap_device/put_swap_device in do_swap_page, it
should fix the race you've pointed out here. Then synchronize_rcu() will wait
till we have completed do_swap_page and called put_swap_device.

> Using current get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take really long
> time. And this race may not be really pernicious because swapoff is usually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window (as suggested by Huang, Ying).

I think it is better to break this patch into two. One patch is to fix the
race in do_swap_page and swapoff by adding get_swap_device/put_swap_device in
do_swap_page. The second patch is to modify get_swap_device and
put_swap_device to use percpu_ref.

But swapoff is a relatively rare event. I am not sure making the percpu_ref
change for performance is really beneficial. Did you encounter a real use case
where you saw a problem with swapoff? The delay in swapoff is primarily in
try_to_unuse(), bringing all the swapped-out pages back into memory.
Synchronizing with other CPUs for paging in is probably a small component in
the overall scheme of things.

Thanks.

Tim