Re: [PATCH v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

Aaron Lu <aaron.lu@xxxxxxxxx> · Thu, 6 Apr 2023 22:57:54 +0800

On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote:
> On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
> > The si->lock must be held when deleting the si from
> > the available list.  Otherwise, another thread can
> > re-add the si to the available list, which can lead
> > to memory corruption. The only place we have found
> > where this happens is in the swapoff path. This case
> > can be described as below:
> > 
> > core 0                       core 1
> > swapoff
> > 
> > del_from_avail_list(si)      waiting
> > 
> > try lock si->lock            acquire swap_avail_lock
> >                              and re-add si into
> >                              swap_avail_head
> 
>                                confused here.
> 
> If del_from_avail_list(si) finished in swapoff path, then this si should
> not exist in any of the per-node avail list and core 1 should not be
> able to re-add it.

I think a possible sequence could be like this:

cpuX                             cpuY
swapoff                          put_swap_folio()

del_from_avail_list(si)
                                 taken si->lock
spin_lock(&si->lock);

                                 swap_range_free()
                                 was_full && SWP_WRITEOK -> re-add!
                                 drop si->lock

taken si->lock
proceed removing si

End result: si left on avail_list after being swapped off.

The problem is that add_to_avail_list() has no idea this si is being
swapped off. Taking si->lock before calling del_from_avail_list() avoids
this problem, so I think this patch does the right thing, but the
changelog's description of how this happens needs updating. After that:

Reviewed-by: Aaron Lu <aaron.lu@xxxxxxxxx>

Thanks,
Aaron

> 
> I stared at the code for a while and couldn't figure out how this
> happened, will continue to look at this tomorrow.
> > 
> > acquire si->lock but
> > miss that si has already
> > been added again, and continue
> > to clear SWP_WRITEOK, etc.
> > 
> > A large number of warning messages can easily be
> > triggered inside get_swap_pages() in some special
> > cases, for example by calling madvise(MADV_PAGEOUT) on
> > blocks of touched memory concurrently while running
> > many swapon/swapoff operations (e.g. stress-ng-swap).
> > 
> > However, in the worst case, the above scenario can
> > cause a panic. In swapoff(), the memory used by si can
> > be kept in swap_info[] after a swap is turned off. This
> > means the memory corruption does not show up immediately,
> > only once the si is allocated and reset for a new swap in
> > the swapon path. A panic message that was triggered
> > (with CONFIG_PLIST_DEBUG enabled):
> > 
> > ------------[ cut here ]------------
> > top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
> > prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
> > next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
> > WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
> > Modules linked in: rfkill(E) crct10dif_ce(E)...
> > CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
> > Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
> > pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> > pc : plist_check_prev_next_node+0x50/0x70
> > lr : plist_check_prev_next_node+0x50/0x70
> > sp : ffff0018009d3c30
> > x29: ffff0018009d3c40 x28: ffff800011b32a98
> > x27: 0000000000000000 x26: ffff001803908000
> > x25: ffff8000128ea088 x24: ffff800011b32a48
> > x23: 0000000000000028 x22: ffff001800875c00
> > x21: ffff800010f9e520 x20: ffff001800875c00
> > x19: ffff001800fdc6e0 x18: 0000000000000030
> > x17: 0000000000000000 x16: 0000000000000000
> > x15: 0736076307640766 x14: 0730073007380731
> > x13: 0736076307640766 x12: 0730073007380731
> > x11: 000000000004058d x10: 0000000085a85b76
> > x9 : ffff8000101436e4 x8 : ffff800011c8ce08
> > x7 : 0000000000000000 x6 : 0000000000000001
> > x5 : ffff0017df9ed338 x4 : 0000000000000001
> > x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
> > x1 : 0000000000000000 x0 : 0000000000000000
> > Call trace:
> >  plist_check_prev_next_node+0x50/0x70
> >  plist_check_head+0x80/0xf0
> >  plist_add+0x28/0x140
> >  add_to_avail_list+0x9c/0xf0
> >  _enable_swap_info+0x78/0xb4
> >  __do_sys_swapon+0x918/0xa10
> >  __arm64_sys_swapon+0x20/0x30
> >  el0_svc_common+0x8c/0x220
> >  do_el0_svc+0x2c/0x90
> >  el0_svc+0x1c/0x30
> >  el0_sync_handler+0xa8/0xb0
> >  el0_sync+0x148/0x180
> > irq event stamp: 2082270
> > 
> > Now, si->lock is taken before calling 'del_from_avail_list()'
> > to make sure other threads see the si deleted and
> > SWP_WRITEOK cleared together, and will not reinsert it.
> > 
> > This problem exists in versions after stable 5.10.y.
> > 
> > Cc: stable@xxxxxxxxxxxxxxx
> > Tested-by: Yongchen Yin <wb-yyc939293@xxxxxxxxxxxxxxx>
> > Signed-off-by: Rongwei Wang <rongwei.wang@xxxxxxxxxxxxxxxxx>
> > ---
> >  mm/swapfile.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 62ba2bf577d7..2c718f45745f 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
> >  {
> >  	int nid;
> >  
> > +	assert_spin_locked(&p->lock);
> >  	for_each_node(nid)
> >  		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
> >  }
> > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >  		spin_unlock(&swap_lock);
> >  		goto out_dput;
> >  	}
> > -	del_from_avail_list(p);
> >  	spin_lock(&p->lock);
> > +	del_from_avail_list(p);
> >  	if (p->prio < 0) {
> >  		struct swap_info_struct *si = p;
> >  		int nid;
> > -- 
> > 2.27.0
> >