Hello
I have fix up some stuff base on Patch v1. And in order to help all
readers and reviewers to
reproduce this bug, share a reproducer here:
swap_bomb.sh
#!/usr/bin/env bash
stress-ng -a 1 --class vm -t 12h --metrics --times -x
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
--verify -v
madvise_shared.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#define MSIZE (1024 * 1024 * 2)
int main()
{
char *shm_addr;
unsigned long i;
while (1) {
// Map shared memory segment
shm_addr =
mmap(NULL, MSIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (shm_addr == MAP_FAILED) {
perror("Failed to map shared memory segment");
exit(EXIT_FAILURE);
}
for (i = 0; i < MSIZE; i++) {
shm_addr[i] = 1;
}
// Advise kernel on usage pattern of shared memory
if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
perror
("Failed to advise kernel on shared memory
usage");
exit(EXIT_FAILURE);
}
for (i = 0; i < MSIZE; i++) {
shm_addr[i] = 1;
}
// Advise kernel on usage pattern of shared memory
if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
perror
("Failed to advise kernel on shared memory
usage");
exit(EXIT_FAILURE);
}
// Use shared memory
printf("Hello, shared memory: 0x%lx\n", shm_addr);
// Unmap shared memory segment
if (munmap(shm_addr, MSIZE) == -1) {
perror("Failed to unmap shared memory segment");
exit(EXIT_FAILURE);
}
}
return 0;
}
The bug will reproduce more quickly (about 2~5 minutes) if concurrent
more swap_bomb.sh and madvise_shared.
Thanks.
change log:
v1 -> v2
* fix up some commits and add assert_spin_locked(&p->lock) inside
__delete_from_avail_list() (suggested by Matthew Wilcox and Bagas Sanjaya)
On 4/4/23 11:47 PM, Rongwei Wang wrote:
The si->lock must be held when deleting the si from
the available list. Otherwise, another thread can
re-add the si to the available list, which can lead
to memory corruption. The only place we have found
where this happens is in the swapoff path. This case
can be described as below:
core 0 core 1
swapoff
del_from_avail_list(si) waiting
try lock si->lock acquire swap_avail_lock
and re-add si into
swap_avail_head
acquire si->lock but
missing si already be
added again, and continuing
to clear SWP_WRITEOK, etc.
It can be easily found a massive warning messages can
be triggered inside get_swap_pages() by some special
cases, for example, we call madvise(MADV_PAGEOUT) on
blocks of touched memory concurrently, meanwhile, run
much swapon-swapoff operations (e.g. stress-ng-swap).
However, in the worst case, panic can be caused by the
above scene. In swapoff(), the memory used by si could
be kept in swap_info[] after turning off a swap. This
means memory corruption will not be caused immediately
until allocated and reset for a new swap in the swapon
path. A panic message caused:
(with CONFIG_PLIST_DEBUG enabled)
------------[ cut here ]------------
top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
Modules linked in: rfkill(E) crct10dif_ce(E)...
CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : plist_check_prev_next_node+0x50/0x70
lr : plist_check_prev_next_node+0x50/0x70
sp : ffff0018009d3c30
x29: ffff0018009d3c40 x28: ffff800011b32a98
x27: 0000000000000000 x26: ffff001803908000
x25: ffff8000128ea088 x24: ffff800011b32a48
x23: 0000000000000028 x22: ffff001800875c00
x21: ffff800010f9e520 x20: ffff001800875c00
x19: ffff001800fdc6e0 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000
x15: 0736076307640766 x14: 0730073007380731
x13: 0736076307640766 x12: 0730073007380731
x11: 000000000004058d x10: 0000000085a85b76
x9 : ffff8000101436e4 x8 : ffff800011c8ce08
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0017df9ed338 x4 : 0000000000000001
x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
plist_check_prev_next_node+0x50/0x70
plist_check_head+0x80/0xf0
plist_add+0x28/0x140
add_to_avail_list+0x9c/0xf0
_enable_swap_info+0x78/0xb4
__do_sys_swapon+0x918/0xa10
__arm64_sys_swapon+0x20/0x30
el0_svc_common+0x8c/0x220
do_el0_svc+0x2c/0x90
el0_svc+0x1c/0x30
el0_sync_handler+0xa8/0xb0
el0_sync+0x148/0x180
irq event stamp: 2082270
Now, si->lock locked before calling 'del_from_avail_list()'
to make sure other thread see the si had been deleted
and SWP_WRITEOK cleared together, will not reinsert again.
This problem exists in versions after stable 5.10.y.
Cc: stable@xxxxxxxxxxxxxxx
Tested-by: Yongchen Yin <wb-yyc939293@xxxxxxxxxxxxxxx>
Signed-off-by: Rongwei Wang <rongwei.wang@xxxxxxxxxxxxxxxxx>
---
mm/swapfile.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 62ba2bf577d7..2c718f45745f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
{
int nid;
+ assert_spin_locked(&p->lock);
for_each_node(nid)
plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
}
@@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
spin_unlock(&swap_lock);
goto out_dput;
}
- del_from_avail_list(p);
spin_lock(&p->lock);
+ del_from_avail_list(p);
if (p->prio < 0) {
struct swap_info_struct *si = p;
int nid;