Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop


 



> I'm not understanding why swapcache_prepare() repeatedly returns
-EEXIST in this situation?

We hit this case, which left the system stuck, on several of our arm64 PCs, and we used sysrq to grab a vmcore on a 5.4.18 kernel. We later reproduced the same case on 5.11.0. From the vmcore analysis, the latest kernel code still has this logic.

Our analysis steps are as follows:
1. We found a large number of tasks in D state, with kernel stacks almost all stuck in io_schedule after rq_qos_wait. Examining the rq_wb structure, we found that inflight in rq_wait[0] is 1, i.e. one request has never completed, so blk-wbt is blocking the I/O of the other tasks.

crash> ps -S
  RU: 5
  IN: 468
  ID: 65
  UN: 117
  ZO: 4
 
  ==============
  rq_wait = {{
wait = {

inflight = {
  counter = 1
}

2. We tracked down that the request was plugged in the blk_plug of task 938.

crash> ps -A
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
>     0      0   1  ffffffa0f5e0b4c0  RU   0.0       0      0  [swapper/1]
>     0      0   2  ffffffa0f5ec0040  RU   0.0       0      0  [swapper/2]
>     0      0   3  ffffffa0f5ec6940  RU   0.0       0      0  [swapper/3]
>   938    929   0  ffffffa0eb00c640  RU   0.0 2698236  15092  qaxsafed
crash> task 938 | grep TASK
PID: 938    TASK: ffffffa0eb00c640  CPU: 0   COMMAND: "qaxsafed"
crash> struct task_struct ffffffa0eb00c640 | grep plug
  plug = 0xffffffa0e918fbe8,
crash> struct blk_plug 0xffffffa0e918fbe8
struct blk_plug {
  mq_list = {
    next = 0xffffffa0e7f81a88,
    prev = 0xffffffa0e7f81a88
  },
  cb_list = {
    next = 0xffffffa0e918fbf8,
    prev = 0xffffffa0e918fbf8
  },
  rq_count = 1,
  multiple_queues = false
}

3. We looked at the allocation time of the request and found it was allocated at ~1062 seconds and then added to the plug of task 938. Combined with the dmesg timestamps, the last scheduling time of task 938 was ~2183 seconds, and from the scheduling information of the other tasks on cpu0 we confirmed that scheduling itself was normal. So something is wrong if task 938 has not flushed the request out of its plug for that long.

crash> struct request ffffffa0e7f81a40 | grep start_time_ns
  start_time_ns = 1062021311093,
  io_start_time_ns = 0,
crash> log | tail
[ 2183.803249]  do_swap_page+0x1e4/0x968
[ 2183.808028]  __handle_mm_fault+0xc18/0x10a8
[ 2183.813328]  handle_mm_fault+0x144/0x198
[ 2183.818368]  do_page_fault+0x2d0/0x518
[ 2183.823235]  do_translation_fault+0x3c/0x50
[ 2183.828535]  do_mem_abort+0x3c/0x98
[ 2183.833140]  el0_da+0x1c/0x20
[ 2183.837229] SMP: stopping secondary CPUs
[ 2183.842451] Starting crashdump kernel...
[ 2183.876226] Bye!
crash> task 938 | grep "exec_start\|sum_exec_runtime"
    exec_start = 2183587475680,
    sum_exec_runtime = 1120286568937,
    prev_sum_exec_runtime = 1120278576499,
  last_sum_exec_runtime = 0,

4. We analyzed the scheduling path: schedule()->sched_submit_work() decides based on task->state, so for a runnable task the requests in the plug are not flushed. That means the request data on the plug of a runnable process has to be flushed out by the process itself.

5. Combining the call stack of task 938 with the X0 register value saved in its interrupt context -- X0 holds the swp_entry_t argument passed to get_swap_device() -- we converted it to the entry value.

The get_swap_device() function and its saved context are listed below, along with a disassembly of get_swap_device():

struct swap_info_struct *get_swap_device(swp_entry_t entry)
====================================
#27 [ffffffa0e918fa90] el1_irq+0xb4 at ffffffc0100833b4
     PC: ffffffc01022df04  [get_swap_device+36]
     LR: ffffffc01022dfb4  [__swap_duplicate+28]
     SP: ffffffa0e918faa0  PSTATE: 20000005
    X29: ffffffa0e918faa0  X28: ffffffc01151e000  X27: 0000000000000000
    X26: ffffffa0f0d8cfc8  X25: 0000007f9e57a000  X24: 0000000000100cca
    X23: 0000000000000040  X22: 00000000001d1ca6  X21: 0000000000007b40
    X20: 00000000001d1ca6  X19: ffffffff00502800  X18: 0000000000000000
    X17: 0000000000000000  X16: 0000000000000000  X15: 0000000000000000
    X14: 0000000000000000  X13: 0000000000000001  X12: 0000000000000000
    X11: 0000000000000001  X10: 0000000000000bc0   X9: ffffffa0e918fa50
     X8: ffffffa0eb00d260   X7: ffffffa0e9875600   X6: 00000000ffffffff
     X5: 00000000000002af   X4: ffffffa0e918fa80   X3: 0000000000000002
     X2: 0000000000000000   X1: ffffffc0117caf60   X0: 00000000001d1ca6 <<<<---
    ffffffa0e918fa90: ffffffa0e918faa0 get_swap_device+36
#28 [ffffffa0e918faa0] get_swap_device+0x20 at ffffffc01022df00
    ffffffa0e918faa0: ffffffa0e918fab0 __swap_duplicate+28
#29 [ffffffa0e918fab0] __swap_duplicate+0x18 at ffffffc01022dfb0

============================
crash> dis -rx get_swap_device+36
0xffffffc01022dee0 <get_swap_device>: cbz x0, 0xffffffc01022df80 <get_swap_device+0xa0>
0xffffffc01022dee4 <get_swap_device+0x4>: adrp x1, 0xffffffc0117ca000 <memblock_reserved_init_regions+0x1500>
0xffffffc01022dee8 <get_swap_device+0x8>: add x1, x1, #0xf60
0xffffffc01022deec <get_swap_device+0xc>: stp x29, x30, [sp,#-16]!
0xffffffc01022def0 <get_swap_device+0x10>: lsr x2, x0, #58
0xffffffc01022def4 <get_swap_device+0x14>: mov x29, sp
0xffffffc01022def8 <get_swap_device+0x18>: ldr w3, [x1,#4]
0xffffffc01022defc <get_swap_device+0x1c>: cmp w3, w2
0xffffffc01022df00 <get_swap_device+0x20>: b.ls 0xffffffc01022df40 <get_swap_device+0x60>
0xffffffc01022df04 <get_swap_device+0x24>: dmb ishld



6. Combined with the swap information, we looked up the swap_map value corresponding to the entry, and found that it is 0x41 (SWAP_HAS_CACHE | 1).

crash> swap
SWAP_INFO_STRUCT    TYPE       SIZE       USED     PCT  PRI  FILENAME
ffffffa0ebae9a00  PARTITION  9693180k   2713508k   27%   -2  /dev/sda6
crash> struct swap_info_struct ffffffa0ebae9a00 | grep swap_map
  swap_map = 0xffffffc012e01000 "?",
  frontswap_map = 0xffffffa0ea580000,
crash> px (0xffffffc012e01000+0x1d1ca6)
$4 = 0xffffffc012fd2ca6
crash> rd -8  0xffffffc012fd2ca6
ffffffc012fd2ca6:  41                                                A

7. Bringing 0x41 into the __read_swap_cache_async() flow: swapcache_prepare() returns -EEXIST for this value, and as long as the swap_map value for the entry stays 0x41, swapcache_prepare() keeps returning -EEXIST, so task 938 loops forever.

8. A swap_map value of 0x41 means the swapcache page for this entry has not yet been successfully added. Once it is added, __read_swap_cache_async() can get the page from swap_address_space, without the infinite loop.

9. Based on the above, we suspect that the task adding the swapcache page for task 938's entry is itself blocked. Looking at the stacks of the other tasks, we found they had triggered direct-reclaim I/O flushing in swap_cluster_readahead->read_swap_cache_async->swap_readpage->get_swap_bio, but were blocked by blk-wbt because of the uncompleted request described above.



AFAIK, we believe there is a potential logical deadlock, and that it is the root cause of our system's hang: task 938 loops in swapcache_prepare() waiting for the page to be added to the swapcache, but the task that should read the data from the swap area into that swapcache page is blocked on I/O at blk-wbt (or hctx->tags), and that I/O is blocked precisely because task 938 still holds an unflushed request in its blk_plug.

> And how does the switch to GFP_NOIO fix this?  Simply by avoiding
direct reclaim altogether?

In the kernel stacks we captured, the tasks most likely blocked by task 938 are the ones stuck in swap_cluster_readahead->read_swap_cache_async->swap_readpage->get_swap_bio: when get_swap_bio enters direct memory reclaim and tries to flush I/O, it goes into D state waiting on the request held in task 938's blk_plug. So, from an engineering perspective, we only solve this case urgently: switching to GFP_NOIO makes the readahead path aware of memory pressure sooner, reduces readahead, and lets the fault fall back to a second read_swap_cache_async() for the real orig_pte entry. This effectively reduces the probability of the hang, but risks remain: if the entry whose allocation fails is not the one for orig_pte there is no big problem, but if it is the orig_pte entry, returning -ENOMEM may cause serious problems. Solving this potential circular dependency fundamentally is beyond my ability, so I ask the community to discuss and research a fundamental fix.


On Sat, Apr 3, 2021 at 8:44 AM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
On Fri, 2 Apr 2021 15:03:37 +0800 Stillinux <stillinux@xxxxxxxxx> wrote:

> In the case of high system memory and load pressure, we ran ltp test
> and found that the system was stuck, the direct memory reclaim was
> all stuck in io_schedule, the waiting request was stuck in the blk_plug
> flow of one process, and this process fell into an infinite loop.
> not do the action of brushing out the request.
>
> The call flow of this process is swap_cluster_readahead.
> Use blk_start/finish_plug for blk_plug operation,
> flow swap_cluster_readahead->__read_swap_cache_async->swapcache_prepare.
> When swapcache_prepare return -EEXIST, it will fall into an infinite loop,
> even if cond_resched is called, but according to the schedule,
> sched_submit_work will be based on tsk->state, and will not flash out
> the blk_plug request, so will hang io, causing the overall system  hang.
>
> For the first time involving the swap part, there is no good way to fix
> the problem from the fundamental problem. In order to solve the
> engineering situation, we chose to make swap_cluster_readahead aware of
> the memory pressure situation as soon as possible, and do io_schedule to
> flush out the blk_plug request, thereby changing the allocation flag in
> swap_readpage to GFP_NOIO , No longer do the memory reclaim of flush io.
> Although system operating normally, but not the most fundamental way.
>

Thanks.

I'm not understanding why swapcache_prepare() repeatedly returns
-EEXIST in this situation?

And how does the switch to GFP_NOIO fix this?  Simply by avoiding
direct reclaim altogether?

> ---
>  mm/page_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index c493ce9ebcf5..87392ffabb12 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -403,7 +403,7 @@ int swap_readpage(struct page *page, bool synchronous)
>       }
>
>       ret = 0;
> -     bio = bio_alloc(GFP_KERNEL, 1);
> +     bio = bio_alloc(GFP_NOIO, 1);
>       bio_set_dev(bio, sis->bdev);
>       bio->bi_opf = REQ_OP_READ;
>       bio->bi_iter.bi_sector = swap_page_sector(page);

