On Thu, May 20, 2021 at 12:42:57PM +0100, Aaron Tomlin wrote: > On Thu 2021-05-20 12:20 +0200, Vlastimil Babka wrote: > > On 5/20/21 6:34 AM, Andrew Morton wrote: > > > > > > What observed problems motivated this change? > > > > > > What were the observed runtime effects of this change? > > > > Yep those details from the previous thread should be included here. > > Fair enough. > > During kernel crash dump/or vmcore analysis: I discovered in the context of > __alloc_pages_slowpath() the value stored in the no_progress_loops variable > was found to be 31,611,688 i.e. well above MAX_RECLAIM_RETRIES; and a fatal > signal was pending against current. While this is true, it's not really answering Andrew's question. What we want as part of the commit message is something like: "A customer experienced a low memory situation and sent their task a fatal signal. Instead of dying promptly, it looped in the page allocator failing to make progress because ..." > > #6 [ffff00002e78f7c0] do_try_to_free_pages+0xe4 at ffff00001028bd24 > #7 [ffff00002e78f840] try_to_free_pages+0xe4 at ffff00001028c0f4 > #8 [ffff00002e78f900] __alloc_pages_nodemask+0x500 at ffff0000102cd130 > > // w28 = *(sp + 148) /* no_progress_loops */ > 0xffff0000102cd1e0 <__alloc_pages_nodemask+0x5b0>: ldr w0, [sp,#148] > // w0 = w0 + 0x1 > 0xffff0000102cd1e4 <__alloc_pages_nodemask+0x5b4>: add w0, w0, #0x1 > // *(sp + 148) = w0 > 0xffff0000102cd1e8 <__alloc_pages_nodemask+0x5b8>: str w0, [sp,#148] > // if (w0 >= 0x10) > // goto __alloc_pages_nodemask+0x904 > 0xffff0000102cd1ec <__alloc_pages_nodemask+0x5bc>: cmp w0, #0x10 > 0xffff0000102cd1f0 <__alloc_pages_nodemask+0x5c0>: b.gt 0xffff0000102cd534 > > - The stack pointer was 0xffff00002e78f900 > > crash> p *(int *)(0xffff00002e78f900+148) > $1 = 31611688 > > crash> ps 521171 > PID PPID CPU TASK ST %MEM VSZ RSS COMM > > 521171 1 36 ffff8080e2128800 RU 0.0 34789440 18624 special > > crash> p &((struct task_struct *)0xffff8080e2128800)->signal.shared_pending > $2 = (struct sigpending *) 0xffff80809a416e40 > > crash> p ((struct sigpending *)0xffff80809a416e40)->signal.sig[0] > $3 = 0x804100 > > crash> sig -s 0x804100 > SIGKILL SIGTERM SIGXCPU > > crash> p ((struct sigpending *)0xffff80809a416e40)->signal.sig[0] & 1U << (9 - 1) > $4 = 0x100 > > > Unfortunately, this incident was not reproduced, to date. > > > > > > Kind regards, > > -- > Aaron Tomlin