Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
zone_page_state_snapshot"), a task may remain indefinitely stuck in
throttle_direct_reclaim() while holding mm->rwsem.

  __alloc_pages_nodemask
    try_to_free_pages
      throttle_direct_reclaim

This can cause numerous other tasks to wait on the same rwsem, leading
to severe system hangups:

  [1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
  [1088963.365653] Tainted: G OE -------- - - 4.18.0-553.el8_10.aarch64 #1
  [1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [1088963.381862] task:python3 state:D stack:0 pid:1670971 ppid:1667117 flags:0x00800080
  [1088963.381869] Call trace:
  [1088963.381872] __switch_to+0xd0/0x120
  [1088963.381877] __schedule+0x340/0xac8
  [1088963.381881] schedule+0x68/0x118
  [1088963.381886] rwsem_down_read_slowpath+0x2d4/0x4b8

The issue arises when allow_direct_reclaim(pgdat) keeps returning false:
the throttled task makes no progress and continues looping, even when
the pgdat->pfmemalloc_wait wait queue is empty.

In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
> 0), but the calculations of pfmemalloc_reserve and free_pages result in
wmark_ok being false.

Then, although the pgdat->kswapd_wait queue is non-empty, kswapd is not
woken up, which further exacerbates the problem:

  crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
  $775 = __MAX_NR_ZONES

This patch modifies allow_direct_reclaim() to wake kswapd if the
pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
true or false. This change ensures kswapd does not miss wake-ups under
high memory pressure, reducing the risk of task stalls in the throttled
reclaim path.

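For illustration only, the change in the wake-up condition can be modelled in userspace as below. This is a minimal sketch, not kernel code: struct node_model and both helper names are hypothetical, and the two booleans merely stand in for wmark_ok and waitqueue_active(&pgdat->kswapd_wait) in allow_direct_reclaim().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two inputs of the wake-up decision. */
struct node_model {
	bool wmark_ok;       /* models: free_pages > pfmemalloc_reserve / 2 */
	bool kswapd_waiting; /* models: waitqueue_active(&pgdat->kswapd_wait) */
};

/* Condition before the patch: wake only when the watermark check fails. */
bool wakes_kswapd_before(const struct node_model *n)
{
	return !n->wmark_ok && n->kswapd_waiting;
}

/* Condition after the patch: wake whenever kswapd is on the wait queue. */
bool wakes_kswapd_after(const struct node_model *n)
{
	return n->kswapd_waiting;
}

/*
 * Enumerate all four input combinations and check that, in this model,
 * the two conditions differ only when wmark_ok is true and kswapd is
 * waiting.
 */
void check_model(void)
{
	for (int w = 0; w <= 1; w++) {
		for (int k = 0; k <= 1; k++) {
			struct node_model n = {
				.wmark_ok = (w != 0),
				.kswapd_waiting = (k != 0),
			};
			bool differ = wakes_kswapd_before(&n) !=
				      wakes_kswapd_after(&n);

			assert(differ == (n.wmark_ok && n.kswapd_waiting));
		}
	}
}
```

In this model the new condition is a superset of the old one, so a wake-up that was issued before is still issued after the change.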
Signed-off-by: Seiji Nishikawa <snishika@xxxxxxxxxx>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e3..b1b3e5a116a8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
 
 	wmark_ok = free_pages > pfmemalloc_reserve / 2;
 
-	/* kswapd must be awake if processes are being throttled */
-	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+	/* Always wake up kswapd if the wait queue is not empty */
+	if (waitqueue_active(&pgdat->kswapd_wait)) {
 		if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
 			WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
 
-- 
2.47.0