Tejun Heo wrote:
> Hello, Tetsuo.
>
> I went through the logs and it doesn't look like the mm workqueue
> actually stalled, it was just slow to make progress. Please see
> below.
>
> On Fri, Jul 07, 2017 at 07:27:06PM +0900, Tetsuo Handa wrote:
> > Since drain_local_pages_wq work was stalling for 144 seconds as of uptime = 541,
> > drain_local_pages_wq work was queued around uptime = 397 (which is about 6 seconds
> > since the OOM killer/reaper reclaimed some memory for the last time).
> >
> > But as far as I can see from traces, the mm_percpu_wq thread as of uptime = 444 was
> > idle, while drain_local_pages_wq work was pending from uptime = 541 to uptime = 605.
> > This means that the mm_percpu_wq thread did not start processing drain_local_pages_wq
> > work immediately. (I don't know what made drain_local_pages_wq work be processed.)
> >
> > Why? Is this a workqueue implementation bug? Is this a workqueue usage bug?
>
> So, the rescuer doesn't kick in as soon as the workqueue becomes slow. It
> kicks in if the worker pool that the workqueue is associated with
> hangs. That is, if you have other work items actively running, e.g.,
> for reclaim on the pool, the pool isn't stalled and rescuers won't be
> woken up. IOW, having a rescuer prevents a workqueue from deadlocking
> due to resource starvation but it doesn't necessarily make it go
> faster. It's a deadlock prevention mechanism, not a priority-raising
> one. If the work items need preferential execution, they should use
> WQ_HIGHPRI.

Thank you for the explanation. I tried the change below. It indeed reduced
the delays, but even with WQ_HIGHPRI, is a delay of up to a few seconds
unavoidable? I wished the work were processed within a few jiffies.
----------
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9a4441b..c099ebf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1768,7 +1768,8 @@ void __init init_mm_internals(void)
 {
 	int ret __maybe_unused;
 
-	mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
+	mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
+				       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
 
 #ifdef CONFIG_SMP
 	ret = cpuhp_setup_state_nocalls(CPUHP_MM_VMSTAT_DEAD, "mm/vmstat:dead",
----------
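For reference (and not part of the patch above), this is a minimal,
self-contained sketch of what allocating and using a WQ_MEM_RECLAIM |
WQ_HIGHPRI workqueue looks like from a caller's point of view. All names
below (demo_wq, demo_work_fn and so on) are made up for illustration;
this is not code from mm/vmstat.c.
----------
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;

static void demo_work_fn(struct work_struct *work)
{
	/* Runs on the per-CPU highpri (nice -20) worker pool. */
	pr_info("demo work executed\n");
}

static DECLARE_WORK(demo_work, demo_work_fn);

static int __init demo_init(void)
{
	/*
	 * WQ_MEM_RECLAIM guarantees a rescuer thread so the workqueue can
	 * make forward progress under memory pressure; WQ_HIGHPRI selects
	 * the highpri worker pool instead of the normal one.
	 */
	demo_wq = alloc_workqueue("demo_wq", WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
	if (!demo_wq)
		return -ENOMEM;

	queue_work(demo_wq, &demo_work);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
----------
As explained above, the rescuer that WQ_MEM_RECLAIM provides only
guarantees forward progress; it is WQ_HIGHPRI that selects the highpri
(nice=-20) worker pool visible in the "After" dump below.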
Before:
----------
[ 906.781160] Showing busy workqueues and worker pools:
[ 906.789620] workqueue events: flags=0x0
[ 906.796439] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=7/256
[ 906.805291] in-flight: 99:vmw_fb_dirty_flush [vmwgfx]{513504}
[ 906.809676] pending: vmpressure_work_fn{571835}, e1000_watchdog [e1000]{571413}, vmstat_shepherd{571093}, vmw_fb_dirty_flush [vmwgfx]{513504}, free_work{50571}, free_obj_work{50546}
[ 906.821048] workqueue events_power_efficient: flags=0x80
[ 906.825113] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[ 906.829567] pending: fb_flashcursor{571684}, do_cache_clean{566868}, neigh_periodic_work{564836}
[ 906.835508] workqueue events_freezable_power_: flags=0x84
[ 906.838162] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 906.840906] in-flight: 200:disk_events_workfn{571819}
[ 906.843485] workqueue mm_percpu_wq: flags=0x8
[ 906.845675] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 906.848754] pending: drain_local_pages_wq{50498} BAR(2104){50498}
[ 906.851717] workqueue writeback: flags=0x4e
[ 906.853841] pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[ 906.856556] in-flight: 354:wb_workfn{571030} wb_workfn{571030}
[ 906.859881] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=571s workers=2 manager: 33
[ 906.863663] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=7s workers=3 idle: 29 1548
[ 906.867153] pool 128: cpus=0-63 flags=0x4 nice=0 hung=43s workers=3 idle: 355 2115
----------

After:
----------
[ 778.377896] Showing busy workqueues and worker pools:
[ 778.380129] workqueue events: flags=0x0
[ 778.381879] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=4/256
[ 778.384371] pending: vmpressure_work_fn{117854}, free_work{117845}, e1000_watchdog [e1000]{117406}, e1000_watchdog [e1000]{117406}
[ 778.389198] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[ 778.391732] in-flight: 1635:vmw_fb_dirty_flush [vmwgfx]{154837} vmw_fb_dirty_flush [vmwgfx]{154837}
[ 778.395522] workqueue events_power_efficient: flags=0x80
[ 778.397828] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 778.400385] pending: do_cache_clean{96351}
[ 778.402395] workqueue events_freezable_power_: flags=0x84
[ 778.404734] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 778.407360] in-flight: 185:disk_events_workfn{156895}
[ 778.409902] workqueue mm_percpu_wq: flags=0x18
[ 778.412035] pwq 7: cpus=3 node=0 flags=0x0 nice=-20 active=1/256
[ 778.414675] pending: vmstat_update{5644}
[ 778.416681] workqueue writeback: flags=0x4e
[ 778.418701] pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[ 778.421210] in-flight: 358:wb_workfn{546} wb_workfn{546}
[ 778.424101] workqueue xfs-eofblocks/sda1: flags=0xc
[ 778.426304] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 778.429112] in-flight: 52:xfs_eofblocks_worker [xfs]{117964}
[ 778.431749] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 231 idle: 3
[ 778.435105] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 1606 49
[ 778.438235] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=117s workers=2 manager: 215
[ 778.441389] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=6s workers=3 idle: 63 29
[ 778.444882] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=3 idle: 2449 360
----------

By the way, I think it might be useful if the delay of each work item were
printed together with it...
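In case it helps, here is one way a caller could measure that delay from
its own side, by recording jiffies when the work is queued. struct
stamped_work and the helper functions are invented for this sketch and
are not part of the workqueue API; the dumps above are produced by the
workqueue code itself, not by anything like this.
----------
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

struct stamped_work {
	struct work_struct work;
	unsigned long queued_at;	/* jiffies when the work was queued */
};

static void stamped_work_fn(struct work_struct *work)
{
	struct stamped_work *sw = container_of(work, struct stamped_work, work);

	/* Report how long the item waited on the workqueue before running. */
	pr_info("work waited %u ms before execution\n",
		jiffies_to_msecs(jiffies - sw->queued_at));
}

static void init_stamped_work(struct stamped_work *sw)
{
	INIT_WORK(&sw->work, stamped_work_fn);
}

static bool queue_stamped_work(struct workqueue_struct *wq,
			       struct stamped_work *sw)
{
	sw->queued_at = jiffies;	/* remember the queueing time */
	return queue_work(wq, &sw->work);
}
----------
Of course, having the busy-workqueue dump itself print such a delay would
cover all work items, not just ones instrumented by their callers.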