Tejun Heo wrote:
> On Mon, Aug 28, 2017 at 02:10:56PM +0200, Michal Hocko wrote:
> > I am not sure I understand how WQ_HIGHPRI actually helps. The work
> > item will get served by a thread with higher priority and from a
> > different pool than regular WQs. But what prevents the same issue as
> > described above when the highprio pool gets congested? In other
> > words, what makes WQ_HIGHPRI less prone to long stalls when we are in
> > a low memory situation and new workers cannot be allocated?
>
> So, the problem wasn't a new worker not getting allocated due to
> memory pressure. The rescuer can handle that. The problem is that the
> regular worker pool is occupied with something which is constantly in
> a runnable state - most likely writeback / reclaim - so the workqueue
> doesn't schedule the other work items.
>
> Setting WQ_HIGHPRI works because the highpri worker pool isn't likely
> to be contended that way, but it might not be the best solution. The
> right thing to do would be setting WQ_CPU_INTENSIVE on the work items
> which can burn a lot of CPU cycles, so that they don't get in the way
> of other work items (workqueue should probably trigger a warning on
> these work items too).
>
> Tetsuo, can you please try to find which work items are occupying the
> worker pool for an extended period of time under memory pressure and
> set WQ_CPU_INTENSIVE on them?

Wouldn't that be any work item which does a __GFP_DIRECT_RECLAIM &&
!__GFP_NORETRY memory allocation, since such an allocation burns a lot
of CPU cycles under memory pressure? In other words, won't we end up
setting WQ_CPU_INTENSIVE on almost all workqueues?
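For reference, a minimal sketch of the WQ_CPU_INTENSIVE variant Tejun
describes, with hypothetical names (heavy_wq, heavy_workfn); the flag is
passed at alloc_workqueue() time:

#include <linux/workqueue.h>

static struct workqueue_struct *heavy_wq;

/* Hypothetical work function which burns a lot of CPU cycles, e.g. by
 * doing __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocations. */
static void heavy_workfn(struct work_struct *work)
{
        /* ... CPU-intensive processing ... */
}
static DECLARE_WORK(heavy_work, heavy_workfn);

static int __init heavy_init(void)
{
        /*
         * WQ_CPU_INTENSIVE: while running, work items on this workqueue
         * do not count toward the per-cpu worker pool's concurrency
         * limit, so they cannot starve other per-cpu work items.
         */
        heavy_wq = alloc_workqueue("heavy_wq", WQ_CPU_INTENSIVE, 0);
        if (!heavy_wq)
                return -ENOMEM;
        queue_work(heavy_wq, &heavy_work);
        return 0;
}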
----------
[  605.720125] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 208s!
[  605.736025] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 209s!
[  605.746669] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 209s!
[  605.755091] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 64s!
[  605.763390] Showing busy workqueues and worker pools:
[  605.769436] workqueue events: flags=0x0
[  605.772204]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  605.775548]     pending: console_callback{197431}, vmw_fb_dirty_flush [vmwgfx]{174896}, sysrq_reinject_alt_sysrq{174440}, push_to_pool{162245}
[  605.780761]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256
[  605.783603]     pending: e1000_watchdog [e1000]{207984}, check_corruption{166511}, rht_deferred_worker{28894}
[  605.787725]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  605.790682]     pending: vmpressure_work_fn{209065}, e1000_watchdog [e1000]{207615}
[  605.794271]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.797150]     pending: vmstat_shepherd{208067}
[  605.799610] workqueue events_long: flags=0x0
[  605.801951]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.805098]     pending: gc_worker [nf_conntrack]{208961}
[  605.807976] workqueue events_freezable: flags=0x4
[  605.810391]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.813151]     pending: vmballoon_work [vmw_balloon]{208085}
[  605.815851] workqueue events_power_efficient: flags=0x80
[  605.818382]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  605.821124]     pending: check_lifetime{64453}
[  605.823337]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.826091]     pending: neigh_periodic_work{199329}
[  605.828426]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  605.831068]     pending: fb_flashcursor{209042}, do_cache_clean{201882}
[  605.833902]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.836545]     pending: neigh_periodic_work{195234}
[  605.838838] workqueue events_freezable_power_: flags=0x84
[  605.841295]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.843824]     pending: disk_events_workfn{208625}
[  605.846084] workqueue mm_percpu_wq: flags=0x8
[  605.848145]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  605.850667]     pending: drain_local_pages_wq{209047} BAR(4561){209047}
[  605.853368] workqueue writeback: flags=0x4e
[  605.855382]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=1/256
[  605.857793]     in-flight: 354:wb_workfn{182977}
[  605.860182] workqueue xfs-data/sda1: flags=0xc
[  605.862314]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=23/256 MAYDAY
[  605.865010]     in-flight: 3221:xfs_end_io [xfs]{209041}, 3212:xfs_end_io [xfs]{209158}, 29:xfs_end_io [xfs]{209200}, 3230:xfs_end_io [xfs]{209171}, 3229:xfs_end_io [xfs]{209099}, 50:xfs_end_io [xfs]{209099}, 3223:xfs_end_io [xfs]{209045}, 165:xfs_end_io [xfs]{209052}, 3215:xfs_end_io [xfs]{209046}
[  605.874362]     pending: xfs_end_io [xfs]{209011}, xfs_end_io [xfs]{209007}, xfs_end_io [xfs]{209007}, xfs_end_io [xfs]{208999}, xfs_end_io [xfs]{208977}, xfs_end_io [xfs]{208975}, xfs_end_io [xfs]{208970}, xfs_end_io [xfs]{208963}, xfs_end_io [xfs]{208963}, xfs_end_io [xfs]{208950}, xfs_end_io [xfs]{208948}, xfs_end_io [xfs]{208948}, xfs_end_io [xfs]{208946}, xfs_end_io [xfs]{30655}
[  605.886882]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=28/256 MAYDAY
[  605.889661]     in-flight: 3225:xfs_end_io [xfs]{209249}, 236:xfs_end_io [xfs]{209163}, 23:xfs_end_io [xfs]{209151}, 3228:xfs_end_io [xfs]{209151}, 4380:xfs_end_io [xfs]{209259}, 3214:xfs_end_io [xfs]{209240}, 3220:xfs_end_io [xfs]{209212}, 3227:xfs_end_io [xfs]{209233}
[  605.898706]     pending: xfs_end_io [xfs]{209159}, xfs_end_io [xfs]{209149}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209128}, xfs_end_io [xfs]{209116}, xfs_end_io [xfs]{209110}, xfs_end_io [xfs]{209096}, xfs_end_io [xfs]{209096}, xfs_end_io [xfs]{209092}, xfs_end_io [xfs]{209082}, xfs_end_io [xfs]{209061}, xfs_end_io [xfs]{209058}, xfs_end_io [xfs]{209051}, xfs_end_io [xfs]{209040}, xfs_end_io [xfs]{209021}, xfs_end_io [xfs]{209014}, xfs_end_io [xfs]{30678}
[  605.917299]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=24/256 MAYDAY
[  605.920254]     in-flight: 375(RESCUER):xfs_end_io [xfs]{209194}, 42:xfs_end_io [xfs]{209278}, 3222:xfs_end_io [xfs]{209286}, 17:xfs_end_io [xfs]{209195}, 65:xfs_end_io [xfs]{209241}, 122:xfs_end_io [xfs]{209230}
[  605.927845]     pending: xfs_end_io [xfs]{209187}, xfs_end_io [xfs]{209154}, xfs_end_io [xfs]{209113}, xfs_end_io [xfs]{209088}, xfs_end_io [xfs]{209081}, xfs_end_io [xfs]{209071}, xfs_end_io [xfs]{209070}, xfs_end_io [xfs]{209067}, xfs_end_io [xfs]{209062}, xfs_end_io [xfs]{209053}, xfs_end_io [xfs]{209051}, xfs_end_io [xfs]{209047}, xfs_end_io [xfs]{209032}, xfs_end_io [xfs]{209027}, xfs_end_io [xfs]{209017}, xfs_end_io [xfs]{209016}, xfs_end_io [xfs]{209014}, xfs_end_io [xfs]{209011}
[  605.944773]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=13/256
[  605.947616]     in-flight: 3218:xfs_end_io [xfs]{209268}, 3216:xfs_end_io [xfs]{209265}, 3:xfs_end_io [xfs]{209223}, 33:xfs_end_io [xfs]{209181}, 101:xfs_end_io [xfs]{209159}, 4381:xfs_end_io [xfs]{209294}, 3219:xfs_end_io [xfs]{209181}
[  605.956005]     pending: xfs_end_io [xfs]{209149}, xfs_end_io [xfs]{209141}, xfs_end_io [xfs]{209133}, xfs_end_io [xfs]{209057}, xfs_end_io [xfs]{209026}, xfs_end_io [xfs]{209025}
[  605.963018] workqueue xfs-sync/sda1: flags=0x4
[  605.965455]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  605.968299]     pending: xfs_log_worker [xfs]{202031}
[  605.970872] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=209s workers=8 manager: 3224
[  605.974252] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=209s workers=6 manager: 3213
[  605.977682] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=209s workers=9 manager: 47
[  605.981015] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=64s workers=10 manager: 3217
[  605.984382] pool 128: cpus=0-63 flags=0x4 nice=0 hung=183s workers=3 idle: 355 356
----------

> > > If we do want to make sure that work items on the mm_percpu_wq
> > > workqueue are executed without delays, we need to consider using
> > > kthread_workers instead of a workqueue. (Or, maybe we can somehow
> > > share one kthread by constantly manipulating its cpumask?)
> >
> > Hmm, that doesn't sound like a bad idea to me. We already have a
> > rescuer thread that basically sits idle all the time, so having a
> > dedicated kernel thread will not be more expensive wrt. resources.
> > So I think this is a more reasonable approach than playing with
> > WQ_HIGHPRI, which smells more like an obscure workaround than a
> > real fix to me.
>
> Well, there's one rescuer in the whole system and you'd need
> nr_online_cpus kthreads if you want to avoid constant cacheline
> bouncing.

Excuse me, but isn't it one rescuer kernel thread per WQ_MEM_RECLAIM
workqueue?

My thought is to stop using a WQ_MEM_RECLAIM workqueue for mm_percpu_wq
and instead use a dedicated kernel thread, like oom_reaper. Since the
handler function seems to be called once per second for each online
CPU, I thought that switching the cpumask NR_CPUS times per second is
tolerable.
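A rough sketch of that dedicated-kthread idea (all names are
hypothetical; handle_one_cpu() stands in for whatever the mm_percpu_wq
work items do for one CPU, and CPU-hotplug locking is omitted for
brevity):

#include <linux/cpumask.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* Hypothetical per-CPU handler; must run on the CPU it services. */
static void handle_one_cpu(void)
{
}

static int percpu_collector_fn(void *unused)
{
        int cpu;

        while (!kthread_should_stop()) {
                for_each_online_cpu(cpu) {
                        /* Migrate this thread to @cpu so that the
                         * handler touches that CPU's data locally. */
                        set_cpus_allowed_ptr(current, cpumask_of(cpu));
                        handle_one_cpu();
                }
                /* Run once per second, like the current handler. */
                schedule_timeout_interruptible(HZ);
        }
        return 0;
}

static int __init start_percpu_collector(void)
{
        return PTR_ERR_OR_ZERO(kthread_run(percpu_collector_fn, NULL,
                                           "percpu_collector"));
}

Unlike a workqueue work item, this single thread can never be blocked
behind unrelated CPU-intensive work items; the trade-off is one thread
migration per online CPU per second.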
Or, yet another approach would be to use split counters: each CPU
writes up-to-date values to its own per-CPU counters, and an aggregator
kernel thread reads the up-to-date values from the per-CPU counters,
calculates the diff between the up-to-date values and the previous
values, saves the up-to-date values as the previous values, and
reflects the diff in the global counters. That would make sense if the
cost of reading the per-CPU counters of all online CPUs is smaller than
the cost of switching the cpumask for each online CPU.
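A minimal sketch of that split-counter scheme (again with hypothetical
names): the hot path only ever touches its own CPU's counter, e.g. via
this_cpu_inc(local_count), so it never bounces a shared cacheline, and
only the aggregator reads remote counters.

#include <linux/atomic.h>
#include <linux/kthread.h>
#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(long, local_count); /* written only by its own CPU */
static long prev_count[NR_CPUS];          /* aggregator's last snapshot */
static atomic_long_t global_count;        /* the global counter */

static int aggregator_fn(void *unused)
{
        int cpu;

        while (!kthread_should_stop()) {
                for_each_online_cpu(cpu) {
                        long now = READ_ONCE(per_cpu(local_count, cpu));
                        long diff = now - prev_count[cpu];

                        /* Save the up-to-date value as the previous
                         * value and fold the diff into the global
                         * counter. */
                        prev_count[cpu] = now;
                        if (diff)
                                atomic_long_add(diff, &global_count);
                }
                schedule_timeout_interruptible(HZ);
        }
        return 0;
}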