Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

Tejun Heo <tj@xxxxxxxxxx> · Mon, 30 Jul 2018 12:14:23 -0700

Hello, Michal.

On Mon, Jul 30, 2018 at 08:51:10PM +0200, Michal Hocko wrote:
> > Yeah, workqueue can choke on things like that and kthread indefinitely
> > busy looping doesn't do anybody any good.
> 
> Yeah, I do agree. But this is much easier said than done ;) Sure
> we have that hack that does sleep rather than cond_resched in the
> page allocator. We can and will "fix" it to be unconditional in the
> should_reclaim_retry [1] but this whole thing is really subtle. It just
> takes one misbehaving worker and something which is really important to
> run will get stuck.

Oh yeah, I'm not saying the current behavior is ideal or anything, but
since the behavior was put in many years ago, it has become a problem
only a couple of times, and all cases were rather easy and obvious
fixes on the wq user side.  It shouldn't be difficult to add a timer
mechanism on top.  We might be able to simply extend the hang
detection mechanism to kick off all pending rescuers after detecting a
wq stall.  I'm wary about making it a part of normal operation
(i.e. a silent timeout).  Per-cpu kworkers really shouldn't busy loop
for an extended period of time.
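
To make the rescuer idea a bit more concrete, here is a rough userspace
model of it.  This is not kernel code and not a proposed patch: the
struct names, the watchdog_check() helper and the 30-tick threshold are
all made up for illustration.  It only shows the shape of the logic,
namely that rescuers get kicked after a stall has been detected rather
than on a silent timeout during normal operation.

```c
/* Userspace model only; none of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STALL_THRESHOLD_TICKS 30	/* hypothetical stall threshold */
#define MAX_WQS 4

struct model_wq {
	bool has_pending;	/* work items queued but not yet started */
	bool has_rescuer;	/* this wq was created with a rescuer */
	bool rescuer_kicked;	/* set once the watchdog wakes the rescuer */
};

struct model_pool {
	unsigned long last_progress;	/* tick when the pool last started a work item */
	struct model_wq wqs[MAX_WQS];
	size_t nr_wqs;
};

/*
 * Called periodically by the (modelled) hang detector.  Does nothing
 * unless the pool has made no progress for STALL_THRESHOLD_TICKS;
 * after that it kicks the rescuer of every wq that has pending work.
 * Returns the number of rescuers kicked by this call.
 */
static size_t watchdog_check(struct model_pool *pool, unsigned long now)
{
	size_t kicked = 0;

	assert(pool != NULL);

	/* Normal operation: the pool is making progress, stay out of the way. */
	if (now - pool->last_progress < STALL_THRESHOLD_TICKS)
		return 0;

	for (size_t i = 0; i < pool->nr_wqs; i++) {
		struct model_wq *wq = &pool->wqs[i];

		if (wq->has_pending && wq->has_rescuer && !wq->rescuer_kicked) {
			wq->rescuer_kicked = true;
			kicked++;
		}
	}
	return kicked;
}
```

In this model a pool whose last_progress is 100 is left alone at tick
110 and has its pending rescuers kicked at tick 130.  A wq without a
rescuer, or without pending work, is never touched.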

Thanks.

-- 
tejun