On Thu 05-11-15 19:16:48, Tejun Heo wrote: > Hello, > > On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote: > > Sorry but we need work queue processing for vmstat counters that is > > I made this analogy before but this is similar to looping with > preemption off. If anything on workqueue stays RUNNING w/o making > forward progress, it's buggy. I'd venture to say any code which busy > loops without making forward progress in the time scale noticeable to > human beings is borderline buggy too. Well, the caller asked for a memory but the request cannot succeed. Due to the memory allocator semantic we cannot fail the request so we have to loop. If we had an event to wait for we would do so, of course. Now wrt. to a small sleep. We used to do that and called congestion_wait(HZ/50) before retry. This has proved to cause stalls during high memory pressure 0e093d99763e ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone"). I do not really remember what was CONFIG_HZ in those reports but it is quite possible it was 250. So there is a risk of (partial) re-introducing of those stalls with the patch from Tetsuo (http://lkml.kernel.org/r/201510251952.CEF04109.OSOtLFHFVFJMQO@xxxxxxxxxxxxxxxxxxx) If we really have to do short sleep, though, then I would suggest sticking that into wait_iff_congested rather than spread it into more places and reduce it only to worker threads. This should be much more safer. Thought? --- diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 8ed2ffd963c5..7340353f8aea 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait); * jiffies for either a BDI to exit congestion of the given @sync queue * or a write to complete. * - * In the absence of zone congestion, cond_resched() is called to yield - * the processor if necessary but otherwise does not sleep. + * In the absence of zone congestion, a short sleep or a cond_resched is + * performed to yield the processor and to allow other subsystems to make + * a forward progress. * * The return value is 0 if the sleep is for the full timeout. Otherwise, * it is the number of jiffies that were still remaining when the function @@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout) */ if (atomic_read(&nr_wb_congested[sync]) == 0 || !test_bit(ZONE_CONGESTED, &zone->flags)) { - cond_resched(); + + /* + * Memory allocation/reclaim might be called from a WQ + * context and the current implementation of the WQ + * concurrency control doesn't recognize that a particular + * WQ is congested if the worker thread is looping without + * ever sleeping. Therefore we have to do a short sleep + * here rather than calling cond_resched(). + */ + if (current->flags & PF_WQ_WORKER) + schedule_timeout(1); + else + cond_resched(); /* In case we scheduled, work out time remaining */ ret = timeout - (jiffies - start); -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>