On Wed, Nov 30, 2016 at 02:19:10PM +0100, Michal Hocko wrote:
> On Wed 30-11-16 03:53:20, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 12:09:44PM +0100, Michal Hocko wrote:
> > > [CCing Paul]
> > > 
> > > On Wed 30-11-16 11:28:34, Donald Buczek wrote:
> > > [...]
> > > > shrink_active_list gets and releases the spinlock and calls
> > > > cond_resched().  This should give other tasks a chance to run.
> > > > Just as an experiment, I'm trying
> > > > 
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
> > > >  		spin_unlock_irq(&pgdat->lru_lock);
> > > >  
> > > >  	while (!list_empty(&l_hold)) {
> > > > -		cond_resched();
> > > > +		cond_resched_rcu_qs();
> > > >  		page = lru_to_page(&l_hold);
> > > >  		list_del(&page->lru);
> > > > 
> > > > and didn't hit a rcu_sched warning for >21 hours uptime now.
> > > > We'll see.
> > > 
> > > This is really interesting! Is it possible that the RCU stall
> > > detector is somehow confused?
> > 
> > No, it is not confused.  Again, cond_resched() is not a quiescent
> > state unless it does a context switch.  Therefore, if the task
> > running in that loop was the only runnable task on its CPU,
> > cond_resched() would -never- provide RCU with a quiescent state.
> 
> Sorry for being dense here. But why cannot we hide the QS handling into
> cond_resched()? I mean doesn't every current usage of cond_resched
> suffer from the same problem wrt RCU stalls?

We can, and you are correct that cond_resched() does not unconditionally
supply RCU quiescent states, and never has.  Last time I tried to add
cond_resched_rcu_qs() semantics to cond_resched(), I got told "no", but
perhaps it is time to try again.

One of the challenges is that there are two different timeframes.
If we want CONFIG_PREEMPT=n kernels to have millisecond-level scheduling
latencies, we need a cond_resched() more than once per millisecond, and
the usual uncertainties will mean more like once per hundred microseconds
or so.  In contrast, the occasional 100-millisecond RCU grace period when
under heavy load is normally not considered to be a problem, which means
that a cond_resched_rcu_qs() every 10 milliseconds or so is just fine.
Which means that cond_resched() is much more sensitive to overhead than
is cond_resched_rcu_qs().

No reason not to give it another try, though!  (Adding Peter Zijlstra to
CC for his reactions.)  Right now, the added overhead is a function call,
two tests of per-CPU variables, one increment of a per-CPU variable, and
a barrier() before and after.  I could probably combine the tests, but I
do need at least one test.  I cannot see how I can eliminate either
barrier().  I might be able to pull the increment under the test.  The
patch below is instead very straightforward, avoiding any optimizations.
Untested, probably does not even build.

Failing this approach, the rule is as follows:

1.	Add cond_resched() to in-kernel loops that cause excessive
	scheduling latencies.

2.	Add cond_resched_rcu_qs() to in-kernel loops that cause RCU CPU
	stall warnings.

							Thanx, Paul

> > In contrast, cond_resched_rcu_qs() unconditionally provides RCU
> > with a quiescent state (hence the _rcu_qs in its name), regardless
> > of whether or not a context switch happens.
> > 
> > It is therefore expected behavior that this change might prevent
> > RCU CPU stall warnings.

------------------------------------------------------------------------

commit d7100358d066cd7d64301a2da161390e9f4aa63f
Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Date:   Wed Nov 30 06:24:30 2016 -0800

    sched,rcu: Make cond_resched() provide RCU quiescent state
    
    There is some confusion as to which of cond_resched() or
    cond_resched_rcu_qs() should be added to long in-kernel loops.
    This commit therefore eliminates the decision by adding RCU
    quiescent states to cond_resched().
    
    Warning: This is a prototype.  For example, it does not correctly
    handle Tasks RCU.  Which is OK for the moment, given that no one
    actually uses Tasks RCU yet.
    
    Reported-by: Michal Hocko <mhocko@xxxxxxxxxx>
    Not-yet-signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
    Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b0ec92..ccdb6064884e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3308,10 +3308,11 @@ static inline int signal_pending_state(long state, struct task_struct *p)
  * cond_resched_lock() will drop the spinlock before scheduling,
  * cond_resched_softirq() will enable bhs before scheduling.
  */
+void rcu_all_qs(void);
 #ifndef CONFIG_PREEMPT
 extern int _cond_resched(void);
 #else
-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void) { rcu_all_qs(); return 0; }
 #endif
 
 #define cond_resched() ({			\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1ab00a..40b690813b80 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4906,6 +4906,7 @@ int __sched _cond_resched(void)
 		preempt_schedule_common();
 		return 1;
 	}
+	rcu_all_qs();
 	return 0;
 }
 EXPORT_SYMBOL(_cond_resched);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>