On Wed, Jan 28, 2015 at 01:39:16PM -0500, Luiz Capitulino wrote:
> On Wed, 28 Jan 2015 10:09:50 -0800
> "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> 
> > On Wed, Jan 28, 2015 at 09:18:36AM -0500, Luiz Capitulino wrote:
> > > On Tue, 27 Jan 2015 23:55:08 -0200
> > > Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> > > 
> > > > On Tue, Jan 27, 2015 at 12:37:52PM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Jan 26, 2015 at 02:14:03PM -0500, Luiz Capitulino wrote:
> > > > > > Paul,
> > > > > > 
> > > > > > We're running some measurements with cyclictest running inside a
> > > > > > KVM guest where we could observe spinlock contention among rcuc
> > > > > > threads.
> > > > > > 
> > > > > > Basically, we have a 16-CPU NUMA machine very well set up for RT.
> > > > > > This machine and the guest run the RT kernel. As our test-case
> > > > > > requires an application in the guest taking 100% of the CPU, the
> > > > > > RT priority configuration that gives the best latency is this one:
> > > > > > 
> > > > > >   263 FF  3  [rcuc/15]
> > > > > >    13 FF  3  [rcub/1]
> > > > > >    12 FF  3  [rcub/0]
> > > > > >   265 FF  2  [ksoftirqd/15]
> > > > > >  3181 FF  1  qemu-kvm
> > > > > > 
> > > > > > In this configuration, the rcuc can preempt the guest's vcpu
> > > > > > thread. This shouldn't be a problem, except for the fact that
> > > > > > we're seeing that in some cases the rcuc/15 thread spends 10us
> > > > > > or more spinning in this spinlock (note that IRQs are disabled
> > > > > > during this period):
> > > > > > 
> > > > > > __rcu_process_callbacks()
> > > > > > {
> > > > > >         ...
> > > > > >         local_irq_save(flags);
> > > > > >         if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > >                 raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > >                 rcu_start_gp(rsp);
> > > > > >                 raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > >         ...
> > > > > 
> > > > > Life can be hard when irq-disabled spinlocks can be preempted!  But how
> > > > > often does this happen?
> > > 
> > > I have to run cyclictest in the guest for 16m a few times to reproduce it.
> > 
> > So you are seeing the high contention in the guest, correct?
> 
> No, it's in the host.

OK, good to know.  ;-)

> > > > > Also, does this happen on smaller systems, for
> > > > > example, with four or eight CPUs?
> > > 
> > > Didn't test.
> > > 
> > > > > And I confess to being a bit surprised
> > > > > that you expect real-time response from a guest that is subject to
> > > > > preemption -- as I understand it, the usual approach is to give RT guests
> > > > > their own CPUs.
> > > > > 
> > > > > Or am I missing something?
> > > > 
> > > > We are trying to avoid relying on the guest VCPU to voluntarily yield
> > > > the CPU, thereby allowing the critical services (such as rcu callback
> > > > processing and sched tick processing) to execute.
> > > 
> > > Yes. I hope I won't regret saying this, but what I'm observing is that
> > > preempting the vcpu is not the end of the world as long as you're
> > > quick.
> > 
> > And as long as you get lucky and avoid preempting a VCPU that happens to
> > be holding a critical lock.
> 
> That's not the case. Everything I mentioned in this thread about RCU
> and contention happens in the host.
> 
> > Look, if you want real-time response in a guest OS, there simply is no
> > substitute for ensuring that the guest has its own CPUs that are not used
> > for anything else, either by anything in the host or by another guest.
> > If you do allow preemption of a guest OS that might be holding a critical
> > guest-OS lock, you are going to see latency blows.  Count on it!  ;-)
> > 
> > > > > > We've tried playing with the rcu_nocbs= option. However, it
> > > > > > did not help because, for reasons we don't understand, the rcuc
> > > > > > threads have to handle grace period start even when callback
> > > > > > offloading is used. Handling this case requires this code path
> > > > > > to be executed.
> > > > > 
> > > > > Yep.  The rcu_nocbs= option offloads invocation of RCU callbacks, but not
> > > > > the per-CPU work required to inform RCU of quiescent states.
> > > > 
> > > > Can't you execute that on vCPU entry/exit? Those are quiescent states
> > > > after all.
> > > > 
> > > > > > We've cooked the following extremely dirty patch, just to see
> > > > > > what would happen:
> > > > > > 
> > > > > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > > > > index eaed1ef..c0771cc 100644
> > > > > > --- a/kernel/rcutree.c
> > > > > > +++ b/kernel/rcutree.c
> > > > > > @@ -2298,9 +2298,19 @@ __rcu_process_callbacks(struct rcu_state *rsp)
> > > > > >  	/* Does this CPU require a not-yet-started grace period? */
> > > > > >  	local_irq_save(flags);
> > > > > >  	if (cpu_needs_another_gp(rsp, rdp)) {
> > > > > > -		raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
> > > > > > -		rcu_start_gp(rsp);
> > > > > > -		raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +		for (;;) {
> > > > > > +			if (!raw_spin_trylock(&rcu_get_root(rsp)->lock)) {
> > > > > > +				local_irq_restore(flags);
> > > > > > +				local_bh_enable();
> > > > > > +				schedule_timeout_interruptible(2);
> > > > > 
> > > > > Yes, the above will get you a splat in mainline kernels, which do not
> > > > > necessarily push softirq processing to the ksoftirqd kthreads.  ;-)
> > > > > 
> > > > > > +				local_bh_disable();
> > > > > > +				local_irq_save(flags);
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +			rcu_start_gp(rsp);
> > > > > > +			raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > > > > > +			break;
> > > > > > +		}
> > > > > >  	} else {
> > > > > >  		local_irq_restore(flags);
> > > > > >  	}
> > > > > > 
> > > > > > With this patch rcuc is gone from our traces and the scheduling
> > > > > > latency is reduced by 3us in our CPU-bound test-case.
> > > > > > 
> > > > > > Could you please advise on how to solve this contention problem?
> > > > > 
> > > > > The usual advice would be to configure the system such that the guest's
> > > > > VCPUs do not get preempted.
> > > > 
> > > > The guest vcpus can consume 100% of CPU time (imagine a guest vcpu busy
> > > > spinning). In that case, rcuc would never execute, because it has a
> > > > lower priority than guest VCPUs.
> > > > 
> > > > I do not think we want that.
> > > > 
> > > > > Or is the contention on the root rcu_node structure's ->lock field
> > > > > high for some other reason?
> > > 
> > > I didn't go far in trying to determine the reason. What I observed
> > > was the rcuc preempting the vcpu and taking 10us+. I debugged it
> > > and most of this time it spends spinning on the spinlock. The patch
> > > above makes the rcuc disappear from our traces. This is all I've got.
> > > I could try to debug it further if you have suggestions on how to
> > > trace the cause.
> > 
> > My current guess is that either:
> > 
> > 1.	You are allowing the host or another guest to preempt this
> > 	guest's VCPU.  Don't do that.  ;-)
> 
> We do allow the rcuc kthread to preempt the guest's vCPU (not other
> guests). The reason for this is that the workload running inside the
> guest may take 100% of the CPU, which won't allow the rcuc thread
> to ever execute.
> 
> > 2.	You are letting the rcuc kthreads contend for the worker CPUs.
> > 	Pin them to housekeeping CPUs.  This applies to both the
> > 	host and the guest rcuc kthreads, but especially to the
> > 	host rcuc kthreads.
> 
> I'd love to be able to do this, but the rcuc threads are CPU-bound
> threads. There's one per CPU and the kernel doesn't allow me to move
> them around.

Can you build with CONFIG_RCU_BOOST=n?  Then you won't have any rcuc
kthreads.  If you are preventing preemption of the VCPUs, you should
not need RCU priority boosting.

							Thanx, Paul

> > Or am I still unclear on your goals and configuration?
> > 
> > 							Thanx, Paul
> > 
> > Luiz?
> > 
> > > > > > Can we test whether the local CPU is nocb, and in that case,
> > > > > > skip rcu_start_gp entirely for example?
> > > > > 
> > > > > If you do that, you can see system hangs due to needed grace periods never
> > > > > getting started.
> > > > 
> > > > So it is not enough for CB CPUs to execute rcu_start_gp. Why is it
> > > > necessary for nocb CPUs to execute rcu_start_gp?
> > > > 
> > > > > Are you using the default value of 16 for CONFIG_RCU_FANOUT_LEAF?
> > > > > If you are using a smaller value, it would be possible to rework the
> > > > > code to reduce contention on ->lock, though if a VCPU does get preempted
> > > > > while holding the root rcu_node structure's ->lock, life will be hard.
> > > > 
> > > > It's a raw spinlock, isn't it?
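
As an illustration of the "give RT guests their own CPUs" advice above, here is a
minimal userspace sketch (not from the thread) that pins one vCPU thread to a
dedicated host CPU and gives it a SCHED_FIFO priority, roughly matching the
priority ladder shown in the ps output (rcuc/15 at 3, ksoftirqd/15 at 2,
qemu-kvm at 1). The thread ID, CPU number, and priority are placeholders; in
practice this is normally done with libvirt vcpupin, taskset, and chrt rather
than custom code:

/*
 * Sketch only: pin a vCPU thread to a dedicated host CPU and give it a
 * SCHED_FIFO priority.  All three arguments are examples, not values
 * taken from the thread above.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <tid> <cpu> <fifo-prio>\n", argv[0]);
		return 1;
	}

	pid_t tid = atoi(argv[1]);	/* vCPU thread ID (placeholder) */
	int cpu = atoi(argv[2]);	/* isolated host CPU dedicated to this vCPU */
	int prio = atoi(argv[3]);	/* SCHED_FIFO priority, e.g. 1 as in the ps output */

	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(tid, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	struct sched_param sp = { .sched_priority = prio };
	if (sched_setscheduler(tid, SCHED_FIFO, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}

	printf("pinned tid %d to CPU %d at SCHED_FIFO prio %d\n", tid, cpu, prio);
	return 0;
}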
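
The "dirty patch" above replaces an unconditional raw_spin_lock() taken with
IRQs disabled by a trylock-plus-backoff loop. The same shape, stripped of the
RCU and IRQ details, can be written with pthreads; this is only a sketch of the
pattern (try the lock, and if it is busy, give up the CPU briefly and retry),
not of the kernel code itself:

/*
 * Illustration of the lock/retry shape used by the patch above, in
 * plain pthreads.  Instead of spinning until the lock is free, try it,
 * and if it is busy, back off so the CPU can do other work meanwhile.
 */
#include <pthread.h>
#include <time.h>

static pthread_spinlock_t root_lock;

static void critical_section(void)
{
	/* stand-in for rcu_start_gp(): whatever must run under the lock */
}

static void locked_update_with_backoff(void)
{
	for (;;) {
		if (pthread_spin_trylock(&root_lock) == 0) {
			critical_section();
			pthread_spin_unlock(&root_lock);
			break;
		}
		/* Lock busy: sleep briefly instead of spinning, then retry. */
		struct timespec delay = { .tv_sec = 0, .tv_nsec = 10000 }; /* 10us */
		nanosleep(&delay, NULL);
	}
}

int main(void)
{
	pthread_spin_init(&root_lock, PTHREAD_PROCESS_PRIVATE);
	locked_update_with_backoff();
	pthread_spin_destroy(&root_lock);
	return 0;
}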
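
The 10us figure quoted above appears to come from the host-side traces Luiz
mentions. For reference, the userspace analogue of that measurement is simply
to timestamp around the lock acquisition; a minimal sketch with
CLOCK_MONOTONIC, again not taken from the thread:

/*
 * Sketch: measure how long a thread waits on a contended spinlock.
 * One thread grabs the lock and holds it briefly; the main thread
 * times its own acquisition.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_spinlock_t lock;

/* Holder thread: grab the lock and sit on it briefly to create contention. */
static void *holder(void *arg)
{
	pthread_spin_lock(&lock);
	usleep(50);			/* hold the lock for ~50us */
	pthread_spin_unlock(&lock);
	return NULL;
}

static long long elapsed_ns(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
	pthread_t t;
	struct timespec before, after;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&t, NULL, holder, NULL);
	usleep(10);			/* let the holder win the race for the lock */

	clock_gettime(CLOCK_MONOTONIC, &before);
	pthread_spin_lock(&lock);	/* spins while the holder owns the lock */
	clock_gettime(CLOCK_MONOTONIC, &after);
	pthread_spin_unlock(&lock);

	printf("waited %lld ns for the lock\n", elapsed_ns(before, after));

	pthread_join(t, NULL);
	pthread_spin_destroy(&lock);
	return 0;
}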