On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote: > * Andrew Theurer <habanero@xxxxxxxxxxxxxxxxxx> [2012-09-11 13:27:41]: > > > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > > > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) > > > >>>> +{ > > > >>>> + if (!curr->sched_class->yield_to_task) > > > >>>> + return false; > > > >>>> + > > > >>>> + if (curr->sched_class != p->sched_class) > > > >>>> + return false; > > > >>> > > > >>> > > > >>> Peter, > > > >>> > > > >>> Should we also add a check if the runq has a skip buddy (as pointed out > > > >>> by Raghu) and return if the skip buddy is already set. > > > >> > > > >> Oh right, I missed that suggestion.. the performance improvement went > > > >> from 81% to 139% using this, right? > > > >> > > > >> It might make more sense to keep that separate, outside of this > > > >> function, since its not a strict prerequisite. > > > >> > > > >>>> > > > >>>> + if (task_running(p_rq, p) || p->state) > > > >>>> + return false; > > > >>>> + > > > >>>> + return true; > > > >>>> +} > > > >> > > > >> > > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > > > >>> bool preempt) > > > >>>> rq = this_rq(); > > > >>>> > > > >>>> again: > > > >>>> + /* optimistic test to avoid taking locks */ > > > >>>> + if (!__yield_to_candidate(curr, p)) > > > >>>> + goto out_irq; > > > >>>> + > > > >> > > > >> So add something like: > > > >> > > > >> /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > > > >> if (p_rq->cfs_rq->skip) > > > >> goto out_irq; > > > >>> > > > >>> > > > >>>> p_rq = task_rq(p); > > > >>>> double_rq_lock(rq, p_rq); > > > >>> > > > >>> > > > >> But I do have a question on this optimization though,.. Why do we check > > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > > >> > > > >> That is, I'd like to see this thing explained a little better. > > > >> > > > >> Does it go something like: p_rq is the runqueue of the task we'd like to > > > >> yield to, rq is our own, they might be the same. If we have a ->skip, > > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > > > >> failing the yield_to() simply means us picking the next VCPU thread, > > > >> which might be running on an entirely different cpu (rq) and could > > > >> succeed? > > > > > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > > > skip check. Raghu, I am not sure if this is exactly what you want > > > > implemented in v4. > > > > > > > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > > > My intention was to if there is a directed_yield happened in runqueue > > > (say rqA), do not bother to directed yield to that. But unfortunately as > > > PeterZ pointed that would have resulted in setting next buddy of a > > > different run queue than rqA. > > > So we can drop this "skip" idea. Pondering more over what to do? can we > > > use next buddy itself ... thinking.. > > > > As I mentioned earlier today, I did not have your changes from kvm.git > > tree when I tested my changes. Here are your changes and my changes > > compared: > > > > throughput in MB/sec > > > > kvm_vcpu_on_spin changes: 4636 +/- 15.74% > > yield_to changes: 4515 +/- 12.73% > > > > I would be inclined to stick with your changes which are kept in kvm > > code. I did try both combined, and did not get good results: > > > > both changes: 4074 +/- 19.12% > > > > So, having both is probably not a good idea. However, I feel like > > there's more work to be done. With no over-commit (10 VMs), total > > throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some > > overhead, but a reduction to ~4500 is still terrible. By contrast, > > 8-way VMs with 2x over-commit have a total throughput roughly 10% less > > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread > > host). We still have what appears to be scalability problems, but now > > it's not so much in runqueue locks for yield_to(), but now > > get_pid_task(): > > > > Hi Andrew, > IMHO, reducing the double runqueue lock overhead is a good idea, > and may be we see the benefits when we increase the overcommit further. > > The explaination for not seeing good benefit on top of PLE handler > optimization patch is because we filter the yield_to candidates, > and hence resulting in less contention for double runqueue lock. > and extra code overhead during genuine yield_to might have resulted in > some degradation in the case you tested. > > However, did you use cfs.next also?. I hope it helps, when we combine. > > Here is the result that is showing positive benefit. > I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s). > > +-----------+-----------+-----------+------------+-----------+ > kernbench time in sec, lower is better > +-----------+-----------+-----------+------------+-----------+ > base stddev patched stddev %improve > +-----------+-----------+-----------+------------+-----------+ > 1x 44.3880 1.8699 40.8180 1.9173 8.04271 > 2x 96.7580 4.2787 93.4188 3.5150 3.45108 > +-----------+-----------+-----------+------------+-----------+ > > > +-----------+-----------+-----------+------------+-----------+ > ebizzy record/sec higher is better > +-----------+-----------+-----------+------------+-----------+ > base stddev patched stddev %improve > +-----------+-----------+-----------+------------+-----------+ > 1x 2374.1250 50.9718 3816.2500 54.0681 60.74343 > 2x 2536.2500 93.0403 2789.3750 204.7897 9.98029 > +-----------+-----------+-----------+------------+-----------+ > > > Below is the patch which combine suggestions of peterZ on your > original approach with cfs.next (already posted by Srikar in the other > thread) I did get a chance to run with the below patch and your changes in kvm.git, but the results were not too different: Dbench, 10 x 16-way VMs on 80-way host: kvm_vcpu_on_spin changes: 4636 +/- 15.74% yield_to changes: 4515 +/- 12.73% both changes from above: 4074 +/- 19.12% ...plus cfs.next check: 4418 +/- 16.97% Still hovering around 4500 MB/sec The concern I have is that even though we have gone through changes to help reduce the candidate vcpus we yield to, we still have a very poor idea of which vcpu really needs to run. The result is high cpu usage in the get_pid_task and still some contention in the double runqueue lock. To make this scalable, we either need to significantly reduce the occurrence of the lock-holder preemption, or do a much better job of knowing which vcpu needs to run (and not unnecessarily yielding to vcpus which do not need to run). On reducing the occurrence: The worst case for lock-holder preemption is having vcpus of same VM on the same runqueue. This guarantees the situation of 1 vcpu running while another [of the same VM] is not. To prove the point, I ran the same test, but with vcpus restricted to a range of host cpus, such that any single VM's vcpus can never be on the same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, vcpu-1's are on host cpus 5-9, and so on. Here is the result: kvm_cpu_spin, and all yield_to changes, plus restricted vcpu placement: 8823 +/- 3.20% much, much better On picking a better vcpu to yield to: I really hesitate to rely on paravirt hint [telling us which vcpu is holding a lock], but I am not sure how else to reduce the candidate vcpus to yield to. I suspect we are yielding to way more vcpus than are prempted lock-holders, and that IMO is just work accomplishing nothing. Trying to think of way to further reduce candidate vcpus.... -Andrew > > ----8<---- > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index fbf1fd0..8551f57 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4820,6 +4820,24 @@ void __sched yield(void) > } > EXPORT_SYMBOL(yield); > > +/* > + * Tests preconditions required for sched_class::yield_to(). > + */ > +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, > + struct rq *p_rq) > +{ > + if (!curr->sched_class->yield_to_task) > + return false; > + > + if (curr->sched_class != p->sched_class) > + return false; > + > + if (task_running(p_rq, p) || p->state) > + return false; > + > + return true; > +} > + > /** > * yield_to - yield the current processor to another thread in > * your thread group, or accelerate that thread toward the > @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt) > > again: > p_rq = task_rq(p); > + > + /* optimistic test to avoid taking locks */ > + if (!__yield_to_candidate(curr, p, p_rq)) > + goto out_irq; > + > + /* if next buddy is set, assume yield is in progress */ > + if (p_rq->cfs.next) > + goto out_irq; > + > double_rq_lock(rq, p_rq); > while (task_rq(p) != p_rq) { > double_rq_unlock(rq, p_rq); > goto again; > } > > - if (!curr->sched_class->yield_to_task) > - goto out; > - > - if (curr->sched_class != p->sched_class) > - goto out; > - > - if (task_running(p_rq, p) || p->state) > - goto out; > + /* validate state, holding p_rq ensures p's state cannot change */ > + if (!__yield_to_candidate(curr, p, p_rq)) > + goto out_unlock; > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > if (yielded) { > @@ -4877,8 +4899,9 @@ again: > rq->skip_clock_update = 0; > } > > -out: > +out_unlock: > double_rq_unlock(rq, p_rq); > +out_irq: > local_irq_restore(flags); > > if (yielded) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html