On 01/07/2011 12:29 AM, Mike Galbraith wrote:
+#ifdef CONFIG_SMP
+	/*
+	 * If this yield is important enough to want to preempt instead
+	 * of only dropping a ->next hint, we're alone, and the target
+	 * is not alone, pull the target to this cpu.
+	 *
+	 * NOTE: the target may be alone in it's cfs_rq if another class
+	 * task or another task group is currently executing on it's cpu.
+	 * In this case, we still pull, to accelerate it toward the cpu.
+	 */
+	if (cfs_rq != p_cfs_rq && preempt && cfs_rq->nr_running == 1 &&
+	    cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
+		pull_task(task_rq(p), p, this_rq(), this_cpu);
+		p_cfs_rq = cfs_rq_of(pse);
+	}
+#endif
This causes some fun issues in a simple test case on my system. The test consists of two 4-VCPU KVM guests, bound to the same 4 CPUs on the host. One guest runs the AMQP performance test, the other guest is totally idle.

That means that besides the 4 very busy VCPUs, there are only a few percent of CPU time used by background tasks from the idle guest and the qemu-kvm userspace bits. However, with this patch the busy guest ends up restricted to just 3 of the 4 CPUs, leaving one idle!

A simple explanation: the pulling code above will pull another VCPU onto the local CPU whenever we run into contention inside the guest and some random background task happens to be running on the CPU where that other VCPU was. From that point on, the 4 VCPUs stay on 3 CPUs, leaving one idle. Any time we have contention inside the guest (which is pretty frequent), we move whoever is not currently running to another CPU.

Cgroups only make matters worse: libvirt places each KVM guest in its own cgroup, so a VCPU will generally always be alone on its own per-cgroup, per-cpu runqueue. That can lead to pulling a VCPU onto our local CPU because we think we are alone, when in reality we share the CPU with others...

Removing the pulling code allows me to use all 4 CPUs with a 4-VCPU KVM guest in an uncontended situation.
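As a reminder of why the "we're alone" test is misleading here: with CONFIG_FAIR_GROUP_SCHED, cfs_rq_of() resolves to the entity's per-group, per-cpu runqueue, so a lone VCPU in its libvirt cgroup always sees cfs_rq->nr_running == 1, no matter how busy the CPU really is. Quoting roughly from memory (the exact code may differ between kernel versions):

#ifdef CONFIG_FAIR_GROUP_SCHED
/* runqueue on which this entity is (to be) queued */
static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
{
	/* the group's per-cpu cfs_rq, not the CPU's top-level runqueue */
	return se->cfs_rq;
}
#endif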
+	/* Tell the scheduler that we'd really like pse to run next. */
+	p_cfs_rq->next = pse;
Using set_next_buddy() instead propagates the hint up to the root, allowing the scheduler to actually know who we want to run next when cgroups are involved.
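For reference, set_next_buddy() looks roughly like this (quoting from memory, so the exact form may differ); the for_each_sched_entity() walk is what pushes the hint up through the task group hierarchy:

static void set_next_buddy(struct sched_entity *se)
{
	if (likely(task_of(se)->policy != SCHED_IDLE)) {
		/* with FAIR_GROUP_SCHED this walks from the task's se up
		 * through its group entities toward the root cfs_rq */
		for_each_sched_entity(se)
			cfs_rq_of(se)->next = se;
	}
}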
+	/* We know whether we want to preempt or not, but are we allowed? */
+	if (preempt && same_thread_group(p, task_of(p_cfs_rq->curr)))
+		resched_task(task_of(p_cfs_rq->curr));
With this in place, we can get into a situation where we gladly give up CPU time, but never actually hand any of it to the other VCPUs in our guest.

I believe we can get rid of that test, because pick_next_entity already makes sure it ignores ->next if picking ->next would lead to unfairness (see the sketch below). Removing this test (and simplifying yield_to_task_fair) seems to lead to more predictable test results.

I'll send the updated patch in another email, since this one is already way too long for a changelog :)
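For reference, the buddy handling in pick_next_entity() looks roughly like this (again from memory, details may vary by kernel version). wakeup_preempt_entity() returns 1 when running the buddy would be too unfair to the leftmost task, in which case the ->next hint is simply ignored:

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = __pick_next_entity(cfs_rq);
	struct sched_entity *left = se;

	/* honour the ->next hint only if it is not too unfair */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	/* prefer the last buddy, to return the CPU to a preempted task */
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;

	clear_buddies(cfs_rq, se);

	return se;
}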