On Sat, Sep 15, 2012 at 09:38:54PM +0530, Raghavendra K T wrote:
> On 09/14/2012 10:40 PM, Andrew Jones wrote:
> >On Thu, Sep 13, 2012 at 04:30:58PM -0500, Andrew Theurer wrote:
> >>On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> >>>* Andrew Theurer <habanero@xxxxxxxxxxxxxxxxxx> [2012-09-11 13:27:41]:
> >>>
> [...]
> >>
> >>On picking a better vcpu to yield to: I really hesitate to rely on the
> >>paravirt hint [telling us which vcpu is holding a lock], but I am not
> >>sure how else to reduce the candidate vcpus to yield to. I suspect we
> >>are yielding to way more vcpus than are preempted lock-holders, and that
> >>IMO is just work accomplishing nothing. Trying to think of a way to
> >>further reduce candidate vcpus....
> >>
> >
> >wrt yielding to vcpus for the same cpu, I recently noticed that
> >there's a bug in yield_to_task_fair. yield_task_fair() calls
> >clear_buddies(), so if we're yielding to a task that has been running on
> >the same cpu that we're currently running on, and thus is also on the
> >current cfs runqueue, then our 'who to pick next' hint is getting cleared
> >right after we set it.
> >
> >I had hoped that the patch below would show a general improvement in the
> >vcpu overcommit performance, however the results were variable - no worse,
> >no better. Based on your results above showing good improvement from
> >interleaving vcpus across the cpus, that means there was a decent
> >percentage of these types of yields going on. So since the patch didn't
> >change much, that indicates that the next hinting isn't generally taken
> >too seriously by the scheduler. Anyway, the patch should correct the
> >code per its design, and testing shows that it didn't make anything worse,
> >so I'll post it soon. Also, in order to try and improve how far set-next
> >can jump ahead in the queue, I tested a kernel with group scheduling
> >compiled out (libvirt uses cgroups and I'm not sure how autogroups may
> >affect things). I did get a slight improvement with that, but nothing to
> >write home to mom about.
> >
> >diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >index c219bf8..7d8a21d 100644
> >--- a/kernel/sched/fair.c
> >+++ b/kernel/sched/fair.c
> >@@ -3037,11 +3037,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
> > 	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
> > 		return false;
> >
> >+	/* We're yielding, so tell the scheduler we don't want to be picked */
> >+	yield_task_fair(rq);
> >+
> > 	/* Tell the scheduler that we'd really like pse to run next. */
> > 	set_next_buddy(se);
> >
> >-	yield_task_fair(rq);
> >-
> > 	return true;
> > }
> >
>
> Hi Drew, Agree with your fix and tested the patch too... results are
> pretty much the same. Puzzled why so.

Looking at the code, I see that the next hint might be used more frequently
if we bump up the sysctl kernel.sched_wakeup_granularity_ns (see the sketch
of the buddy-selection logic at the end of this mail). I also just found out
that some virt tuned profiles do that, so maybe I should try running with
one of those profiles.

>
> thinking ... maybe we hit this when #vcpu (of a VM) > #pcpu?
> (pigeonhole principle ;)).

Not sure, but I haven't done any experiments where a single VM has more
vcpus than the system has pcpus. For my vcpu overcommit testing I increase
the VM count, where each VM has #vcpus <= #pcpus.

Drew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
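
For reference on the sched_wakeup_granularity_ns point above: the 'next'
buddy set by set_next_buddy() is only honored by pick_next_entity() when the
buddy has not fallen too far behind the leftmost (smallest-vruntime) entity,
and that threshold is derived from the wakeup granularity. The sketch below
paraphrases the relevant checks from kernel/sched/fair.c of roughly that era;
it is a simplified illustration, not verbatim kernel source, and the details
vary across kernel versions (the skip/last buddy handling is elided).

/*
 * Sketch: how the set_next_buddy() hint interacts with the wakeup
 * granularity when CFS picks the next entity to run.
 */

/* Returns 1 if 'curr' trails 'se' by more than the wakeup granularity. */
static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	/* wakeup_gran() converts sysctl_sched_wakeup_granularity into vruntime units for se */
	gran = wakeup_gran(curr, se);
	if (vdiff > gran)
		return 1;

	return 0;
}

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	struct sched_entity *left = __pick_first_entity(cfs_rq);	/* smallest vruntime */
	struct sched_entity *se = left;

	/* ... skip and last buddy handling elided ... */

	/*
	 * The hint from set_next_buddy() is taken only when the buddy's
	 * vruntime is within the wakeup granularity of the leftmost task,
	 * so raising kernel.sched_wakeup_granularity_ns widens the window
	 * in which a directed yield's hint is actually honored.
	 */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	return se;
}

This is also consistent with the ordering fix in the patch above: when
yield_task_fair() runs after set_next_buddy(), the hint can be cleared before
pick_next_entity() ever gets a chance to evaluate it, regardless of the
granularity setting.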