Re: [patch 0/3] KVM CPU frequency change hypercalls (resend)

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Tue, 14 Mar 2017 20:27:51 -0300

Hi Paolo,

On Tue, Mar 14, 2017 at 05:40:21PM +0100, Paolo Bonzini wrote:
> 
> 
> On 02/03/2017 14:59, Marcelo Tosatti wrote:
> > On Thu, Mar 02, 2017 at 11:15:00AM +0100, Paolo Bonzini wrote:
> >>  one obvious downside is that any application that you
> >> run after DPDK will have its CPU frequency hardcoded to something that
> >> is not appropriate.  
> > 
> > To isolate the CPU where DPDK runs it is already necessary to perform
> > special procedures such as changing the cpumask of other tasks, changing
> > cpumask of interrupt handlers (to remove the isolated CPU from that
> > cpumask), etc. Changing the cpufreq governor to userspace is another
> > step of that setup phase.
> > 
> > On shutdown (or CPU unpin), you can switch back the CPU to the previous
> > governor, which can switch the frequency to whatever it finds suitable.
> 
> But I thought that one of the reasons to do NFV is to simplify this
> setup.  If you now have to do the same thing on virtual machines, things
> become more complicated to set up, and I don't think that NFV virtual
> machines are _that_ special.
> 
> In addition, in the list of setup steps above you forgot "chmod the
> sysfs files for cpufreq so that DPDK can access it".  Doing that chmod
> is a very explicit act, and that's unlike the functionality of this patch.
> 
> By letting virtual machines do the same with a simple hypercall, you're
> giving powers to whoever opens /dev/kvm that they didn't have before
> (unless the userspace process also had access to sysfs).  Worse, the
> effects last beyond the moment /dev/kvm is closed.

This can be fixed by requiring the qemu-kvm-vcpu thread, which runs
the hypercall, to have sufficient priority (similar to other cpufreq
users). Fine, good point.

> So, the question then is how to design the hypervisor so that these NFV
> virtual machines can play with cpufreq, but there are no adverse
> indefinite effects. 

Ok, we can modify the cpufreq cgroups patch so that the hypercalls
set the values it introduces:

"The first three patches of this series introduces
capacity_{min,max} tracking
in the core scheduler, as an extension of the CPU controller."

That is, set capacity_min == capacity_max (which forces the CPU to run
at that frequency, given there are no other tasks requesting
frequency information on that CPU).

This is good enough for DPDK.
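
To illustrate why equal bounds pin the frequency, here is a toy model
of the clamping, not the scheduler code from that series. The function
name and the 0..1024 capacity scale used in the example are
assumptions made for illustration only.

```python
# Toy model (assumption): a requested capacity is clamped into
# [capacity_min, capacity_max]. clamp_capacity() is a hypothetical
# helper, not a kernel or cgroup interface.

def clamp_capacity(requested, capacity_min, capacity_max):
    """Return the capacity that would be honoured for a request."""
    if capacity_min > capacity_max:
        raise ValueError("capacity_min must not exceed capacity_max")
    return max(capacity_min, min(requested, capacity_max))


if __name__ == "__main__":
    # With capacity_min == capacity_max the result no longer depends
    # on what is requested, so the operating point is fixed.
    for requested in (0, 300, 1024):
        print(requested, "->", clamp_capacity(requested, 512, 512))
```

With distinct bounds, e.g. clamp_capacity(300, 200, 800), the request
passes through unchanged, which is the case a pinned DPDK CPU wants
to avoid.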

> One possibility is to have some kind of per-task
> cpufreq.  Another is to do everything in userspace with virtual ACPI
> P-states and the userspace governor in the VM.

Virtual ACPI P-states are an option. But why not do it in-kernel?
The exit to userspace can be a significant fraction of the total
time if the frequency change itself is fast (say, 10us for the
frequency change and 5us for the userspace exit).
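
As a quick check of those example numbers (they are illustrative
figures, not measurements), the share of the total spent on the
userspace exit can be computed as below. The function name is
hypothetical.

```python
def exit_overhead_fraction(freq_change_us, userspace_exit_us):
    """Fraction of the total latency spent on the userspace exit."""
    total_us = freq_change_us + userspace_exit_us
    return userspace_exit_us / total_us


if __name__ == "__main__":
    # 10us frequency change plus 5us userspace exit: 5 / 15 of the total.
    print(f"{exit_overhead_fraction(10, 5):.0%}")  # prints 33%
```

So with these figures a third of the total latency would be the
userspace exit, i.e. the exit adds 50% on top of the frequency change
itself.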

> I was hoping to get more feedback from linux-pm.
> 
> >> Here are two possibilities that I could think of:
> >>
> >> 1) Introduce a mechanism that allows a task to override the governor's
> >> choice of CPU frequency.  This could be an ioctl, a prctl, a cgroup-based
> >> mechanism or whatever else.  As Marcelo pointed out in the original kvm@
> >> thread, the latency and overhead of switching frequencies make it
> >> impractical to associate a desired CPU frequency with a task, because
> >> multiple tasks could be requesting a given frequency.  One possibility
> >> could be to treat the per-task CPU frequency as advisory
> > 
> > DPDK can't afford to treat the frequency as advisory: failure to set
> > the processor frequency when requested means dropped packets (not
> > dropping packets being a requirement).
> 
> It can be advisory if you document a proper configuration where it's obeyed.

Sure.

> 
> Paolo
> 
> >>  and only obey
> >> it in restricted cases---for example only if nohz_full is in effect.
> > 
> > From cpufreq documentation:
> > 
> > "On all other cpufreq implementations, these boundaries still need to
> > be set. Then, a "governor" must be selected. Such a "governor" decides
> > what speed the processor shall run within the boundaries. One such
> > "governor" is the "userspace" governor. This one allows the user - or
> > a yet-to-implement userspace program - to decide what specific speed
> > the processor shall run at."
> > 
> > (it seems the cpufreq-hypercall+cpufreq-userspace combination is in 
> > accord with what cpufreq-userspace has been designed for).
> > 
> > Secondly, setting frequencies for multiple tasks is somewhat
> > contradictory:
> > 
> > In the DPDK context, or in any context actually, it makes sense for a
> > program to lower processor frequency when it decides the current 
> > frequency is sufficient to handle the job: that is lowering the
> > frequency will still make it possible to handle the load.
> > 
> > With multiple applications sharing that processor, the percentage 
> > of time given to a certain application also interferes with the
> > time it spends handling the job. So the other variable that 
> > affects "instructions per second" is timeslice given to the
> > task by the scheduler, not only "frequency".
> > 
> > Having a task request for a particular frequency in that case becomes
> > ambiguous: you could be asking for "increased timeslice".
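
The ambiguity can be made concrete with a deliberately simple model,
assuming the work a task gets done scales with frequency times its
share of CPU time (this ignores IPC, caches and switching costs). The
function name is hypothetical.

```python
def effective_cycles_per_sec(freq_hz, cpu_share):
    """Cycles per second a task receives under the simple model."""
    if not 0.0 <= cpu_share <= 1.0:
        raise ValueError("cpu_share must be between 0 and 1")
    return freq_hz * cpu_share


if __name__ == "__main__":
    # Doubling the frequency and doubling the timeslice are
    # indistinguishable from the task's point of view in this model.
    half_share_fast = effective_cycles_per_sec(2_000_000_000, 0.5)
    full_share_slow = effective_cycles_per_sec(1_000_000_000, 1.0)
    print(half_share_fast == full_share_slow)  # prints True
```

A task on a shared CPU that asks for "more frequency" could therefore
be served equally well by a larger timeslice, which is why the
request only has a clear meaning on an isolated CPU.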