Hi Paolo,

On Tue, Mar 14, 2017 at 05:40:21PM +0100, Paolo Bonzini wrote:
>
>
> On 02/03/2017 14:59, Marcelo Tosatti wrote:
> > On Thu, Mar 02, 2017 at 11:15:00AM +0100, Paolo Bonzini wrote:
> >> one obvious downside is that any application that you
> >> run after DPDK will have its CPU frequency hardcoded to something that
> >> is not appropriate.
> >
> > To isolate the CPU where DPDK runs it is already necessary to perform
> > special procedures such as changing the cpumask of other tasks, changing
> > cpumask of interrupt handlers (to remove the isolated CPU from that
> > cpumask), etc. Changing the cpufreq governor to userspace is another
> > step of that setup phase.
> >
> > On shutdown (or CPU unpin), you can switch back the CPU to the previous
> > governor, which can switch the frequency to whatever it finds suitable.
>
> But I thought that one of the reasons to do NFV is to simplify this
> setup. If you now have to do the same thing on virtual machines, things
> become more complicated to set up, and I don't think that NFV virtual
> machines are _that_ special.
>
> In addition, in the list of setup steps above you forgot "chmod the
> sysfs files for cpufreq so that DPDK can access it". Doing that chmod
> is a very explicit act, and that's unlike the functionality of this patch.
>
> By letting virtual machines do the same with a simple hypercall, you're
> giving powers to whoever opens /dev/kvm that they didn't have before
> (unless the userspace process also had access to sysfs). Worse, the
> effects last beyond the moment /dev/kvm is closed.

This can be fixed by requiring qemu-kvm-vcpu thread, which runs the
hypercall, to have sufficient priority (similar to other cpufreq users).

Fine, good point.

> So, the question then is how to design the hypervisor so that these NFV
> virtual machines can play with cpufreq, but there are no adverse
> indefinite effects.

Ok, we can modify the cpufreq cgroups patch so that the hypercalls set
the capacity_min == capacity_max values:

"The first three patches of this series introduces capacity_{min,max}
tracking in the core scheduler, as an extension of the CPU controller."

Setting capacity_min == capacity_max forces the CPU to run at that
frequency, given there are no other tasks requesting frequency
information on that CPU.

This is good enough for DPDK.

> One possibility is to have some kind of per-task
> cpufreq. Another is to do everything in userspace with virtual ACPI
> P-states and the userspace governor in the VM.

Virtual ACPI P-states, that is an option. But why not do it in-kernel?
The exit to userspace can be a significant fraction of the total if the
frequency change time is small (say, 10us for the frequency change and
5us for the userspace exit).

> I was hoping to get more feedback from linux-pm.
>
> >> Here are two possibilities that I could think of:
> >>
> >> 1) Introduce a mechanism that allows a task to override the governor's
> >> choice of CPU frequency. This could be a ioctl, a prctl, a cgroup-based
> >> mechanism or whatever else. As Marcelo pointed out in the original kvm@
> >> thread, the latency and overhead of switching frequencies make it
> >> impractical to associate a desired CPU frequency with a task, because
> >> multiple tasks could be requesting a given frequency. One possibility
> >> could be to treat the per-task CPU frequency as advisory
> >
> > DPDK can't afford the frequency as advisory: failure in setting the
> > processor frequency when requested means dropped packets (not
> > dropping packets being a requirement).
>
> It can be advisory if you document a proper configuration where it's obeyed.

Sure.

>
> Paolo
>
> >> and only obey
> >> it in restricted cases---for example only if nohz_full is in effect.
> >
> > From cpufreq documentation:
> >
> > "On all other cpufreq implementations, these boundaries still need to
> > be set. Then, a "governor" must be selected.
> > Such a "governor" decides
> > what speed the processor shall run within the boundaries. One such
> > "governor" is the "userspace" governor. This one allows the user - or
> > a yet-to-implement userspace program - to decide what specific speed
> > the processor shall run at."
> >
> > (it seems the cpufreq-hypercall+cpufreq-userspace combination is in
> > accord with what cpufreq-userspace has been designed for).
> >
> > Secondly, setting frequencies for multiple tasks is somewhat
> > contradictory:
> >
> > In the DPDK context, or in any context actually, it makes sense for a
> > program to lower processor frequency when it decides the current
> > frequency is sufficient to handle the job: that is lowering the
> > frequency will still make it possible to handle the load.
> >
> > With multiple applications sharing that processor, the percentage
> > of time given to a certain application also interferes with the
> > time it spends handling the job. So the other variable that
> > affects "instructions per second" is timeslice given to the
> > task by the scheduler, not only "frequency".
> >
> > Having a task request for a particular frequency in that case becomes
> > ambiguous: you could be asking for "increased timeslice".
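
As a rough illustration of that ambiguity, here is a small sketch. It is
not taken from any patch in this thread: the function name and the
numbers are made up, and it assumes a fixed number of instructions per
cycle, so that the work a task gets done scales with the cycles it
receives.

```python
def effective_cycles_per_second(freq_hz: float, cpu_share: float) -> float:
    """Cycles per second that a task actually receives.

    freq_hz   -- CPU clock frequency in Hz
    cpu_share -- fraction of CPU time the scheduler gives the task (0.0 to 1.0)

    Hypothetical helper for illustration only; it ignores frequency
    transition latency and cache effects.
    """
    if not 0.0 <= cpu_share <= 1.0:
        raise ValueError("cpu_share must be between 0.0 and 1.0")
    return freq_hz * cpu_share


# A task running alone on a CPU clocked at 1.2 GHz ...
alone_at_low_freq = effective_cycles_per_second(1.2e9, 1.0)

# ... receives the same number of cycles as a task that gets half of the
# timeslices on a CPU clocked at 2.4 GHz.
shared_at_high_freq = effective_cycles_per_second(2.4e9, 0.5)
```

Under these assumptions both configurations deliver the same throughput,
so on a shared CPU a request for "more frequency" cannot be told apart
from a request for "more timeslice". On an isolated CPU the share is
fixed at 1.0 and frequency is the only variable left.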