On Fri, Mar 31, 2023 at 01:49:48AM +0100, Matthew Wilcox wrote:
> On Thu, Mar 30, 2023 at 03:43:35PM -0700, David Dai wrote:
> > Hi,
> >
> > This patch series is a continuation of the talk Saravana gave at LPC 2022
> > titled "CPUfreq/sched and VM guest workload problems" [1][2][3]. The gist
> > of the talk is that workloads running in a guest VM get terrible task
> > placement and DVFS behavior when compared to running the same workload in
>
> DVFS?  Some new filesystem, perhaps?
>

Dynamic Voltage and Frequency Scaling (DVFS) -- it's a well known term in
cpufreq/cpuidle/schedutil land.

> > the host. Effectively, no EAS for threads inside VMs. This would make power
>
> EAS?
>

Energy Aware Scheduling (EAS) is mostly a kernel/sched thing that has an
impact on cpufreq and my recollection is that it was discussed at
conferences long before kernel/sched had any EAS awareness. I don't have
the full series in my inbox and didn't dig further, but patch 1 at least
is providing additional information to schedutil which impacts CPU
frequency selection on systems to varying degrees (a rough sketch of that
util-to-frequency mapping is further down). The full impact would depend
on what cpufreq driver is in use and the specific hardware, so even if
the series benefits one set of hardware, it's not necessarily a
guaranteed win.

> Two unfamiliar and undefined acronyms in your opening paragraph.
> You're not making me want to read the rest of your opus.

It depends on the audience and mm/ is not the audience. VM in the title
refers to Virtual Machine, not Virtual Memory, although I confess I
initially read it as mm/ and wondered how mm/ affects DVFS to the extent
that it triggered a "wtf happened in mm/ recently that I completely
missed?". This series is mostly of concern to scheduler, cpufreq or KVM
people, depending on your perspective.

For example, on KVM, I'd immediately wonder if the hypercall overhead
exceeds any benefit from better task placement, although the cover letter
suggests the answer is "no". However, it didn't comment (or I didn't read
carefully enough) on whether MMIO overhead or alternative communication
methods have a constant cost across different hardware or, much more
likely, depend on the hardware and so could potentially be opt-in.
Various cpufreq hardware has very different costs when measuring or
altering the CPU frequency, even within different generations of chips
from the same vendor.

While the data also shows performance improvements, it doesn't indicate
how close to bare metal the improvement is. Even if it's 50% faster
within a VM, how much slower than bare metal is it? In terms of data
presentation, it might be better to assign bare metal a score of 1 as the
best possible score and show the VM performance as a relative ratio (1.00
for bare metal, 0.5 for a VM with a vanilla kernel, 0.75 for a VM using
improved task placement).

It would also be preferred to have x86-64 data as the hazards the series
deals with impact both arm64 and x86-64. x86-64 has the additional
challenge that cpufreq is often managed by the hardware, so it should be
demonstrated that the series "does no harm" on x86-64 for recent
generation Intel and AMD chips if possible. The lack of that data doesn't
kill the series as a large improvement is still very interesting, even if
it's not perfect and possibly specific to arm64.
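For anyone unfamiliar with schedutil, the util-to-frequency mapping
mentioned earlier is roughly as follows. This is only a standalone
paraphrase of what get_next_freq()/map_util_freq() do in
kernel/sched/cpufreq_schedutil.c, not code from this series, and the
names and numbers are made up purely for illustration:

#include <stdio.h>

/*
 * Sketch of how schedutil turns a utilisation value into a frequency
 * request. Any extra utilisation information a guest feeds in would
 * land in 'util' here.
 */
static unsigned long sketch_next_freq(unsigned long util,
				      unsigned long max_freq,
				      unsigned long max_cap)
{
	/*
	 * ~1.25x headroom so max_freq is requested before utilisation
	 * is pegged at 100% of capacity.
	 */
	return (max_freq + (max_freq >> 2)) * util / max_cap;
}

int main(void)
{
	unsigned long max_freq = 2000000;	/* kHz */
	unsigned long max_cap  = 1024;		/* capacity scale */

	printf("util 400 -> %lu kHz\n",
	       sketch_next_freq(400, max_freq, max_cap));
	printf("util 600 -> %lu kHz\n",
	       sketch_next_freq(600, max_freq, max_cap));
	return 0;
}

Whatever comes out of that calculation is only a request. It's the
cpufreq driver and ultimately the hardware that act on it, which is
where the per-driver and per-hardware differences mentioned above come
in.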
If this *was* my area or I happened to be paying close attention to it at
the time, I would likely favour using hypercalls only at the start because
they can be used universally, and suggest adding alternative communication
methods later using the same metric: "is an alternative method of
Guest<->Host communication worse, neutral or better at getting close to
bare metal performance?" I'd also push for the ratio tables as it's easier
to see at a glance how close to bare metal performance the series
achieves. Finally, I would look for x86-64 data just in case the series
causes harm due to hypercall overhead on chips that manage frequency in
firmware.

So while I haven't read the series and only patches 2+6 reached my inbox,
I understand the point in principle. The scheduler on wakeup paths for
bare metal also tries to favour recently used CPUs and avoid spurious CPU
migrations, even though that is only tangentially related to EAS. For
example, a recently used CPU may still be polling
(drivers/cpuidle/poll_state.c:poll_idle) or at least not have entered a
deep C-state, so the wakeup penalty is lower.

So whatever criticism the series deserves, it's not due to using obscure
terms that no one in kernel/sched/, drivers/cpuidle or drivers/cpufreq
would recognise.

-- 
Mel Gorman
SUSE Labs