On Mon, May 13, 2019 at 05:13:29PM +0800, Wanpeng Li wrote:
> On Sat, 11 May 2019 at 01:17, Sean Christopherson
> <sean.j.christopherson@xxxxxxxxx> wrote:
> >
> > On Fri, May 10, 2019 at 11:34:41AM +0100, Joao Martins wrote:
> > > On 5/10/19 10:54 AM, Wanpeng Li wrote:
> > > > It is weird that we can observe intel_idle driver in the guest
> > > > executes mwait eax=0x20, and the corresponding pCPU enters C3 on HSW
> > > > server, however, we can't observe this on SKX/CLX server, it just
> > > > enters maximal C1.
> > >
> > > I assume you refer to the case where you pass the host mwait substates to the
> > > guests as is, right? Or are you zeroing/filtering out the mwait cpuid leaf EDX
> > > like my patch (attached in the previous message) suggests?
> > >
> > > Interestingly, hints set to 0x20 actually corresponds to C6 on HSW (based on
> > > intel_idle driver). IIUC From the SDM (see Vol 2B, "MWAIT for Power Management"
> > > in instruction set reference M-U) the hints register, doesn't necessarily
> > > guarantee the specified C-state depicted in the hints will be used. The manual
> > > makes it sound like it is tentative, and implementation-specific condition may
> > > either ignore it or enter a different one. It appears to be only guaranteed that
> > > it won't enter a C-{sub,}state deeper than the one depicted.
> >
> > Yep, section "MWAIT EXTENSIONS FOR ADVANCED POWER MANAGEMENT" is more
> > explicit on this point:
> >
> >   At CPL=0, system software can specify desired C-state and sub C-state by
> >   using the MWAIT hints register (EAX).  Processors will not go to C-state
> >   and sub C-state deeper than what is specified by the hint register.
> >
> > As for why SKX/CLX only enters C1, AFAICT SKX isn't configured to support
> > C3, e.g. skx_cstates in drivers/idle/intel_idle.c shows C1, C1E and C6.
> > A quick search brings up a variety of docs that confirm this.
> > My guess is that C1E provides better power/performance than C3 for the
> > majority of server workloads, e.g. C3 doesn't provide enough power savings
> > to justify its higher latency and TLB flush.
>
> You are right, I figure this out by referring to the SKX/CLX EDS, the
> Core C-States of these two generations just support CC0/CC1/CC1E/CC6.
> The issue here is after exposing mwait to the guest, SKX/CLX guest
> can't enter CC6, however, HSW guest can enter CC3/CC6. Both HSW and
> SKX/CLX hosts can enter CC6. We observe SKX/CLX guests execute mwait
> eax 0x20, however, we can't observe the corresponding pCPU enter CC6
> by turbostat or reading MSR_CORE_C6_RESIDENCY directly.

It's likely that the CPU is operating as expected and isn't dropping into
the deeper sleep state because of some heuristic or wake event.  It might
be something as simple as the combination of periodic tick interrupts
between host and guest occurring too frequently (to get to C6), or it could
be a much more complex scenario.

It's been several years since I've done anything close to hands-on debug
with C-states, so I have no idea what capabilities are available to help
debug this sort of thing.  Sorry I can't be more helpful.
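
For reference, here is a minimal sketch of how the 0x20 hint discussed above
splits into its C-state and sub C-state fields.  It assumes the EAX layout
from the SDM (bits 7:4 target C-state, bits 3:0 sub C-state); the constant
names only mirror the kernel's mwait.h for readability.  The hint tables are
illustrative: the thread confirms only that 0x20 is C6 on HSW and that
skx_cstates lists C1, C1E and C6, so the other hint values are my
recollection of intel_idle.c and should be checked against your tree.

```python
# Sketch only: field layout assumed from the SDM's MWAIT hint description.
MWAIT_SUBSTATE_SIZE = 4    # width of the sub C-state field, EAX[3:0]
MWAIT_SUBSTATE_MASK = 0xF
MWAIT_CSTATE_MASK = 0xF    # target C-state field, EAX[7:4]


def decode_mwait_hint(eax):
    """Return (cstate_field, substate_field) for an MWAIT hint value."""
    cstate = (eax >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK
    substate = eax & MWAIT_SUBSTATE_MASK
    return cstate, substate


# Illustrative hint -> intel_idle state name tables (assumed values, except
# 0x20 -> C6 on HSW and the C1/C1E/C6 set on SKX, which come from the thread).
HSW_HINTS = {0x00: "C1", 0x01: "C1E", 0x10: "C3", 0x20: "C6"}
SKX_HINTS = {0x00: "C1", 0x01: "C1E", 0x20: "C6"}


def describe(eax, table):
    """Name the state a hint requests, or note that the table lacks it."""
    cstate, substate = decode_mwait_hint(eax)
    name = table.get(eax, "not listed")
    return "hint 0x%02x: cstate field %d, substate %d, %s" % (
        eax, cstate, substate, name)


if __name__ == "__main__":
    print(describe(0x20, HSW_HINTS))
    print(describe(0x20, SKX_HINTS))
    print(describe(0x10, SKX_HINTS))
```

Keep in mind the hint is only an upper bound on depth, per the SDM text
quoted above, so decoding it says what the guest asked for, not what the
pCPU actually entered.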