On Fri, May 10, 2019 at 11:34:41AM +0100, Joao Martins wrote:
> On 5/10/19 10:54 AM, Wanpeng Li wrote:
> > It is weird that we can observe intel_idle driver in the guest
> > executes mwait eax=0x20, and the corresponding pCPU enters C3 on HSW
> > server, however, we can't observe this on SKX/CLX server, it just
> > enters maximal C1.
>
> I assume you refer to the case where you pass the host mwait substates to the
> guests as is, right? Or are you zeroing/filtering out the mwait cpuid leaf EDX
> like my patch (attached in the previous message) suggests?
>
> Interestingly, hints set to 0x20 actually corresponds to C6 on HSW (based on
> intel_idle driver). IIUC From the SDM (see Vol 2B, "MWAIT for Power Management"
> in instruction set reference M-U) the hints register, doesn't necessarily
> guarantee the specified C-state depicted in the hints will be used. The manual
> makes it sound like it is tentative, and implementation-specific condition may
> either ignore it or enter a different one. It appears to be only guaranteed that
> it won't enter a C-{sub,}state deeper than the one depicted.

Yep, section "MWAIT EXTENSIONS FOR ADVANCED POWER MANAGEMENT" is more
explicit on this point:

  At CPL=0, system software can specify desired C-state and sub C-state by
  using the MWAIT hints register (EAX). Processors will not go to C-state
  and sub C-state deeper than what is specified by the hint register.

As for why SKX/CLX only enters C1, AFAICT SKX isn't configured to support
C3, e.g. skx_cstates in drivers/idle/intel_idle.c shows C1, C1E and C6. A
quick search brings up a variety of docs that confirm this. My guess is
that C1E provides better power/performance than C3 for the majority of
server workloads, e.g. C3 doesn't provide enough power savings to justify
its higher latency and TLB flush.
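
For reference, a rough sketch of how the EAX hint discussed above splits into
its two fields. This assumes the layout the SDM describes for the MWAIT hints
register (bits 7:4 select the target C-state, bits 3:0 the sub C-state). The
helper and macro names are made up for illustration and are not the kernel's.
Note that the SDM numbers the C-state field from zero (0 means C1, 1 means C2,
and so on, with 0xf meaning C0), and that these are processor-specific MWAIT
C-states rather than the names intel_idle gives them.

```c
#include <stdint.h>

/* Hypothetical names; only the bit layout is taken from the SDM. */
#define HINT_SUBSTATE_BITS	4u
#define HINT_FIELD_MASK		0xfu

/* Raw C-state field, EAX bits 7:4. */
static inline unsigned int hint_cstate_field(uint32_t eax)
{
	return (eax >> HINT_SUBSTATE_BITS) & HINT_FIELD_MASK;
}

/* Raw sub C-state field, EAX bits 3:0. */
static inline unsigned int hint_substate_field(uint32_t eax)
{
	return eax & HINT_FIELD_MASK;
}

/*
 * SDM-numbered target C-state: field value 0 means C1, 1 means C2, ...
 * and 0xf means C0. Returns N for "CN".
 */
static inline unsigned int hint_sdm_cstate(uint32_t eax)
{
	unsigned int field = hint_cstate_field(eax);

	return field == HINT_FIELD_MASK ? 0 : field + 1;
}
```

With eax=0x20 the C-state field is 2 and the sub C-state field is 0, i.e. the
third MWAIT C-state in SDM numbering. That may be where the C3 vs. C6 naming
difference in this thread comes from, though that is only a guess. Either way,
per the SDM text quoted above the hint is an upper bound on depth, not a
guarantee that the state will be entered.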