Ping Paolo & Sean,

Do you have any comments? Or do you think ITD virtualization is
appropriate to discuss at PUCK?

Thanks,
Zhao

On Sat, Feb 03, 2024 at 05:11:48PM +0800, Zhao Liu wrote:
> Date: Sat, 3 Feb 2024 17:11:48 +0800
> From: Zhao Liu <zhao1.liu@xxxxxxxxxxxxxxx>
> Subject: [RFC 00/26] Intel Thread Director Virtualization
> X-Mailer: git-send-email 2.34.1
>
> From: Zhao Liu <zhao1.liu@xxxxxxxxx>
>
> Hi list,
>
> This is our RFC to virtualize the Intel Thread Director (ITD) feature
> for Guest, which is based on Ricardo's patch series about ITD related
> support in HFI driver ("[PATCH 0/9] thermal: intel: hfi: Prework for the
> virtualization of HFI" [1]).
>
> In short, the purpose of this patch set is to enable the ITD-based
> scheduling logic in Guest so that Guest can better schedule Guest tasks
> on Intel hybrid platforms.
>
> Currently, ITD is necessary for Windows VMs. Based on ITD virtualization
> support, the Windows 11 Guest could have significant performance
> improvement (for example, on i9-13900K, up to 14%+ improvement on
> 3DMARK).
>
> Our ITD virtualization is not bound to VMs' hybrid topology or vCPUs'
> CPU affinity. However, in our practice, the ITD scheduling optimization
> for win11 VMs works best when combined with hybrid topology and CPU
> affinity (this is related to the specific implementation of Win11
> scheduling). For more details, please see Section 1.2 "About hybrid
> topology and vCPU pinning".
>
> To enable ITD related scheduling optimization in Win11 VM, some other
> thermal related support is also needed (HWP, CPPC), but we could emulate
> it with dummy values in the VMM (we'll also be sending out extra patches
> in the future for these).
>
> Welcome your feedback!
>
>
> 1. Background and Motivation
> ============================
>
> 1.1. Background
> ^^^^^^^^^^^^^^^
>
> We have the use case to run games in the client Windows VM as the cloud
> gaming solution.
>
> Gaming VMs are performance-sensitive VMs on Client, so they usually
> have two characteristics to ensure interactivity and performance:
>
> i) The number of vCPUs is equal to or close to the number of Host pCPUs.
>
> ii) The vCPUs of a Gaming VM are often bound to the pCPUs to achieve
> exclusive resources and avoid the overhead of migration.
>
> In this case, Host can't provide effective scheduling for Guest, so we
> need to deliver more hardware-assisted scheduling capabilities to Guest
> to enhance Guest's scheduling.
>
> Windows 11 (and future Windows products) is heavily optimized for the
> Intel hybrid platform. To get the best performance, we need to
> virtualize hybrid scheduling features (HFI/ITD) for Windows Guest.
>
>
> 1.2. About hybrid topology and vCPU pinning
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Our ITD virtualization can support most vCPU topologies (except multiple
> packages/dies, see details in 3.5 Restrictions on Guest Topology), and
> can also support the case of non-pinning vCPUs (i.e. it can handle vCPU
> thread migration).
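
As a rough illustration of what handling vCPU thread migration involves (the mechanism itself is described in sections 4.2 and 4.3), the toy Python model below keeps one row per HFI table index and refreshes only the migrated vCPU's row from its new pCPU. It is a sketch under stated assumptions, not the KVM implementation: the class name, its methods, and the capability numbers are all hypothetical and do not come from the patches.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualHfiTable:
    """Toy per-VM table with one row per HFI table index (hypothetical model)."""
    host_rows: dict                                # pCPU id -> (perf, efficiency)
    rows: dict = field(default_factory=dict)       # HFI index -> (perf, efficiency)
    index_of: dict = field(default_factory=dict)   # vCPU id -> HFI table index
    pcpu_of: dict = field(default_factory=dict)    # vCPU id -> current pCPU id

    def add_vcpu(self, vcpu, hfi_index, pcpu):
        # The index is chosen by the VMM; the row is filled from the current pCPU.
        self.index_of[vcpu] = hfi_index
        self.on_migrate(vcpu, pcpu)

    def on_migrate(self, vcpu, new_pcpu):
        # Refresh only this vCPU's row; all other rows are left untouched.
        self.pcpu_of[vcpu] = new_pcpu
        self.rows[self.index_of[vcpu]] = self.host_rows[new_pcpu]

    def on_host_update(self, new_host_rows):
        # The host table changed: refresh every row from the pCPUs in use.
        self.host_rows = new_host_rows
        for vcpu, pcpu in self.pcpu_of.items():
            self.rows[self.index_of[vcpu]] = new_host_rows[pcpu]


# pCPU 0 is modelled as a P-core and pCPU 8 as an E-core; the numbers are made up.
host = {0: (255, 120), 8: (140, 255)}
table = VirtualHfiTable(host_rows=host)
table.add_vcpu(vcpu=0, hfi_index=0, pcpu=0)
table.add_vcpu(vcpu=1, hfi_index=1, pcpu=8)
table.on_migrate(vcpu=1, new_pcpu=0)   # vCPU 1 moves from the E-core to the P-core
```

After the last call, the row at index 1 carries the data of pCPU 0 while the row at index 0 was never rewritten, which is the single-entry update that avoids traversing the whole table.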
>
> The following is our performance measurement on an i9-13900K machine
> (2995 MHz, 24 cores, 32 threads (8+16), RAM: 14GB (16GB physical)), with
> iGPU passthrough, running 3DMARK in Win11 Professional Guest:
>
>
> compared with smp topo case        smp topo     smp topo     smp topo  hybrid topo  hybrid topo  hybrid topo  hybrid topo
>                                  + affinity        + ITD        + ITD                + affinity        + ITD        + ITD
>                                                            + affinity                                          + affinity
> Time Spy - Overall                   0.179%      -0.250%       0.179%      -0.107%       0.143%      -0.179%      -0.107%
>   Graphics score                     0.124%      -0.249%       0.124%      -0.083%       0.124%      -0.166%      -0.249%
>   CPU score                          0.916%      -0.485%       1.149%      -0.076%       0.722%      -0.324%      11.915%
> Fire Strike Extreme - Overall        0.149%       0.000%       0.224%      -1.021%      -3.361%      -1.319%      -3.361%
>   Graphics score                     0.100%       0.050%       0.150%      -1.376%      -3.427%      -1.676%      -3.652%
>   Physics score                      5.060%       0.759%       0.518%      -2.907%     -10.914%      -0.897%      14.638%
>   Combined score                     0.120%      -0.179%       0.418%       0.060%      -2.929%      -0.179%      -2.809%
> Fire Strike - Overall                0.350%      -0.085%       0.193%      -1.377%      -1.365%      -1.509%      -1.787%
>   Graphics score                     0.256%      -0.047%       0.210%      -1.527%      -1.376%      -1.504%      -2.320%
>   Physics score                      3.695%      -2.180%       0.629%      -1.581%      -6.846%      -1.444%      14.100%
>   Combined score                     0.415%      -0.128%       0.128%      -0.957%      -1.052%      -1.594%      -0.957%
> CPU Profile Max Threads              1.836%       0.298%       1.786%      -0.069%       1.545%       0.025%       9.472%
>   16 Threads                         4.290%       0.989%       3.588%       0.595%       1.580%       0.848%      11.295%
>   8 Threads                        -22.632%      -0.602%     -23.167%      -0.988%      -1.345%      -1.340%       8.648%
>   4 Threads                        -21.598%       0.449%     -21.429%      -0.817%       1.951%      -0.832%       2.084%
>   2 Threads                        -12.912%      -0.014%     -12.006%      -0.481%      -0.609%      -0.595%       1.161%
>   1 Thread                          -3.793%      -0.137%      -3.793%      -0.495%      -3.189%      -0.495%       1.154%
>
>
> Based on the above results, we find that exposing only HFI/ITD to win11
> VMs without hybrid topology or CPU affinity (case "smp topo + ITD")
> won't hurt performance, but also won't bring any performance
> improvement.
>
> With both hybrid topology and CPU affinity set for ITD, win11 VMs get a
> significant performance improvement (up to 14%+, compared with the case
> of smp topology without CPU affinity).
>
> This shows not only in the 3DMARK scores: in practice, there is also a
> significant improvement in the frame rate of games.
>
> Also, the more powerful the machine, the more significant the
> performance gains!
>
> Therefore, the best practice for enabling ITD scheduling optimization
> is to set up both CPU affinity and hybrid topology for win11 Guest while
> enabling our ITD virtualization.
>
> Our earlier QEMU prototype RFC [2] presented the initial hybrid
> topology support for VMs. Our other proposal, "QOM topology" [3], has
> since been posted to the QEMU community; it is the first step towards
> a hybrid topology implementation based on the QOM approach.
>
>
> 2. Introduction of HFI and ITD
> ==============================
>
> Intel provides the Hardware Feedback Interface (HFI) feature to allow
> hardware to provide guidance to the OS scheduler to perform optimal
> workload scheduling through a hardware feedback interface structure in
> memory [4]. This HFI structure is called the HFI table.
>
> For now, the guidance includes performance and energy efficiency
> hints, and it can be updated via thermal interrupt as the actual
> operating conditions of the processor change during run time.
>
> The Intel Thread Director (ITD) feature extends HFI to provide
> performance and energy efficiency data for advanced classes of
> instructions.
>
> Since ITD is an extension of HFI, our ITD virtualization also
> virtualizes the native HFI feature.
>
>
> 3. Dependencies of ITD
> ======================
>
> ITD is a thermal feature that requires:
> * PTM (Package Thermal Management, alias, PTS)
> * HFI (Hardware Feedback Interface)
>
> In order to support the notification mechanism of ITD/HFI dynamic
> update, we also need to add thermal interrupt related support,
> including the following two features:
> * ACPI (Thermal Monitor and Software Controlled Clock Facilities)
> * TM (Thermal Monitor, alias, TM1/ACC)
>
> Therefore, we must also consider support for the emulation of all
> the above dependencies.
>
>
> 3.1. ACPI emulation
> ^^^^^^^^^^^^^^^^^^^
>
> For ACPI, we can support it by emulating the RDMSR/WRMSR of the
> associated MSRs and adding the ability to inject thermal interrupts.
> But in fact, we don't really inject thermal interrupts into Guest for
> the thermal conditions corresponding to ACPI. Here the thermal interrupt
> is prepared for the subsequent HFI/ITD.
>
>
> 3.2. TM emulation
> ^^^^^^^^^^^^^^^^^
>
> TM is a hardware feature and its CPUID bit only indicates the presence
> of the automatic thermal monitoring facilities. For TM, there's no
> interactive interface between OS and hardware, but its flag is one of
> the prerequisites for the OS to enable thermal interrupt.
>
> Therefore, to support TM, it is enough for us to expose its CPUID
> flag to Guest.
>
>
> 3.3. PTM emulation
> ^^^^^^^^^^^^^^^^^^
>
> PTM is a package-scope feature that includes package-level MSRs and a
> package-level thermal interrupt. Unfortunately, KVM currently only
> supports thread-scope MSR handling, and is also not aware of the
> Guest's specific topology.
>
> But considering that our purpose of supporting PTM in KVM is to further
> support ITD, and that the current platforms with ITD all have 1 package,
> we emulate the package-scope MSRs provided by PTM at the VM level.
>
> In this way, the VMM is required to set only one package topology for
> the PTM.
> In order to alleviate this limitation, we only expose the PTM
> feature bit to Guest when ITD needs to be supported.
>
>
> 3.4. HFI emulation
> ^^^^^^^^^^^^^^^^^^
>
> ITD is the extension of HFI, so both HFI and ITD depend on the HFI
> table. HFI itself is used on the Host for power-related management
> control, so we should only expose HFI to Guest when we need to enable
> ITD.
>
> HFI also relies on PTM interrupt control, so it also has requirements
> for package topology, and we also emulate HFI (including ITD) at the VM
> level.
>
> In addition, because the HFI driver allocates HFI instances per die,
> HFI (and ITD) also require the Guest to be configured with only one
> die.
>
>
> 3.5. Restrictions on Guest Topology
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Due to KVM's incomplete support for MSR topology and the requirement for
> HFI instance management in the kernel, PTM, HFI, and ITD limit the
> topology of the Guest (mainly restricting the topology types created on
> the VMM side).
>
> Therefore, we only expose PTM, HFI, and ITD to userspace when we need to
> support ITD. At the same time, considering that currently, ITD is only
> used on the client platform with 1 package and 1 die, such temporary
> restrictions will not have too much impact.
>
>
> 4. Overview of ITD (and HFI) virtualization
> ===========================================
>
> The main tasks of ITD (including HFI) virtualization are:
> * maintain a virtual HFI table for VM.
> * inject thermal interrupt when HFI table updates.
> * handle related MSRs' emulation and adjust HFI table based on MSR's
>   control bits.
> * expose ITD/HFI configuration info in related CPUID leaves.
>
> The most important of these is the maintenance of the virtual HFI table.
> Although the HFI table should also be per package, since ITD/HFI related
> MSRs are treated as per VM in KVM, we also treat the virtual HFI table
> as per VM.
>
>
> 4.1. HFI table building
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> The HFI table contains a table header and many table entries. Each table
> entry is identified by an HFI table index, and each CPU corresponds to
> one of the HFI table indexes.
>
> ITD and HFI features both depend on the HFI table, but their HFI tables
> are a little different. The HFI table provided by the ITD feature has
> more classes (in terms of more columns in the table) than the HFI table
> of the native HFI feature.
>
> The virtual HFI table in KVM is built based on the actual HFI table,
> which is maintained by the HFI instance in the HFI driver. We extract
> the HFI data of the pCPUs that the vCPUs are running on to form the
> virtual HFI table.
>
>
> 4.2. HFI table index
> ^^^^^^^^^^^^^^^^^^^^
>
> There are many entries in the HFI table, and the vCPU will be assigned
> an HFI table index to specify the entry it maps. KVM will fill the
> pCPU's HFI data (the pCPU that vCPU is running on) into the entry
> corresponding to the HFI table index of the vCPU in the virtual HFI
> table.
>
> This index is set by VMM in CPUID.
>
>
> 4.3. HFI table updating
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> On some platforms, the HFI table will be dynamically updated with
> thermal interrupts. In order to update the virtual HFI table in time, we
> added the per-VM notifier to the HFI driver to notify KVM to update the
> virtual HFI table for the VM, and then inject thermal interrupt into the
> VM to notify the Guest.
>
> There is another case that needs to update the virtual HFI table: when
> a vCPU is migrated, the pCPU it runs on changes, and the corresponding
> virtual HFI data should also be updated to the new pCPU's data. In this
> case, in order to reduce overhead, we only need to update the data of
> that single vCPU without traversing the entire virtual HFI table.
>
>
> 5. Patch Summary
> ================
>
> Patch 01-03: Prepare the bit definition, the hfi helpers and hfi data
>              structures that KVM needs.
> Patch 04-05: Add the sched_out arch hook and reset the classification
>              history at sched_in()/sched_out().
> Patch 06-10: Add emulations of ACPI, TM and PTM, mainly about CPUID and
>              related MSRs.
> Patch 11-20: Add the emulation support for HFI, including maintaining
>              the HFI table for VM.
> Patch 21-23: Add the emulation support for ITD, including extending HFI
>              to ITD and passing through the classification MSRs.
> Patch 24-25: Add HRESET emulation support, which is also used by IPC
>              classes feature.
> Patch 26:    Add the brief doc about the per-VM lock - pkg_therm_lock.
>
>
> 6. References
> =============
>
> [1]: [PATCH 0/9] thermal: intel: hfi: Prework for the virtualization of HFI
>      https://lore.kernel.org/lkml/20240203040515.23947-1-ricardo.neri-calderon@xxxxxxxxxxxxxxx/
> [2]: [RFC 00/52] Introduce hybrid CPU topology,
>      https://lore.kernel.org/qemu-devel/20230213095035.158240-1-zhao1.liu@xxxxxxxxxxxxxxx/
> [3]: [RFC 00/41] qom-topo: Abstract Everything about CPU Topology,
>      https://lore.kernel.org/qemu-devel/20231130144203.2307629-1-zhao1.liu@xxxxxxxxxxxxxxx/
> [4]: SDM, vol. 3B, section 15.6 HARDWARE FEEDBACK INTERFACE AND INTEL
>      THREAD DIRECTOR
>
>
> Thanks and Best Regards,
> Zhao
> ---
> Zhao Liu (17):
>   thermal: Add bit definition for x86 thermal related MSRs
>   KVM: Add kvm_arch_sched_out() hook
>   KVM: x86: Reset hardware history at vCPU's sched_in/out
>   KVM: VMX: Add helpers to handle the writes to MSR's R/O and R/WC0 bits
>   KVM: x86: cpuid: Define CPUID 0x06.eax by kvm_cpu_cap_mask()
>   KVM: VMX: Introduce HFI description structure
>   KVM: VMX: Introduce HFI table index for vCPU
>   KVM: x86: Introduce the HFI dynamic update request and kvm_x86_ops
>   KVM: VMX: Allow to inject thermal interrupt without HFI update
>   KVM: VMX: Emulate HFI related bits in package thermal MSRs
>   KVM: VMX: Emulate the MSRs of HFI feature
>   KVM: x86: Expose HFI feature bit and HFI info in CPUID
>   KVM: VMX: Extend HFI table and MSR emulation to support ITD
>   KVM: VMX: Pass through ITD classification related MSRs to Guest
>   KVM: x86: Expose ITD feature bit and related info in CPUID
>   KVM: VMX: Emulate the MSR of HRESET feature
>   Documentation: KVM: Add description of pkg_therm_lock
>
> Zhuocheng Ding (9):
>   thermal: intel: hfi: Add helpers to build HFI/ITD structures
>   thermal: intel: hfi: Add HFI notifier helpers to notify HFI update
>   KVM: VMX: Emulate ACPI (CPUID.0x01.edx[bit 22]) feature
>   KVM: x86: Expose TM/ACC (CPUID.0x01.edx[bit 29]) feature bit to VM
>   KVM: VMX: Emulate PTM/PTS (CPUID.0x06.eax[bit 6]) feature
>   KVM: VMX: Support virtual HFI table for VM
>   KVM: VMX: Sync update of Host HFI table to Guest
>   KVM: VMX: Update HFI table when vCPU migrates
>   KVM: x86: Expose HRESET feature's CPUID to Guest
>
>  Documentation/virt/kvm/locking.rst  |  13 +-
>  arch/arm64/include/asm/kvm_host.h   |   1 +
>  arch/mips/include/asm/kvm_host.h    |   1 +
>  arch/powerpc/include/asm/kvm_host.h |   1 +
>  arch/riscv/include/asm/kvm_host.h   |   1 +
>  arch/s390/include/asm/kvm_host.h    |   1 +
>  arch/x86/include/asm/hfi.h          |  28 ++
>  arch/x86/include/asm/kvm-x86-ops.h  |   3 +-
>  arch/x86/include/asm/kvm_host.h     |   2 +
>  arch/x86/include/asm/msr-index.h    |  54 +-
>  arch/x86/kvm/cpuid.c                | 201 +++++++-
>  arch/x86/kvm/irq.h                  |   1 +
>  arch/x86/kvm/lapic.c                |   9 +
>  arch/x86/kvm/svm/svm.c              |   8 +
>  arch/x86/kvm/vmx/vmx.c              | 751 +++++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/vmx.h              |  79 ++-
>  arch/x86/kvm/x86.c                  |  18 +
>  drivers/thermal/intel/intel_hfi.c   | 212 +++++++-
>  drivers/thermal/intel/therm_throt.c |   1 -
>  include/linux/kvm_host.h            |   1 +
>  virt/kvm/kvm_main.c                 |   1 +
>  21 files changed, 1343 insertions(+), 44 deletions(-)
>
> --
> 2.34.1
>