On 2020/1/15 22:14, Marc Zyngier wrote:
> On 2020-01-13 12:12, Will Deacon wrote:
>> [+PeterZ]
>>
>> On Thu, Dec 26, 2019 at 09:58:27PM +0800, Zengruan Ye wrote:
>>> This patch set aims to support the vcpu_is_preempted() functionality
>>> under KVM/arm64, which allows the guest to check whether a VCPU is
>>> currently running or not. This will enhance lock performance on
>>> overcommitted hosts (more runnable VCPUs than physical CPUs in the
>>> system) as doing busy waits for preempted VCPUs will hurt system
>>> performance far worse than early yielding.
>>>
>>> We have observed some performance improvements in unix benchmark tests.
>>>
>>> unix benchmark result:
>>>   host:  kernel 5.5.0-rc1, HiSilicon Kunpeng920, 8 CPUs
>>>   guest: kernel 5.5.0-rc1, 16 VCPUs
>>>
>>> test-case                               | after-patch       | before-patch
>>> ----------------------------------------+-------------------+------------------
>>> Dhrystone 2 using register variables    | 334600751.0 lps   | 335319028.3 lps
>>> Double-Precision Whetstone              | 32856.1 MWIPS     | 32849.6 MWIPS
>>> Execl Throughput                        | 3662.1 lps        | 2718.0 lps
>>> File Copy 1024 bufsize 2000 maxblocks   | 432906.4 KBps     | 158011.8 KBps
>>> File Copy 256 bufsize 500 maxblocks     | 116023.0 KBps     | 37664.0 KBps
>>> File Copy 4096 bufsize 8000 maxblocks   | 1432769.8 KBps    | 441108.8 KBps
>>> Pipe Throughput                         | 6405029.6 lps     | 6021457.6 lps
>>> Pipe-based Context Switching            | 185872.7 lps      | 184255.3 lps
>>> Process Creation                        | 4025.7 lps        | 3706.6 lps
>>> Shell Scripts (1 concurrent)            | 6745.6 lpm        | 6436.1 lpm
>>> Shell Scripts (8 concurrent)            | 998.7 lpm         | 931.1 lpm
>>> System Call Overhead                    | 3913363.1 lps     | 3883287.8 lps
>>> ----------------------------------------+-------------------+------------------
>>> System Benchmarks Index Score           | 1835.1            | 1327.6
>>
>> Interesting, thanks for the numbers.
>>
>> So it looks like there is a decent improvement to be had from targeted
>> vCPU wakeup, but I really dislike the explicit PV interface and it's
>> already been shown to interact badly with the WFE-based polling in
>> smp_cond_load_*().
>>
>> Rather than expose a divergent interface, I would instead like to
>> explore an improvement to smp_cond_load_*() and see how that performs
>> before we commit to something more intrusive. Marc and I looked at this
>> very briefly in the past, and the basic idea is to register all of the
>> WFE sites with the hypervisor, indicating which register contains the
>> address being spun on and which register contains the "bad" value. That
>> way, you don't bother rescheduling a vCPU if the value at the address is
>> still bad, because you know it will exit immediately.
>>
>> Of course, the devil is in the details because when I say "address",
>> that's a guest virtual address, so you need to play some tricks in the
>> hypervisor so that you have a separate mapping for the lockword (it's
>> enough to keep track of the physical address).
>>
>> Our hacks are here but we basically ran out of time to work on them
>> beyond an unoptimised and hacky prototype:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy
>>
>> Marc -- how would you prefer to handle this?
>
> Let me try and rebase this thing to a modern kernel (I doubt it applies
> without conflicts to mainline). We can then have a discussion about its
> merit on the list once I post it. It'd be good to have a pointer to the
> benchmarks that have been used here.
>
> Thanks,
>
>         M.

Hi Marc, Will,

My apologies for the slow reply.

Just checking: what is the latest on this PV cond yield prototype?
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy

The following are the unixbench test results of PV cond yield prototype:

unix benchmark result:
  host:  kernel 5.10.0-rc6, HiSilicon Kunpeng920, 8 CPUs
  guest: kernel 5.10.0-rc6, 16 VCPUs

                                       | 5.10.0-rc6 | pv_cond_yield | vcpu_is_preempted
System Benchmarks Index Values         | INDEX      | INDEX         | INDEX
---------------------------------------+------------+---------------+-------------------
Dhrystone 2 using register variables   | 29164.0    | 29156.9       | 29207.2
Double-Precision Whetstone             | 6807.6     | 6789.2        | 6912.1
Execl Throughput                       | 856.7      | 1195.6        | 863.1
File Copy 1024 bufsize 2000 maxblocks  | 189.9      | 923.5         | 1094.2
File Copy 256 bufsize 500 maxblocks    | 121.9      | 578.4         | 588.7
File Copy 4096 bufsize 8000 maxblocks  | 419.9      | 1992.0        | 2733.7
Pipe Throughput                        | 6727.2     | 6670.2        | 6743.2
Pipe-based Context Switching           | 486.9      | 547.0         | 471.9
Process Creation                       | 353.4      | 345.1         | 338.5
Shell Scripts (1 concurrent)           | 3187.2     | 1432.2        | 2798.7
Shell Scripts (8 concurrent)           | 3410.5     | 1360.1        | 2672.9
System Call Overhead                   | 2967.0     | 3273.9        | 3497.9
---------------------------------------+------------+---------------+-------------------
System Benchmarks Index Score          | 1410.0     | 1885.8        | 2128.5

Thanks,

Zengruan
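
To make the three columns above easier to compare, the relative change of
each variant against the 5.10.0-rc6 baseline can be worked out with a
small script. The following is only a sketch: the index values are copied
from a subset of the rows above, and the names RESULTS, relative_change()
and summarize() are made up for this illustration, not part of unixbench
or of either patch set.

```python
# Index values copied from the unixbench table above (subset of rows).
# Tuple order: (5.10.0-rc6 baseline, pv_cond_yield, vcpu_is_preempted).
RESULTS = {
    "Execl Throughput": (856.7, 1195.6, 863.1),
    "File Copy 1024 bufsize 2000 maxblocks": (189.9, 923.5, 1094.2),
    "Shell Scripts (1 concurrent)": (3187.2, 1432.2, 2798.7),
    "Shell Scripts (8 concurrent)": (3410.5, 1360.1, 2672.9),
    "System Benchmarks Index Score": (1410.0, 1885.8, 2128.5),
}


def relative_change(baseline, value):
    """Return the change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def summarize(results):
    """Map each test name to (pv_cond_yield %, vcpu_is_preempted %)."""
    summary = {}
    for name, (base, pv_cond_yield, vcpu_is_preempted) in results.items():
        summary[name] = (
            round(relative_change(base, pv_cond_yield), 1),
            round(relative_change(base, vcpu_is_preempted), 1),
        )
    return summary


if __name__ == "__main__":
    for name, (pv, vp) in summarize(RESULTS).items():
        print(f"{name:38s} pv_cond_yield {pv:+7.1f}%   "
              f"vcpu_is_preempted {vp:+7.1f}%")
```

On these numbers the overall index score is roughly 34% above baseline
with pv_cond_yield and roughly 51% above baseline with vcpu_is_preempted,
while both Shell Scripts rows come out below baseline in both variants.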