2018-01-22 20:53 GMT+08:00 Peter Zijlstra <peterz@xxxxxxxxxxxxx>:
> On Mon, Jan 22, 2018 at 07:47:45PM +0800, Wanpeng Li wrote:
>> Hi all,
>>
>> We can observe that unixbench context switch performance is heavily
>> influenced by the cpu topology which is exposed to the guest. The
>> scores are posted below; bigger is better. Both the guest and the
>> host kernel are 4.15-rc3 (we can also reproduce it against a centos
>> 7.4 693 guest/host), the LLC is exposed to the guest, kvm adaptive
>> halt-polling is enabled by default, and then a guest w/ 8 logical
>> cpus is started.
>>
>> unixbench context switch
>> -smp 8, sockets=8, cores=1, threads=1                               382036
>> -smp 8, sockets=4, cores=2, threads=1                               132480
>> -smp 8, sockets=2, cores=4, threads=1                               128032
>> -smp 8, sockets=2, cores=2, threads=2                               131767
>> -smp 8, sockets=1, cores=4, threads=2                               132742
>> -smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)  331471
>>
>> I can observe a lot of reschedule IPIs sent from one vCPU to
>> another. The context switch workload switches between running and
>> idle frequently, which ends up executing the HLT instruction in the
>> idle path. I use idle=poll to avoid the vmexit due to HLT and to
>> avoid the reschedule IPIs, since the polling idle task checks the
>> TIF_NEED_RESCHED flag in a loop; nohz=off stops programming the
>> lapic timer and the other nohz work. Any idea why sockets=8 gets
>> the best performance?
>
> I suspect because we load-balance less aggressively across nodes than we
> do within a cache domain.

That's true. After taking a closer look with kernelshark, the context1
task in the guest is migrated to another logical cpu after several
milliseconds for sockets=1,cores=4,threads=2; however, it can stay on
one logical cpu for several seconds for sockets=8,cores=1,threads=1
before migrating to another one.

> Fix your benchmark to pin itself to a single CPU, that's the only
> sensible way to obtain this number in any case.

Yeah, that setup gets good performance. Actually, kernelshark shows
that the two context1 tasks do not stack up on one logical cpu most of
the time, contrary to Mike's reply. In addition, I can observe that
the total number of RESCHED IPIs in the guest for
sockets=1,cores=4,threads=2 is 4.5 times that for
sockets=8,cores=1,threads=1. Any idea how this can happen? I suspect
the TTWU path selects another idle logical cpu, which makes a RESCHED
IPI unavoidable. However, there is still no performance benefit after
I clear SD_BALANCE_WAKE for the relevant sched_domains.
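For completeness, this is roughly how I pinned the benchmark for the
test above (a minimal sketch using sched_setaffinity(2); pinning to
logical cpu 0 is just an example, any single cpu will do):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	cpu_set_t set;

	/*
	 * Pin the calling task (pid 0) to logical cpu 0 so the
	 * scheduler can never migrate it to another cpu.
	 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* ... run the context switch loop here ... */
	return 0;
}

The same effect can be had without modifying the benchmark, e.g. run
it under "taskset -c 0" from the command line.

Regards,
Wanpeng Li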