Hi all,

We can observe that UnixBench context switch performance is heavily
influenced by the CPU topology exposed to the guest. The scores are
below (bigger is better). Both the guest and host kernels are 4.15-rc3
(we can also reproduce this with a CentOS 7.4 (kernel 3.10.0-693)
guest/host), the LLC is exposed to the guest, and KVM adaptive
halt-polling is enabled by default. The guest is started with 8
logical CPUs.

unixbench context switch
-smp 8, sockets=8, cores=1, threads=1                                382036
-smp 8, sockets=4, cores=2, threads=1                                132480
-smp 8, sockets=2, cores=4, threads=1                                128032
-smp 8, sockets=2, cores=2, threads=2                                131767
-smp 8, sockets=1, cores=4, threads=2                                132742
-smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)  331471

I can observe a lot of reschedule IPIs being sent from one vCPU to
another; the context switch workload switches between running and idle
so frequently that the idle path keeps executing HLT. I use idle=poll
to avoid the vmexits due to HLT and to avoid the reschedule IPIs, since
the polling idle task just checks the TIF_NEED_RESCHED flag in a loop,
and nohz=off to stop the lapic timer reprogramming and the other nohz
work on idle entry/exit.

Any idea why sockets=8 gets the best performance?

Regards,
Wanpeng Li
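
P.S. In case it helps with reproducing, here is a rough sketch of the
setup described above (illustrative only: disk/network options are
omitted, <qemu_pid> is a placeholder, and the exact UnixBench
invocation may differ depending on your copy of the suite):

  # start the guest with the topology under test
  # (plus whatever else is needed, e.g. to expose the LLC; elided here)
  qemu-system-x86_64 -enable-kvm -cpu host -m 4096 \
      -smp 8,sockets=8,cores=1,threads=1 ...

  # inside the guest: pipe-based context switching test from UnixBench
  ./Run -c 8 context1

  # inside the guest: watch the rescheduling IPI counters climb
  watch -n1 'grep RES /proc/interrupts'

  # on the host: count vmexits (HLT among them) for the qemu process
  perf kvm stat record -p <qemu_pid>
  perf kvm stat report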