On 2017/8/29 19:58, Alexander Graf wrote:
On 08/29/2017 01:46 PM, Yang Zhang wrote:
Some latency-intensive workloads see an obvious performance
drop when running inside a VM. The main reason is that the overhead
is amplified when running inside a VM. The highest cost I have seen is
in the idle path.
This patch introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll, we
don't need to go through the heavy overhead path.
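
As a rough illustration of the poll-before-idle idea described above, here is a user-space sketch in Python. It is not the kernel code from the patch: `need_resched`, `enter_idle` and the injectable `clock` are hypothetical stand-ins, and `poll_threshold_ns` is assumed to play the role of halt_poll_threshold in nanoseconds.

```python
import time


def poll_then_idle(need_resched, enter_idle, poll_threshold_ns,
                   clock=time.monotonic_ns):
    """Spin for up to poll_threshold_ns, then fall back to enter_idle().

    Returns "polled" if work showed up during the poll window,
    otherwise "idled" after taking the expensive idle path.
    """
    deadline = clock() + poll_threshold_ns
    while clock() < deadline:
        if need_resched():
            # Work arrived while polling: skip idle entry/exit entirely.
            return "polled"
    # Nothing arrived in time: take the heavy path (in a guest this
    # would typically be a halt that traps to the hypervisor).
    enter_idle()
    return "idled"
```

The trade-off is the one the benchmark numbers show: a longer poll window avoids more idle entries at the price of spinning on the CPU.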
Here is the data we got when running the contextswitch benchmark to
measure latency (lower is better):
1. w/o patch:
2493.14 ns/ctxsw -- 200.3 %CPU
2. w/ patch:
halt_poll_threshold=10000 -- 1485.96 ns/ctxsw -- 201.0 %CPU
halt_poll_threshold=20000 -- 1391.26 ns/ctxsw -- 200.7 %CPU
halt_poll_threshold=30000 -- 1488.55 ns/ctxsw -- 200.1 %CPU
halt_poll_threshold=500000 -- 1159.14 ns/ctxsw -- 201.5 %CPU
3. kvm dynamic poll
halt_poll_ns=10000 -- 2296.11 ns/ctxsw -- 201.2 %CPU
halt_poll_ns=20000 -- 2599.7 ns/ctxsw -- 201.7 %CPU
halt_poll_ns=30000 -- 2588.68 ns/ctxsw -- 211.6 %CPU
halt_poll_ns=500000 -- 2423.20 ns/ctxsw -- 229.2 %CPU
4. idle=poll
2050.1 ns/ctxsw -- 1003 %CPU
5. idle=mwait
2188.06 ns/ctxsw -- 206.3 %CPU
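
To put the numbers above in relative terms, here is a small Python sketch that computes each configuration's latency change against the unpatched baseline. The latencies are copied from the list above; the dictionary labels and the helper name `reduction_pct` are only for this illustration.

```python
BASELINE_NS = 2493.14  # "w/o patch" latency per context switch

# Latencies in ns/ctxsw, copied from the results above.
results_ns = {
    "halt_poll_threshold=10000": 1485.96,
    "halt_poll_threshold=20000": 1391.26,
    "halt_poll_threshold=30000": 1488.55,
    "halt_poll_threshold=500000": 1159.14,
    "halt_poll_ns=10000": 2296.11,
    "halt_poll_ns=20000": 2599.7,
    "halt_poll_ns=30000": 2588.68,
    "halt_poll_ns=500000": 2423.20,
    "idle=poll": 2050.1,
    "idle=mwait": 2188.06,
}


def reduction_pct(latency_ns, baseline_ns=BASELINE_NS):
    """Latency reduction relative to the baseline, in percent.

    Positive means faster than the baseline, negative means slower.
    """
    return (baseline_ns - latency_ns) / baseline_ns * 100.0


for name, latency in results_ns.items():
    print(f"{name:28s} {reduction_pct(latency):6.1f} %")
```

A negative value means the configuration was slower than the unpatched run.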
Could you please try to create another metric for guest initiated, host
aborted mwait?
For a quick benchmark, reserve 4 registers for a magic value, set them
to the magic value before you enter MWAIT in the guest. Then allow
native MWAIT execution on the host. If you see the guest wants to enter
I guess you mean allowing native MWAIT execution in the guest, not on the host?
with the 4 registers containing the magic contents and no events are
pending, directly go into the vcpu block function on the host.
Hmm, this is not very clear to me. If the guest executes MWAIT without a
vmexit, how do we check the registers?
That way any time a guest gets naturally aborted while in mwait, it will
only reenter mwait when an event actually occurred. While the guest is
normally running (and nobody else wants to run on the host), we just
stay in guest context, but with a sleeping CPU.
Overall, that might give us even better performance, as it allows for
turbo boost and HT to work properly.
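
One possible reading of the magic-register proposal, sketched in Python rather than as hypervisor code. Everything named here is an assumption for illustration: the marker value, the choice of four registers, and the returned action strings are hypothetical and not part of any existing KVM interface.

```python
MAGIC = 0x4D57414954  # hypothetical marker value (ASCII "MWAIT")
MAGIC_REGS = ("r12", "r13", "r14", "r15")  # assumed set of reserved registers


def on_guest_interrupted(regs, events_pending):
    """Decide what the host does after the guest was forced out.

    regs: mapping of register name to value at the time of the exit.
    events_pending: True if an event is queued for this vCPU.
    """
    # The guest is assumed to load MAGIC into all reserved registers
    # right before executing MWAIT natively.
    in_mwait = all(regs.get(r) == MAGIC for r in MAGIC_REGS)
    if in_mwait and not events_pending:
        # Guest was idle in MWAIT and has nothing to do: block the vCPU
        # on the host instead of re-entering the guest.
        return "block_vcpu"
    # Either the guest was doing real work or an event needs delivery.
    return "reenter_guest"
```

The check only runs when the guest has already exited for some other reason, so the MWAIT itself does not need to trap.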
In our testing, we have enough cores (32 cores) but only 10 vCPUs, so in
the best case we may see the same performance as poll.
--
Yang
Alibaba Cloud Computing