Re: [PATCH] KVM: x86: add full vm-exit reason debug entries

On Mon, Apr 22, 2019 at 09:24:59PM +0800, zhenwei pi wrote:
> We hit several typical cases of performance drop due to vm-exits:
> case 1, jemalloc calls madvise(addr, length, MADV_DONTNEED) in the guest;
> the resulting IPIs cause a lot of exits.
> case 2, atop collects IPC via perf hardware events in the guest; vPMU and
> rdpmc exits increase a lot.
> case 3, host memory compaction invalidates TDP mappings, and tdp_page_fault
> causes a huge loss of performance.
> case 4, web services (written in Go) call futex and show higher latency
> than in the host OS environment.
> 
> Add more vm-exit reason debug entries; they help identify the cause of a
> performance drop. In this patch:
> 1, add more vm-exit reasons.
> 2, add wrmsr details.
> 3, add CR details.
> 4, add hypercall details.
> 
> Currently the same result can also be achieved with bpf.
> Sample code (written by Fei Li <lifei.shirley@xxxxxxxxxxxxx>):
>   b = BPF(text="""
> 
>   struct kvm_msr_exit_info {
>       u32 pid;
>       u32 tgid;
>       u32 msr_exit_ct;
>   };
>   BPF_HASH(kvm_msr_exit, unsigned int, struct kvm_msr_exit_info, 1024);
> 
>   TRACEPOINT_PROBE(kvm, kvm_msr) {
>       /* key the per-MSR exit counts by ECX (the MSR index) */
>       u32 ct = args->ecx;
>       if (ct == 0xffffffff) {
>           return -1;
>       }
> 
>       u32 pid = bpf_get_current_pid_tgid() >> 32;
>       u32 tgid = bpf_get_current_pid_tgid();
> 
>       struct kvm_msr_exit_info *exit_info;
>       struct kvm_msr_exit_info init_exit_info = {};
>       exit_info = kvm_msr_exit.lookup(&ct);
>       if (exit_info != NULL) {
>           exit_info->pid = pid;
>           exit_info->tgid = tgid;
>           exit_info->msr_exit_ct++;
>       } else {
>           init_exit_info.pid = pid;
>           init_exit_info.tgid = tgid;
>           init_exit_info.msr_exit_ct = 1;
>           kvm_msr_exit.update(&ct, &init_exit_info);
>       }
>       return 0;
>   }
>   """)
> 
> Run the wrmsr(MSR_IA32_TSCDEADLINE, val) benchmark in the guest
> (CPU: Intel Gold 5118):
> case 1, no bpf on host: ~1127 cycles/wrmsr.
> case 2, sample bpf on host with JIT: ~1223 cycles/wrmsr.     -->  8.5% overhead
> case 3, sample bpf on host without JIT: ~1312 cycles/wrmsr.  --> 16.4% overhead
> 
> So debug entries are more efficient than the bpf approach.

How much does host performance matter?  E.g. does the high overhead interfere
with debugging, is this something you want to have running at all times, etc...

> 
> Signed-off-by: zhenwei pi <pizhenwei@xxxxxxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_host.h | 77 +++++++++++++++++++++++++++++++++++------
>  arch/x86/kvm/cpuid.c            |  1 +
>  arch/x86/kvm/lapic.c            | 17 +++++++++
>  arch/x86/kvm/pmu.c              |  1 +
>  arch/x86/kvm/vmx/vmx.c          | 26 ++++++++++++++
>  arch/x86/kvm/x86.c              | 60 ++++++++++++++++++++++++++++++++
>  6 files changed, 172 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a9d03af34030..e7d70ed046b2 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -945,21 +945,11 @@ struct kvm_vcpu_stat {
>  	u64 pf_guest;
>  	u64 tlb_flush;
>  	u64 invlpg;
> -
> -	u64 exits;
> -	u64 io_exits;
> -	u64 mmio_exits;
> -	u64 signal_exits;
> -	u64 irq_window_exits;
> -	u64 nmi_window_exits;
>  	u64 l1d_flush;
> -	u64 halt_exits;
>  	u64 halt_successful_poll;
>  	u64 halt_attempted_poll;
>  	u64 halt_poll_invalid;
>  	u64 halt_wakeup;
> -	u64 request_irq_exits;
> -	u64 irq_exits;
>  	u64 host_state_reload;
>  	u64 fpu_reload;
>  	u64 insn_emulation;
> @@ -968,6 +958,73 @@ struct kvm_vcpu_stat {
>  	u64 irq_injections;
>  	u64 nmi_injections;
>  	u64 req_event;
> +
> +	/* vm-exit reasons */
> +	u64 exits;
> +	u64 io_exits;
> +	u64 mmio_exits;
> +	u64 signal_exits;
> +	u64 irq_window_exits;
> +	u64 nmi_window_exits;
> +	u64 halt_exits;
> +	u64 request_irq_exits;
> +	u64 irq_exits;
> +	u64 exception_nmi_exits;
> +	u64 cr_exits;
> +	u64 dr_exits;
> +	u64 cpuid_exits;
> +	u64 rdpmc_exits;
> +	u64 update_ppr_exits;
> +	u64 rdmsr_exits;
> +	u64 wrmsr_exits;
> +	u64 apic_access_exits;
> +	u64 apic_write_exits;
> +	u64 apic_eoi_exits;
> +	u64 wbinvd_exits;
> +	u64 xsetbv_exits;
> +	u64 task_switch_exits;
> +	u64 ept_violation_exits;
> +	u64 pause_exits;
> +	u64 mwait_exits;
> +	u64 monitor_trap_exits;
> +	u64 monitor_exits;
> +	u64 pml_full_exits;
> +	u64 preemption_timer_exits;
> +	/* wrmsr & apic vm-exit reasons */
> +	u64 wrmsr_set_apic_base;
> +	u64 wrmsr_set_wall_clock;
> +	u64 wrmsr_set_system_time;
> +	u64 wrmsr_set_pmu;
> +	u64 lapic_set_tscdeadline;
> +	u64 lapic_set_tpr;
> +	u64 lapic_set_eoi;
> +	u64 lapic_set_ldr;
> +	u64 lapic_set_dfr;
> +	u64 lapic_set_spiv;
> +	u64 lapic_set_icr;
> +	u64 lapic_set_icr2;
> +	u64 lapic_set_lvt;
> +	u64 lapic_set_lvtt;
> +	u64 lapic_set_tmict;
> +	u64 lapic_set_tdcr;
> +	u64 lapic_set_esr;
> +	u64 lapic_set_self_ipi;
> +	/* cr vm-exit reasons */
> +	u64 cr_movetocr0;
> +	u64 cr_movetocr3;
> +	u64 cr_movetocr4;
> +	u64 cr_movetocr8;
> +	u64 cr_movefromcr3;
> +	u64 cr_movefromcr8;
> +	u64 cr_clts;
> +	u64 cr_lmsw;
> +	/* hypercall vm-exit reasons */
> +	u64 hypercall_vapic_poll_irq;
> +	u64 hypercall_kick_cpu;
> +#ifdef CONFIG_X86_64
> +	u64 hypercall_clock_pairing;
> +#endif
> +	u64 hypercall_send_ipi;
>  };

There are quite a few issues with this approach:

  - That's a lot of memory for stats that may never be consumed.

  - struct kvm_vcpu_stat is vendor agnostic, but the exits tracked are
    very much VMX centric.  And for many of the exits that do overlap with
    SVM, only the VMX paths are updated by your patch.

  - The granularity of what is tracked is very haphazard, e.g. CLTS and
    LMSW get their own entries, but NMIs and all exceptions get lumped
    into a single stat.

And so on and so forth.

Just spitballing, rather than trying to extend the existing stats
implementation, what if we go in the opposite direction and pare it down
as much as possible in favor of trace events?  I.e. improve KVM's trace
events so that userspace can use them to aggregate data for all vCPUs in
a VM, filter out specific MSRs, etc..., and add post-processing scripts
for common operations so that users don't need to reinvent the wheel
every time they want to collect certain information.
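
As a rough, hypothetical sketch of the kind of aggregation I have in mind
(assuming bcc and the existing kvm:kvm_exit tracepoint; the exit_reason field
name and the target-tgid filtering are my assumptions, verify them against
the tracepoint format on your kernel):

  import sys
  from time import sleep
  from bcc import BPF

  target_tgid = int(sys.argv[1])       # pid of the QEMU/kvm process to watch

  prog = """
  BPF_HASH(exit_count, u32, u64);

  TRACEPOINT_PROBE(kvm, kvm_exit) {
      u32 tgid = bpf_get_current_pid_tgid() >> 32;
      if (tgid != TARGET_TGID)         /* only this VM's vCPU threads */
          return 0;

      u32 reason = args->exit_reason;  /* basic exit reason from hardware */
      u64 one = 1, *count = exit_count.lookup(&reason);
      if (count)
          (*count)++;
      else
          exit_count.update(&reason, &one);
      return 0;
  }
  """

  b = BPF(text=prog.replace("TARGET_TGID", str(target_tgid)))
  sleep(10)                            # sample window
  for reason, count in sorted(b["exit_count"].items(),
                              key=lambda kv: kv[1].value, reverse=True):
      print("exit_reason %3d : %d" % (reason.value, count.value))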

Adding a tracepoint for things like fpu_reload or host_state_reload
is probably overkill, but just about every other x86 stat entry either
has an existing corresponding tracepoint or could provide much more
useful info if a tracepoint were added, e.g. to track which IRQ is
interrupting the guest or which I/O ports are causing exits.
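
For the I/O port case, for example, a per-port breakdown could be built on the
existing kvm:kvm_pio tracepoint, roughly like this (sketch only; the 'port' and
'rw' field names are what I remember from the tracepoint format and should be
double-checked):

  from time import sleep
  from bcc import BPF

  b = BPF(text="""
  struct pio_key {
      u32 port;    /* I/O port number */
      u32 write;   /* 0 = in, 1 = out */
  };
  BPF_HASH(pio_count, struct pio_key, u64);

  TRACEPOINT_PROBE(kvm, kvm_pio) {
      struct pio_key key = {};
      key.port = args->port;
      key.write = args->rw;

      u64 one = 1, *count = pio_count.lookup(&key);
      if (count)
          (*count)++;
      else
          pio_count.update(&key, &one);
      return 0;
  }
  """)

  sleep(10)
  for key, count in sorted(b["pio_count"].items(),
                           key=lambda kv: kv[1].value, reverse=True):
      print("port 0x%04x %-3s : %d" %
            (key.port, "out" if key.write else "in", count.value))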

The post-processing scripts might even be able to eliminate some of the
manual analysis, e.g. by collating stats based on guest rip to build a
picture of which piece of guest code is triggering which exits.
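
Something along these lines could do the rip collation (again just a sketch,
assuming kvm:kvm_exit exposes guest_rip and exit_reason; translating raw guest
addresses back into guest symbols would still be a separate post-processing
step):

  from time import sleep
  from bcc import BPF

  b = BPF(text="""
  struct rip_key {
      u64 guest_rip;      /* guest instruction pointer at exit */
      u64 exit_reason;    /* widened to u64 to avoid struct padding */
  };
  BPF_HASH(rip_count, struct rip_key, u64);

  TRACEPOINT_PROBE(kvm, kvm_exit) {
      struct rip_key key = {};
      key.guest_rip = args->guest_rip;
      key.exit_reason = args->exit_reason;

      u64 one = 1, *count = rip_count.lookup(&key);
      if (count)
          (*count)++;
      else
          rip_count.update(&key, &one);
      return 0;
  }
  """)

  sleep(10)
  # top 20 (guest rip, exit reason) pairs by exit count
  for key, count in sorted(b["rip_count"].items(),
                           key=lambda kv: kv[1].value, reverse=True)[:20]:
      print("guest rip 0x%016x exit %3d : %d" %
            (key.guest_rip, key.exit_reason, count.value))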


