> On Apr 23, 2019, at 04:54, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Mon, Apr 22, 2019 at 09:24:59PM +0800, zhenwei pi wrote:
>> Hit several typical cases of performance drop due to vm-exit:
>> case 1, jemalloc calls madvise(void *addr, size_t length, MADV_DONTNEED)
>> in the guest; the resulting IPIs cause a lot of exits.
>> case 2, atop collects IPC via perf hardware events in the guest; vpmu &
>> rdpmc exits increase a lot.
>> case 3, host memory compaction invalidates the TDP, and tdp_page_fault
>> can cause a huge loss of performance.
>> case 4, web services (written in golang) call futex and have higher
>> latency than in the host OS environment.
>>
>> Add more vm-exit reason debug entries; they are helpful for recognizing
>> the cause of a performance drop. In this patch:
>> 1, add more vm-exit reasons.
>> 2, add wrmsr details.
>> 3, add CR details.
>> 4, add hypercall details.
>>
>> Currently we can also achieve the same result with bpf.
>> Sample code (written by Fei Li <lifei.shirley@xxxxxxxxxxxxx>):
>>
>> from bcc import BPF
>>
>> b = BPF(text="""
>> /* Per-MSR exit statistics, keyed by MSR index (args->ecx). */
>> struct kvm_msr_exit_info {
>>     u32 pid;
>>     u32 tgid;
>>     u32 msr_exit_ct;
>> };
>> BPF_HASH(kvm_msr_exit, unsigned int, struct kvm_msr_exit_info, 1024);
>>
>> TRACEPOINT_PROBE(kvm, kvm_msr) {
>>     unsigned int ct = args->ecx;
>>     if (ct >= 0xffffffff) {
>>         return -1;
>>     }
>>
>>     u32 pid = bpf_get_current_pid_tgid() >> 32;
>>     u32 tgid = bpf_get_current_pid_tgid();
>>
>>     struct kvm_msr_exit_info *exit_info;
>>     struct kvm_msr_exit_info init_exit_info = {};
>>
>>     exit_info = kvm_msr_exit.lookup(&ct);
>>     if (exit_info != NULL) {
>>         exit_info->pid = pid;
>>         exit_info->tgid = tgid;
>>         exit_info->msr_exit_ct++;
>>     } else {
>>         init_exit_info.pid = pid;
>>         init_exit_info.tgid = tgid;
>>         init_exit_info.msr_exit_ct = 1;
>>         kvm_msr_exit.update(&ct, &init_exit_info);
>>     }
>>     return 0;
>> }
>> """)
>>
>> Run a wrmsr(MSR_IA32_TSCDEADLINE, val) benchmark in the guest
>> (CPU: Intel Gold 5118):
>> case 1, no bpf on host:                 ~1127 cycles/wrmsr.
>> case 2, sample bpf on host with JIT:    ~1223 cycles/wrmsr. --> +8.5%
>> case 3, sample bpf on host without JIT: ~1312 cycles/wrmsr. --> +16.4%
>>
>> So, debug entries are more efficient than the bpf method.
>
> How much does host performance matter? E.g. does high overhead interfere
> with debug, is this something you want to have running at all times, etc...

The intention is to have long-running statistics for monitoring etc.

In general, eBPF, ftrace, etc. are okay for not-so-hot code paths, but this
one turned out to be very difficult to do efficiently without such a code
change.

Fam
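
A note on the quoted sample: it only installs the tracepoint probe. A
minimal, hypothetical driver loop to dump the per-MSR counts from the
Python side could look like the following; the `b` object is the one built
above, and the 5-second polling interval and output format are illustrative
assumptions, not part of the original patch discussion:

from time import sleep

# Assumes `b` is the BPF object built from the sample text above and that
# the script runs as root on the host (required to attach the tracepoint).
exit_map = b["kvm_msr_exit"]
while True:
    sleep(5)
    # Sort by exit count so the hottest MSRs print first.
    for msr, info in sorted(exit_map.items(),
                            key=lambda kv: kv[1].msr_exit_ct,
                            reverse=True):
        print("msr 0x%08x pid %-6d exits %d"
              % (msr.value, info.pid, info.msr_exit_ct))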
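
For clarity, the percentages in the quoted benchmark are overhead relative
to the no-bpf baseline; a quick sketch of the arithmetic, using the numbers
from the quoted results:

base = 1127  # cycles/wrmsr with no bpf attached (quoted baseline)
for label, cycles in (("bpf with JIT", 1223), ("bpf without JIT", 1312)):
    overhead = 100.0 * (cycles - base) / base
    print("%s: +%.1f%% (%d extra cycles/wrmsr)"
          % (label, overhead, cycles - base))
# -> bpf with JIT: +8.5% (96 extra cycles/wrmsr)
# -> bpf without JIT: +16.4% (185 extra cycles/wrmsr)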