Re: [PATCH] perf/core: fix the bug in the event multiplexing

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Marc,

在 2023/8/9 16:48, Marc Zyngier 写道:
On Wed, 09 Aug 2023 02:39:53 +0100,
Huang Shijie <shijie@xxxxxxxxxxxxxxxxxxxxxx> wrote:

For a start, please provide a sensible subject line for your patch.
"fix the bug" is not exactly descriptive, and I'd argue that if there
was only one bug left, I'd have taken an early retirement by now.
okay, I will change the subject in the future.
1.) Background.
    1.1) In arm64, run a virtual guest with Qemu, and bind the guest
Is that with QEMU in system emulation mode? Or QEMU as a VMM for KVM?
Run the Qemu as a VMM for KVM.

         to core 33 and run program "a" in guest.
Is core 33 significant? Is the program itself significant?

It is not a significant one, just chosed by random.

The program is just used for testing the perf, not a significant one.

         The code of "a" shows below:
    	----------------------------------------------------------
		#include <stdio.h>

		int main()
		{
			unsigned long i = 0;

			for (;;) {
				i++;
			}

			printf("i:%ld\n", i);
			return 0;
		}
    	----------------------------------------------------------

    1.2) Use the following perf command in host:
       #perf stat -e cycles:G,cycles:H -C 33 -I 1000 sleep 1
           #           time             counts unit events
                1.000817400      3,299,471,572      cycles:G
                1.000817400          3,240,586      cycles:H

        This result is correct, my cpu's frequency is 3.3G.

    1.3) Use the following perf command in host:
       #perf stat -e cycles:G,cycles:H -C 33 -d -d  -I 1000 sleep 1
             time             counts unit events
      1.000831480        153,634,097      cycles:G                                                                (70.03%)
      1.000831480      3,147,940,599      cycles:H                                                                (70.03%)
      1.000831480      1,143,598,527      L1-dcache-loads                                                         (70.03%)
      1.000831480              9,986      L1-dcache-load-misses            #    0.00% of all L1-dcache accesses   (70.03%)
      1.000831480    <not supported>      LLC-loads
      1.000831480    <not supported>      LLC-load-misses
      1.000831480        580,887,696      L1-icache-loads                                                         (70.03%)
      1.000831480             77,855      L1-icache-load-misses            #    0.01% of all L1-icache accesses   (70.03%)
      1.000831480      6,112,224,612      dTLB-loads                                                              (70.03%)
      1.000831480             16,222      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  (69.94%)
      1.000831480        590,015,996      iTLB-loads                                                              (59.95%)
      1.000831480                505      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  (59.95%)

        This result is wrong. The "cycle:G" should be nearly 3.3G.

2.) Root cause.
	There is only 7 counters in my arm64 platform:
	  (one cycle counter) + (6 normal counters)

	In 1.3 above, we will use 10 event counters.
	Since we only have 7 counters, the perf core will trigger
        	event multiplexing in hrtimer:
	     merge_sched_in() -->perf_mux_hrtimer_restart() -->
	     perf_rotate_context().

        In the perf_rotate_context(), it does not restore some PMU registers
        as context_switch() does.  In context_switch():
              kvm_sched_in()  --> kvm_vcpu_pmu_restore_guest()
              kvm_sched_out() --> kvm_vcpu_pmu_restore_host()

        So we got wrong result.

3.) About this patch.
         3.1) Add arch_perf_rotate_pmu_set()
         3.2) Add is_guest().
	     Check the context for hrtimer.
	3.3) In arm64's arch_perf_rotate_pmu_set(),
        	     set the PMU registers by the context.

4.) Test result of this patch:
       #perf stat -e cycles:G,cycles:H -C 33 -d -d  -I 1000 sleep 1
             time             counts unit events
      1.000817360      3,297,898,244      cycles:G                                                                (70.03%)
      1.000817360          2,719,941      cycles:H                                                                (70.03%)
      1.000817360            883,764      L1-dcache-loads                                                         (70.03%)
      1.000817360             17,517      L1-dcache-load-misses            #    1.98% of all L1-dcache accesses   (70.03%)
      1.000817360    <not supported>      LLC-loads
      1.000817360    <not supported>      LLC-load-misses
      1.000817360          1,033,816      L1-icache-loads                                                         (70.03%)
      1.000817360            103,839      L1-icache-load-misses            #   10.04% of all L1-icache accesses   (70.03%)
      1.000817360            982,401      dTLB-loads                                                              (70.03%)
      1.000817360             28,272      dTLB-load-misses                 #    2.88% of all dTLB cache accesses  (69.94%)
      1.000817360            972,072      iTLB-loads                                                              (59.95%)
      1.000817360                772      iTLB-load-misses                 #    0.08% of all iTLB cache accesses  (59.95%)

     The result is correct. The "cycle:G" is nearly 3.3G now.

Signed-off-by: Huang Shijie <shijie@xxxxxxxxxxxxxxxxxxxxxx>
---
  arch/arm64/kvm/pmu.c     | 8 ++++++++
  include/linux/kvm_host.h | 1 +
  kernel/events/core.c     | 5 +++++
  virt/kvm/kvm_main.c      | 9 +++++++++
  4 files changed, 23 insertions(+)

diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 121f1a14c829..a6815c3f0c4e 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -210,6 +210,14 @@ void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu)
  	kvm_vcpu_pmu_disable_el0(events_guest);
  }
+void arch_perf_rotate_pmu_set(void)
+{
+	if (is_guest())
+		kvm_vcpu_pmu_restore_guest(NULL);
+	else
+		kvm_vcpu_pmu_restore_host(NULL);
+}
So we're now randomly poking at the counters even when no guest is
running, based on whatever is stashed in internal KVM data structures?
I'm sure this is going to work really well.

Hint: even if these functions don't directly look at the vcpu pointer,
passing NULL is a really bad idea. It is a sure sign that you don't
have the context on which to perform the action you're trying to do.

This really shouldn't do *anything* when the rotation process is not
preempting a guest.
Thanks for explanation, I will try Oliver's second method to resolve this issue.
+
  /*
   * With VHE, keep track of the PMUSERENR_EL0 value for the host EL0 on the pCPU
   * where PMUSERENR_EL0 for the guest is loaded, since PMUSERENR_EL0 is switched
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..e350cbc8190f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -931,6 +931,7 @@ void kvm_destroy_vcpus(struct kvm *kvm);
void vcpu_load(struct kvm_vcpu *vcpu);
  void vcpu_put(struct kvm_vcpu *vcpu);
+bool is_guest(void);
Why do we need this (not to mention the poor choice of name)? We
already have kvm_get_running_vcpu(), which does everything you need
(and gives you the actual context).
yes.
#ifdef __KVM_HAVE_IOAPIC
  void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6fd9272eec6e..fe78f9d17eba 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4229,6 +4229,10 @@ ctx_event_to_rotate(struct perf_event_pmu_context *pmu_ctx)
  	return event;
  }
+void __weak arch_perf_rotate_pmu_set(void)
+{
+}
+
  static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
  {
  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
@@ -4282,6 +4286,7 @@ static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
  	if (task_event || (task_epc && cpu_event))
  		__pmu_ctx_sched_in(task_epc->ctx, pmu);
+ arch_perf_rotate_pmu_set();
KVM already supports hooking into the perf core using the
perf_guest_info_callbacks structure. Why should we need a separate
mechanism?
okay, I will check it too.

  	perf_pmu_enable(pmu);
  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..a77d336552be 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -218,6 +218,15 @@ void vcpu_load(struct kvm_vcpu *vcpu)
  }
  EXPORT_SYMBOL_GPL(vcpu_load);
+/* Do we in the guest? */
+bool is_guest(void)
+{
+	struct kvm_vcpu *vcpu;
+
+	vcpu = __this_cpu_read(kvm_running_vcpu);
+	return !!vcpu;
+}
+
  void vcpu_put(struct kvm_vcpu *vcpu)
  {
  	preempt_disable();
It looks like you've identified an actual issue. However, I'm highly
sceptical of the implementation. This really needs some more work.


Thanks again.

Huang Shijie




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux