On Thu, Apr 11, 2024, Kan Liang wrote:
> >> +/*
> >> + * When a guest enters, force all active events of the PMU, which supports
> >> + * the VPMU_PASSTHROUGH feature, to be scheduled out. The events of other
> >> + * PMUs, such as uncore PMU, should not be impacted. The guest can
> >> + * temporarily own all counters of the PMU.
> >> + * During the period, all the creation of the new event of the PMU with
> >> + * !exclude_guest are error out.
> >> + */
> >> +void perf_guest_enter(void)
> >> +{
> >> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> >> +
> >> +	lockdep_assert_irqs_disabled();
> >> +
> >> +	if (__this_cpu_read(__perf_force_exclude_guest))
> >
> > This should be a WARN_ON_ONCE, no?
>
> To debug the improper behavior of KVM?

Not so much "debug" as ensure that the platform owner notices that KVM is
buggy.

> >> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
> >> +						 int cpu, struct task_struct *task)
> >> +{
> >> +	bool *force_exclude_guest = NULL;
> >> +
> >> +	if (!has_vpmu_passthrough_cap(event->pmu))
> >> +		return 0;
> >> +
> >> +	if (event->attr.exclude_guest)
> >> +		return 0;
> >> +
> >> +	if (cpu != -1) {
> >> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
> >> +	} else if (task && (task->flags & PF_VCPU)) {
> >> +		/*
> >> +		 * Just need to check the running CPU in the event creation. If the
> >> +		 * task is moved to another CPU which supports the force_exclude_guest.
> >> +		 * The event will filtered out and be moved to the error stage. See
> >> +		 * merge_sched_in().
> >> +		 */
> >> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
> >> +	}
> >
> > These checks are extremely racy, I don't see how this can possibly do the
> > right thing.  PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
> > to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
> > include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
> > time slices).
> >
> This is to reject an !exclude_guest event creation for a running
> "passthrough" guest from host perf tool.
> Could you please suggest a way to detect it via the struct task_struct?
>
> > Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
> > perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
> > per-CPU context mutex.
>
> Do you mean that the perf_guest_enter() check could be happened right
> after the perf_force_exclude_guest_check()?
> It's possible. For this case, the event can still be created. It will be
> treated as an existing event and handled in merge_sched_in(). It will
> never be scheduled when a guest is running.
>
> The perf_force_exclude_guest_check() is to make sure most of the cases
> can be rejected at the creation place. For the corner cases, they will
> be rejected in the schedule stage.

Ah, the "rejected in the schedule stage" is what I'm missing.

But that creates a gross ABI, because IIUC, event creation will "randomly"
succeed based on whether or not a CPU happens to be running in a KVM guest.
I.e. it's not just the kernel code that has races, the entire event creation
is one big race.

What if perf had a global knob to enable/disable mediated PMU support?  Then
when KVM is loaded with enable_mediated_pmu=true, call into perf to (a) check
that there are no existing !exclude_guest events (this part could be
optional), and (b) set the global knob to reject all new !exclude_guest
events (for the core PMU?).
Hmm, or probably better, do it at VM creation.  That has the advantage of
playing nice with CONFIG_KVM=y (perf could reject the enabling without
completely breaking KVM), and not causing problems if KVM is auto-probed but
the user doesn't actually want to run VMs.

E.g. (very roughly)

int x86_perf_get_mediated_pmu(void)
{
	if (refcount_inc_not_zero(...))
		return 0;

	if (<system wide events>)
		return -EBUSY;

	<slow path with locking>
}

void x86_perf_put_mediated_pmu(void)
{
	if (!refcount_dec_and_test(...))
		return;

	<slow path with locking>
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1bbf312cbd73..f2994377ef44 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12467,6 +12467,12 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (type)
 		return -EINVAL;
 
+	if (enable_mediated_pmu) {
+		ret = x86_perf_get_mediated_pmu();
+		if (ret)
+			return ret;
+	}
+
 	ret = kvm_page_track_init(kvm);
 	if (ret)
 		goto out;
@@ -12518,6 +12524,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm_mmu_uninit_vm(kvm);
 	kvm_page_track_cleanup(kvm);
 out:
+	x86_perf_put_mediated_pmu();
 	return ret;
 }
 
@@ -12659,6 +12666,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_page_track_cleanup(kvm);
 	kvm_xen_destroy_vm(kvm);
 	kvm_hv_destroy_vm(kvm);
+	x86_perf_put_mediated_pmu();
 }
 
 static void memslot_rmap_free(struct kvm_memory_slot *slot)
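
Purely as an illustration of the above, one way the "<system wide events>"
check and the "<slow path with locking>" placeholders could be filled in is
sketched below.  All of the names (perf_mediated_pmu_mutex,
nr_mediated_pmu_vms, nr_include_guest_events) are hypothetical, not existing
perf symbols.

#include <linux/atomic.h>
#include <linux/mutex.h>
#include <linux/refcount.h>

static DEFINE_MUTEX(perf_mediated_pmu_mutex);
static refcount_t nr_mediated_pmu_vms = REFCOUNT_INIT(0);

/* Hypothetical counter, bumped whenever a !exclude_guest event is created. */
static atomic_t nr_include_guest_events;

int x86_perf_get_mediated_pmu(void)
{
	int ret = 0;

	/* Fast path: another VM already flipped the global knob. */
	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
		return 0;

	mutex_lock(&perf_mediated_pmu_mutex);

	/* Someone else won the race while we waited for the mutex. */
	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
		goto unlock;

	/* Reject the transition if !exclude_guest events already exist. */
	if (atomic_read(&nr_include_guest_events)) {
		ret = -EBUSY;
		goto unlock;
	}

	refcount_set(&nr_mediated_pmu_vms, 1);
unlock:
	mutex_unlock(&perf_mediated_pmu_mutex);
	return ret;
}

void x86_perf_put_mediated_pmu(void)
{
	/* Fast path: other VMs still hold references. */
	if (refcount_dec_not_one(&nr_mediated_pmu_vms))
		return;

	/* Last reference: drop it under the mutex, pairing with the slow get(). */
	mutex_lock(&perf_mediated_pmu_mutex);
	refcount_set(&nr_mediated_pmu_vms, 0);
	mutex_unlock(&perf_mediated_pmu_mutex);
}

The idea being that only the first get() and the final put() serialize on the
mutex; once any VM holds a reference, additional VM creations just bump the
refcount.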