Hello,

On Fri, Sep 06, 2024 at 11:40:51AM -0400, Liang, Kan wrote:
>
>
> On 2024-09-06 6:59 a.m., Mi, Dapeng wrote:
> >
> > On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> >> From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> >>
> >> Currently, the guest and host share the PMU resources when a guest is
> >> running. KVM has to create an extra virtual event to simulate the
> >> guest's event, which brings several issues, e.g., high overhead and
> >> inaccuracy.
> >>
> >> A new passthrough PMU method is proposed to address the issue. It
> >> requires that the PMU resources can be fully occupied by the guest
> >> while it's running. Two new interfaces are implemented to fulfill the
> >> requirement. The hypervisor should invoke the interfaces when creating
> >> a guest that wants the passthrough PMU capability.
> >>
> >> The PMU resources should only be temporarily occupied as a whole while
> >> a guest is running. When the guest exits, the PMU resources are again
> >> shared among different users.
> >>
> >> The exclude_guest event modifier is used to guarantee the exclusive
> >> occupation of the PMU resources. When creating a guest, the hypervisor
> >> should check whether there are !exclude_guest events in the system.
> >> If yes, the creation should fail, because some PMU resources have been
> >> occupied by other users.
> >> If no, the PMU resources can be safely accessed by the guest directly.
> >> Perf guarantees that no new !exclude_guest events are created while a
> >> guest is running.
> >>
> >> Only the passthrough PMU is affected; other PMUs, e.g., the uncore and
> >> SW PMUs, are not. The behavior of those PMUs is not changed. The guest
> >> enter/exit interfaces should only impact the supported PMUs.
> >> Add a new PERF_PMU_CAP_PASSTHROUGH_VPMU flag to indicate the PMUs that
> >> support the feature.
> >>
> >> Add nr_include_guest_events to track the !exclude_guest events of PMUs
> >> with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> >>
> >> Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> >> Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> >> Tested-by: Yongwei Ma <yongwei.ma@xxxxxxxxx>
> >> Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> >> ---
> >>  include/linux/perf_event.h | 10 ++++++
> >>  kernel/events/core.c       | 66 ++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 76 insertions(+)
> >>
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index a5304ae8c654..45d1ea82aa21 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -291,6 +291,7 @@ struct perf_event_pmu_context;
> >>  #define PERF_PMU_CAP_NO_EXCLUDE		0x0040
> >>  #define PERF_PMU_CAP_AUX_OUTPUT		0x0080
> >>  #define PERF_PMU_CAP_EXTENDED_HW_TYPE	0x0100
> >> +#define PERF_PMU_CAP_PASSTHROUGH_VPMU	0x0200
> >>
> >>  struct perf_output_handle;
> >>
> >> @@ -1728,6 +1729,8 @@ extern void perf_event_task_tick(void);
> >>  extern int perf_event_account_interrupt(struct perf_event *event);
> >>  extern int perf_event_period(struct perf_event *event, u64 value);
> >>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
> >> +int perf_get_mediated_pmu(void);
> >> +void perf_put_mediated_pmu(void);
> >>  #else /* !CONFIG_PERF_EVENTS: */
> >>  static inline void *
> >>  perf_aux_output_begin(struct perf_output_handle *handle,
> >> @@ -1814,6 +1817,13 @@ static inline u64 perf_event_pause(struct perf_event *event, bool reset)
> >>  {
> >>  	return 0;
> >>  }
> >> +
> >> +static inline int perf_get_mediated_pmu(void)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static inline void perf_put_mediated_pmu(void) { }
> >>  #endif
> >>
> >>  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
> >> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >> index 8f908f077935..45868d276cde 100644
> >> --- a/kernel/events/core.c
> >> +++ b/kernel/events/core.c
> >> @@ -402,6 +402,20 @@ static atomic_t nr_bpf_events __read_mostly;
> >>  static atomic_t nr_cgroup_events __read_mostly;
> >>  static atomic_t nr_text_poke_events __read_mostly;
> >>  static atomic_t nr_build_id_events __read_mostly;
> >> +static atomic_t nr_include_guest_events __read_mostly;
> >> +
> >> +static atomic_t nr_mediated_pmu_vms;
> >> +static DEFINE_MUTEX(perf_mediated_pmu_mutex);
> >> +
> >> +/* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
> >> +static inline bool is_include_guest_event(struct perf_event *event)
> >> +{
> >> +	if ((event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
> >> +	    !event->attr.exclude_guest)
> >> +		return true;
> >> +
> >> +	return false;
> >> +}
> >>
> >>  static LIST_HEAD(pmus);
> >>  static DEFINE_MUTEX(pmus_lock);
> >> @@ -5212,6 +5226,9 @@ static void _free_event(struct perf_event *event)
> >>
> >>  	unaccount_event(event);
> >>
> >> +	if (is_include_guest_event(event))
> >> +		atomic_dec(&nr_include_guest_events);
> >> +
> >>  	security_perf_event_free(event);
> >>
> >>  	if (event->rb) {
> >> @@ -5769,6 +5786,36 @@ u64 perf_event_pause(struct perf_event *event, bool reset)
> >>  }
> >>  EXPORT_SYMBOL_GPL(perf_event_pause);
> >>
> >> +/*
> >> + * Currently invoked at VM creation to
> >> + * - Check whether there are existing !exclude_guest events of PMU with
> >> + *   PERF_PMU_CAP_PASSTHROUGH_VPMU
> >> + * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
> >> + *   PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU
> >> + *
> >> + * No impact for the PMU without PERF_PMU_CAP_PASSTHROUGH_VPMU. The perf
> >> + * still owns all the PMU resources.
> >> + */
> >> +int perf_get_mediated_pmu(void)
> >> +{
> >> +	guard(mutex)(&perf_mediated_pmu_mutex);
> >> +	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
> >> +		return 0;
> >> +
> >> +	if (atomic_read(&nr_include_guest_events))
> >> +		return -EBUSY;
> >> +
> >> +	atomic_inc(&nr_mediated_pmu_vms);
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
> >> +
> >> +void perf_put_mediated_pmu(void)
> >> +{
> >> +	atomic_dec(&nr_mediated_pmu_vms);
> >> +}
> >> +EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> >> +
> >>  /*
> >>   * Holding the top-level event's child_mutex means that any
> >>   * descendant process that has inherited this event will block
> >> @@ -11907,6 +11954,17 @@ static void account_event(struct perf_event *event)
> >>  	account_pmu_sb_event(event);
> >>  }
> >>
> >> +static int perf_account_include_guest_event(void)
> >> +{
> >> +	guard(mutex)(&perf_mediated_pmu_mutex);
> >> +
> >> +	if (atomic_read(&nr_mediated_pmu_vms))
> >> +		return -EACCES;
> >
> > Kan, Namhyung posted a patchset
> > https://lore.kernel.org/all/20240904064131.2377873-1-namhyung@xxxxxxxxxx/
> > which would stop setting the exclude_guest flag in perf tools by
> > default. This may impact the current mediated vPMU solution, but
> > fortunately the patchset provides a fallback mechanism that adds the
> > exclude_guest flag if the kernel returns "EOPNOTSUPP".
> >
> > So we'd better return "EOPNOTSUPP" instead of "EACCES" here. BTW,
> > returning "EOPNOTSUPP" here also looks more reasonable than "EACCES".
>
> It seems the existing Apple M1 PMU already returns "EOPNOTSUPP" for
> !exclude_guest events. Yes, we should use the same error code.

Yep, it'd be much easier to handle if it returns the same error code.

Thanks,
Namhyung
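
For context, the call pattern a hypervisor is expected to follow with the
two new interfaces can be reconstructed from the patch. Below is a minimal
sketch; the my_vm type and the my_vm_create()/my_vm_destroy() wrappers are
illustrative placeholders, only perf_get_mediated_pmu() and
perf_put_mediated_pmu() come from the patch itself:

#include <linux/perf_event.h>

struct my_vm {
	bool mediated_pmu;	/* illustrative: VM holds the mediated PMU */
	/* ... other VM state ... */
};

/* Take the core PMU exclusively for the lifetime of a passthrough-vPMU VM. */
static int my_vm_create(struct my_vm *vm)
{
	int ret;

	/*
	 * Fails with -EBUSY if any !exclude_guest event already exists
	 * on a PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU.
	 */
	ret = perf_get_mediated_pmu();
	if (ret)
		return ret;

	vm->mediated_pmu = true;
	return 0;
}

static void my_vm_destroy(struct my_vm *vm)
{
	/* Release the PMU so host users can create !exclude_guest events again. */
	if (vm->mediated_pmu)
		perf_put_mediated_pmu();
}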
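
The tools-side fallback Dapeng describes amounts to retrying
perf_event_open() with exclude_guest set once the first attempt is
rejected. A rough userspace sketch, assuming the kernel rejects
!exclude_guest events with EOPNOTSUPP as proposed in the discussion above
(the open_cycles_event() helper name is made up for illustration):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

static int open_cycles_event(pid_t pid, int cpu)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;

	/* First try to count both host and guest (!exclude_guest). */
	fd = syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
	if (fd < 0 && errno == EOPNOTSUPP) {
		/* A mediated-vPMU VM is running: fall back to host-only counting. */
		attr.exclude_guest = 1;
		fd = syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
	}
	return fd;
}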