On Wed, 29 Mar 2023 01:21:36 +0100, Reiji Watanabe <reijiw@xxxxxxxxxx> wrote: > > Currently, with VHE, KVM sets ER, CR, SW and EN bits of > PMUSERENR_EL0 to 1 on vcpu_load(). So, if the value of those bits > are cleared after vcpu_load() (the perf subsystem would do when PMU > counters are programmed for the guest), PMU access from the guest EL0 > might be trapped to the guest EL1 directly regardless of the current > PMUSERENR_EL0 value of the vCPU. + RobH. Is that what is done when the event is created and armv8pmu_start() called? This is... crap. The EL0 access thing breaks everything, and nobody tested it with KVM, obviously. I would be tempted to start mitigating it with the following: diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c index dde06c0f97f3..8063525bf3dd 100644 --- a/arch/arm64/kernel/perf_event.c +++ b/arch/arm64/kernel/perf_event.c @@ -806,17 +806,19 @@ static void armv8pmu_disable_event(struct perf_event *event) static void armv8pmu_start(struct arm_pmu *cpu_pmu) { - struct perf_event_context *ctx; - int nr_user = 0; + if (sysctl_perf_user_access) { + struct perf_event_context *ctx; + int nr_user = 0; - ctx = perf_cpu_task_ctx(); - if (ctx) - nr_user = ctx->nr_user; + ctx = perf_cpu_task_ctx(); + if (ctx) + nr_user = ctx->nr_user; - if (sysctl_perf_user_access && nr_user) - armv8pmu_enable_user_access(cpu_pmu); - else - armv8pmu_disable_user_access(); + if (nr_user) + armv8pmu_enable_user_access(cpu_pmu); + else + armv8pmu_disable_user_access(); + } /* Enable all counters */ armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E); but that's obviously not enough as we want it to work with EL0 access enabled on the host as well. What we miss is something that tells the PMU code "we're in a context where host userspace isn't present", and this would be completely skipped, relying on KVM to restore the appropriate state on vcpu_put(). But then the IPI stuff that controls EL0 can always come in and wreck things. Gahhh... I'm a bit reluctant to use the "save/restore all the time" hammer, because it only hides that the EL0 counter infrastructure is a bit broken. > With VHE, fix this by setting those bits of the register on every > guest entry (as with nVHE). Also, opportunistically make the similar > change for PMSELR_EL0, which is cleared by vcpu_load(), to ensure it > is always set to zero on guest entry (PMXEVCNTR_EL0 access might cause > UNDEF at EL1 instead of being trapped to EL2, depending on the value > of PMSELR_EL0). I think that would be more robust, although I don't > find any kernel code that writes PMSELR_EL0. This was changed a while ago to avoid using the selection register, see 0fdf1bb75953 ("arm64: perf: Avoid PMXEV* indirection"), and the rationale behind the reset of PMSELR_EL0 in 21cbe3cc8a48 ("arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest"). We *could* simply drop this zeroing of PMSELR_EL0 now that there is nothing else host-side that writes to it. But we need to agree on how to fix the above first. Thanks, M. -- Without deviation from the norm, progress is not possible.