On Sun, Sep 19, 2021 at 02:36:46PM +0100, Marc Zyngier wrote:
> Hi Russell,
>
> Thanks for the heads-up.
>
> On Sat, 18 Sep 2021 17:17:45 +0100,
> "Russell King (Oracle)" <linux@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > This morning, I upgraded my VM host from Debian Buster to Debian
> > Bullseye. This host (Armada 8040) runs libvirt. I placed all the VMs
> > into managedsave, which basically means their state is saved out by
> > QEMU, ready to be resumed once the upgrade is complete.
> >
> > Initially, I was running 5.11 on the host, which didn't have POSIX
> > ACLs enabled on tmpfs. Sadly, upgrading to Bullseye now requires
> > this to be enabled, so I upgraded the kernel to 5.13 to avoid this
> > problem - without POSIX ACLs on tmpfs, qemu refuses to even start.
> >
> > Under Bullseye with 5.13, I could start new VMs, but I could not
> > resume the saved VMs - it would spit out:
> >
> > 2021-09-18T11:14:14.456227Z qemu-system-aarch64: Invalid value 236 expecting positive value <= 162
> > 2021-09-18T11:14:14.456269Z qemu-system-aarch64: Failed to load cpu:cpreg_vmstate_array_len
> > 2021-09-18T11:14:14.456279Z qemu-system-aarch64: error while loading state for instance 0x0 of device 'cpu'
> > 2021-09-18T11:14:14.456887Z qemu-system-aarch64: load of migration failed: Invalid argument
> >
> > Essentially, what this is complaining about is that the register
> > list returned by the KVM_GET_REG_LIST ioctl has shrunk: the saved
> > VM has more registers that need to be set (236 of them, after
> > QEMU's filtering) than are actually available under 5.13 (162,
> > after QEMU's filtering).
> >
> > After much debugging, and writing a program to create a VM and CPU
> > and retrieve the register list etc, I have finally tracked down
> > exactly what is going on...
> >
> > Under 5.13, KVM believes there is no PMU, so it doesn't advertise
> > the PMU registers, despite the hardware having a PMU and the kernel
> > having support for it.
> >
> > kvm_arm_support_pmu_v3() determines whether the PMU_V3 feature is
> > available or not.
> >
> > Under 5.11, this was determined via:
> >
> >   perf_num_counters() > 0
> >
> > which in turn was derived from (essentially):
> >
> >   __oprofile_cpu_pmu && __oprofile_cpu_pmu->num_events > 0
> >
> > __oprofile_cpu_pmu can be set at any time after the PMU has been
> > initialised, and due to initialisation ordering, that appears to
> > happen much later in the kernel boot.
> >
> > However, under 5.13, oprofile has been removed, and this test is no
> > longer present. Instead, KVM attempts to determine the availability
> > of PMUv3 when it initialises, and fails to do so because the PMU
> > has not yet been initialised. If there is no PMU at the point KVM
> > initialises, then KVM will never advertise a PMU.
> >
> > This change of behaviour is what breaks KVM on Armada 8040, causing
> > a userspace regression.
> >
> > What makes this more confusing is that the PMU errors have gone:
> >
> > 5.13:
> > [ 0.170514] PCI: CLS 0 bytes, default 64
> > [ 0.171085] kvm [1]: IPA Size Limit: 44 bits
> > [ 0.172532] kvm [1]: vgic interrupt IRQ9
> > [ 0.172650] kvm: pmu event creation failed -2
> > [ 0.172688] kvm [1]: Hyp mode initialized successfully
> > ...
> > [ 1.479833] hw perfevents: enabled with armv8_cortex_a72 PMU driver, 7 counters available
> >
> > [...]
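> >
> > A minimal sketch of such a probe (illustrative only - not the
> > exact program, and with most error handling trimmed):
> >
> >   #include <fcntl.h>
> >   #include <stdio.h>
> >   #include <stdlib.h>
> >   #include <sys/ioctl.h>
> >   #include <linux/kvm.h>
> >
> >   int main(void)
> >   {
> >           int kvm = open("/dev/kvm", O_RDWR);
> >           int vm = ioctl(kvm, KVM_CREATE_VM, 0);
> >           int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
> >           struct kvm_vcpu_init init;
> >           struct kvm_reg_list probe = { .n = 0 }, *list;
> >
> >           /* Initialise the vCPU as the host's preferred target */
> >           ioctl(vm, KVM_ARM_PREFERRED_TARGET, &init);
> >           ioctl(vcpu, KVM_ARM_VCPU_INIT, &init);
> >
> >           /*
> >            * With n == 0 this fails with E2BIG, but fills in the
> >            * number of registers KVM is willing to expose.
> >            */
> >           ioctl(vcpu, KVM_GET_REG_LIST, &probe);
> >
> >           list = calloc(1, sizeof(*list) + probe.n * sizeof(__u64));
> >           list->n = probe.n;
> >           if (ioctl(vcpu, KVM_GET_REG_LIST, list) < 0) {
> >                   perror("KVM_GET_REG_LIST");
> >                   return 1;
> >           }
> >
> >           printf("%llu registers\n", (unsigned long long)list->n);
> >           return 0;
> >   }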
>
> Urgh. That's a bummer.
>
> The PMU driver only comes up once it has found its interrupt
> controller, which on the Armada 8040 is not the GIC, but some weird
> thing on the side that doesn't actually serve any real purpose. On HW
> where the PMU is directly wired into the GIC, it all works fine,
> though by luck rather than by design.
>
> Anyway, rant over. This is a bug that needs addressing so that KVM
> can initialise correctly irrespective of the probing order. This
> probably means that the static key controlling KVM's behaviour wrt
> the PMU must be controlled by the PMU infrastructure itself, rather
> than KVM trying to probe for it.
>
> Can you please give the following hack a go (on top of 5.15-rc1)?
> I've briefly tested it on my McBin, and it did the trick. I've also
> tested it on the M1 (which really doesn't have an architectural PMU)
> to verify that it was correctly failing.
>
> Thanks,
>
> M.
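>
> (For context, the static key in question is consumed by KVM roughly
> like this - existing code, quoted from memory rather than part of
> the patch below:
>
>   DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
>
>   static __always_inline bool kvm_arm_support_pmu_v3(void)
>   {
>           return static_branch_likely(&kvm_arm_pmu_available);
>   }
>
> so whoever flips the key decides whether guests are ever offered a
> PMU.)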
>
> From 9c26e3e6bbcbc3a583b3974e7a9017029d31fe29 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@xxxxxxxxxx>
> Date: Sun, 19 Sep 2021 14:09:49 +0100
> Subject: [PATCH] KVM: arm64: Fix PMU probe ordering
>
> Russell reported that since 5.13, KVM's probing of the PMU has
> started to fail on his HW. As it turns out, there is an implicit
> ordering dependency between the architectural PMU probing code and
> KVM's own probing. If, due to probe ordering reasons, KVM probes
> before the PMU driver, it will fail to detect the PMU and prevent it
> from being advertised to guests as well as the VMM.
>
> Obviously, this is one probing too many, and we should be able to
> deal with any ordering.
>
> Add a callback from the PMU code into KVM to advertise the
> registration of a host CPU PMU, allowing for any probing order.
>
> Fixes: 5421db1be3b1 ("KVM: arm64: Divorce the perf code from oprofile helpers")
> Reported-by: "Russell King (Oracle)" <linux@xxxxxxxxxxxxxxx>
> Signed-off-by: Marc Zyngier <maz@xxxxxxxxxx>
> Link: https://lore.kernel.org/r/YUYRKVflRtUytzy5@xxxxxxxxxxxxxxxxxxxxx
> Cc: stable@xxxxxxxxxxxxxxx
> ---
>  arch/arm64/kvm/perf.c        |  3 ---
>  arch/arm64/kvm/pmu-emul.c    | 12 +++++++++++-
>  drivers/perf/arm_pmu.c       |  2 ++
>  include/kvm/arm_pmu.h        |  3 ---
>  include/linux/perf/arm_pmu.h |  6 ++++++
>  5 files changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
> index f9bb3b14130e..c84fe24b2ea1 100644
> --- a/arch/arm64/kvm/perf.c
> +++ b/arch/arm64/kvm/perf.c
> @@ -50,9 +50,6 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
>  
>  int kvm_perf_init(void)
>  {
> -        if (kvm_pmu_probe_pmuver() != ID_AA64DFR0_PMUVER_IMP_DEF && !is_protected_kvm_enabled())
> -                static_branch_enable(&kvm_arm_pmu_available);
> -
>          return perf_register_guest_info_callbacks(&kvm_guest_cbs);
>  }
>  
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index f5065f23b413..588100c52f34 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -740,7 +740,17 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data,
>          kvm_pmu_create_perf_event(vcpu, select_idx);
>  }
>  
> -int kvm_pmu_probe_pmuver(void)
> +void kvm_host_pmu_init(struct arm_pmu *pmu)
> +{
> +        if (pmu->pmuver != 0 &&
> +            pmu->pmuver != ID_AA64DFR0_PMUVER_IMP_DEF &&
> +            !is_protected_kvm_enabled()) {
> +                static_branch_enable(&kvm_arm_pmu_available);
> +                kvm_info("PMU detected and enabled\n");
> +        }
> +}
> +
> +static int kvm_pmu_probe_pmuver(void)
>  {
>          struct perf_event_attr attr = { };
>          struct perf_event *event;
> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index 3cbc3baf087f..295cc7952d0e 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -952,6 +952,8 @@ int armpmu_register(struct arm_pmu *pmu)
>                  pmu->name, pmu->num_events,
>                  has_nmi ? ", using NMIs" : "");
>  
> +        kvm_host_pmu_init(pmu);

Just a nit, but I think this will get called for each PMU we probe on a
big.LITTLE system, which is probably harmless but possibly not what you
want?
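
If it ever does matter, the hook could bail out early once the key has
been flipped - a hypothetical, untested variant, not what the patch
above does:

        void kvm_host_pmu_init(struct arm_pmu *pmu)
        {
                /* Let only the first suitable PMU flip the key */
                if (static_branch_likely(&kvm_arm_pmu_available))
                        return;

                if (pmu->pmuver != 0 &&
                    pmu->pmuver != ID_AA64DFR0_PMUVER_IMP_DEF &&
                    !is_protected_kvm_enabled()) {
                        static_branch_enable(&kvm_arm_pmu_available);
                        kvm_info("PMU detected and enabled\n");
                }
        }

Will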