The previous discussion about how best to add SPE support to KVM [1] is heading in the direction of pinning only the buffer at EL2 when the guest enables profiling, instead of pinning the entire VM memory. Although better than pinning the entire VM at EL2, this still has some disadvantages:

1. Pinning memory at stage 2 goes against the design principle of secondary MMUs, which must reflect all changes in the primary (host's stage 1) page tables. This means that a mechanism to pin VM memory at stage 2 must be created from scratch just for SPE. Although I haven't done this yet, I'm a bit concerned that it will turn out to be fragile and/or complicated.

2. The architecture allows software to change the VA to IPA translations for the profiling buffer while the buffer is enabled, as long as profiling is disabled (the buffer is enabled, but sampling is disabled). Since SPE can be programmed to profile EL0 only, and there is no easy way for KVM to trap the exact moment when profiling becomes enabled in this scenario in order to translate the buffer's guest VAs to IPAs and pin the IPAs at stage 2, KVM must impose limitations on how a guest uses SPE for the emulation to work.

I've prototyped a new approach [2] which eliminates both disadvantages, but comes with its own set of drawbacks. The idea is to have KVM allocate a buffer in the kernel address space to profile the guest, and, when the buffer becomes full (or profiling is disabled for other reasons), to copy the contents of that buffer to guest memory.

I'll start with the advantages:

1. No memory pinning at stage 2.

2. No meaningful restrictions on how the guest programs SPE, since the translation of the guest VAs to IPAs is done by KVM after profiling has completed.

3. Neoverse N1 erratum 1978083 ("Incorrect programming of PMBPTR_EL1 might result in a deadlock") [6] is handled without any extra work.

As I see it, there are three main disadvantages:
1. The contents of the KVM buffer must be copied to the guest. In the prototype this is done all at once, when profiling is stopped [3]. Presumably this can be amortized by unmapping the pages corresponding to the guest buffer from stage 2 (or marking them as invalid) and copying the data only when the guest reads from those pages. Needs investigating.

2. When KVM profiles the guest, the owning exception level of the KVM buffer must necessarily be EL2. This means that while profiling is happening, PMBIDR_EL1.P = 1 (programming of the buffer is not allowed). PMBIDR_EL1 cannot be trapped without FEAT_FGT, so a guest that reads the register after profiling becomes enabled will read the P bit as 1. I cannot think of any valid reason for a guest to look at the bit after enabling profiling. With FEAT_FGT, KVM would be able to trap accesses to the register.

3. In the worst case scenario, when the entire VM memory is mapped in the host, this approach consumes more memory, because the memory for the buffer is separate from the memory allocated to the VM. On the plus side, there will always be less memory pinned in the host for the VM process, since only the buffer has to be pinned, instead of the buffer plus the guest's stage 1 translation tables (to avoid SPE encountering a stage 2 fault on a stage 1 translation table walk). This could be mitigated by providing an ioctl for userspace to set the maximum size of the buffer.

I prefer this new approach over pinning the buffer at stage 2. It is straightforward, less fragile, and doesn't limit how a guest can program SPE.

As for the prototype, I wrote it as a quick way to check that this approach is viable. It does not have SPE support for the nVHE case, because I would have had to figure out how to map a contiguous VA range in EL2's translation tables; supporting only the VHE case was a lot easier.
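To make the copy step concrete, here is a minimal userspace sketch of the "profile into a kernel buffer, copy to the guest when profiling stops" idea. All names (kvm_spe_drain, guest_va_to_ipa, the buffer sizes) are invented for illustration, guest memory and the kernel-side buffer are modelled as plain arrays, and the VA to IPA translation is an idmap, matching the prototype's current limitation; this is not the prototype's real API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096UL

static uint8_t guest_mem[4 * PAGE_SIZE]; /* stand-in for the guest's IPA space */
static uint8_t kvm_buf[2 * PAGE_SIZE];   /* buffer KVM profiles into at EL2 */
static size_t kvm_buf_used;              /* bytes written by SPE so far */

/*
 * Toy VA->IPA translation: an idmap, like the guests the prototype
 * supports. A complete implementation would walk the guest's stage 1
 * tables rooted at TTBR0_EL1/TTBR1_EL1 here.
 */
static uint64_t guest_va_to_ipa(uint64_t va)
{
	return va;
}

/*
 * Called when the guest disables profiling or the KVM buffer fills:
 * copy the records collected in the kernel buffer to the guest buffer.
 * The copy is split at page boundaries because contiguous guest VAs
 * need not map to contiguous IPAs.
 */
static void kvm_spe_drain(uint64_t guest_buf_va)
{
	size_t off = 0;

	while (off < kvm_buf_used) {
		uint64_t ipa = guest_va_to_ipa(guest_buf_va + off);
		size_t in_page = PAGE_SIZE - (ipa & (PAGE_SIZE - 1));
		size_t chunk = kvm_buf_used - off;

		if (chunk > in_page)
			chunk = in_page;
		memcpy(&guest_mem[ipa], &kvm_buf[off], chunk);
		off += chunk;
	}
	kvm_buf_used = 0;
}
```

The same per-page structure is what would make the deferred variant from disadvantage 1 possible: instead of calling the drain eagerly, the stage 2 entries covering the guest buffer could be invalidated and each page copied from the kernel buffer the first time the guest faults on it.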
The prototype doesn't have a stage 1 walker, so it's limited to guests that use id-mapped addresses from TTBR0_EL1 for the buffer (although it would be trivial to modify it to accept addresses from TTBR1_EL1) - I've used kvm-unit-tests for testing [4]. I've tested the prototype on the model and on an Ampere Altra.

For those interested, kvmtool support to run the prototype has also been added [5] (add --spe to the command line to run a VM).

[1] https://lore.kernel.org/all/Yl6+JWaP+mq2Nc0b@monolith.localdoman/
[2] https://gitlab.arm.com/linux-arm/linux-ae/-/tree/kvm-spe-v6-copy-buffer-wip4-without-nvhe
[3] https://gitlab.arm.com/linux-arm/linux-ae/-/blob/kvm-spe-v6-copy-buffer-wip4-without-nvhe/arch/arm64/kvm/spe.c#L197
[4] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/kvm-spe-v6-copy-buffer-wip4
[5] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/kvm-spe-v6-copy-buffer-wip4
[6] https://developer.arm.com/documentation/SDEN885747/latest

Thanks,
Alex

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm