On Tue, 20 Dec 2022 21:47:36 +0000, Oliver Upton <oliver.upton@xxxxxxxxx> wrote: > > Hi Marc, > > On Tue, Dec 20, 2022 at 08:09:21PM +0000, Marc Zyngier wrote: > > A recent development on the EFI front has resulted in guests having > > their page tables baked in the firmware binary, and mapped into > > the IPA space as part as a read-only memslot. > > as part of a > > > Not only this is legitimate, but it also results in added security, > > so thumbs up. However, this clashes mildly with our handling of a S1PTW > > as a write to correctly handle AF/DB updates to the S1 PTs, and results > > in the guest taking an abort it won't recover from (the PTs mapping the > > vectors will suffer freom the same problem...). > > To be clear, the read-only page tables already have the AF set, > right? They certainly must, or else the guest isn't getting far :) Yes, the guest definitely has the AF set in the PT, and is not trying to use the HW-assisted AF (which obviously wouldn't work). > > I understand you're trying to describe _why_ we promote S1PTW to a > write, but doing it inline with the context of the EFI issue makes > it slightly unclear. Could you break these ideas up into two > paragraphs and maybe spell out the fault conditions a bit more? > > A recent development on the EFI front has resulted in guests having > their page tables baked in the firmware binary, and mapped into the > IPA space as part of a read-only memslot. Not only is this legitimate, > but it also results in added security, so thumbs up. > > It is possible to take an S1PTW translation fault if the S1 PTs are > unmapped at stage-2. However, KVM unconditionally treats S1PTW as a > write to correctly handle hardware AF/DB updates to the S1 PTs. > Furthermore, KVM injects a data abort into the guest for S1PTW writes. > In the aforementioned case this results in the guest taking an abort > it won't recover from, as the S1 PTs mapping the vectors suffer from > the same problem. > > Dunno, maybe I stink at reading which is why I got confused in the > first place. Nothing wrong with you, just that my write-up is indeed sloppy. I'll copy paste the above, thanks! > > > So clearly our handling is... wrong. > > > > Instead, switch to a two-pronged approach: > > > > - On S1PTW translation fault, handle the fault as a read > > > > - On S1PTW permission fault, handle the fault as a write > > > > This is of no consequence to SW that *writes* to its PTs (the write > > will trigger a non-S1PTW fault), and SW that uses RO PTs will not > > use AF/DB anyway, as that'd be wrong. > > > > Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write > > fault on S1PTW permission fault on instruction fetch") do we end-up > > with two back-to-back faults (page being evicted and faulted back). > > I don't think this is a case worth optimising for. > > > > Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch") > > Signed-off-by: Marc Zyngier <maz@xxxxxxxxxx> > > Cc: stable@xxxxxxxxxxxxxxx > > --- > > arch/arm64/include/asm/kvm_emulate.h | 22 ++++++++++++++++++++-- > > 1 file changed, 20 insertions(+), 2 deletions(-) > > > > diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h > > index 9bdba47f7e14..fd6ad8b21f85 100644 > > --- a/arch/arm64/include/asm/kvm_emulate.h > > +++ b/arch/arm64/include/asm/kvm_emulate.h > > @@ -373,8 +373,26 @@ static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu) > > > > static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu) > > { > > - if (kvm_vcpu_abt_iss1tw(vcpu)) > > - return true; > > + if (kvm_vcpu_abt_iss1tw(vcpu)) { > > + /* > > + * Only a permission fault on a S1PTW should be > > + * considered as a write. Otherwise, page tables baked > > + * in a read-only memslot will result in an exception > > + * being delivered in the guest. > > Somewhat of a tangent, but: > > Aren't we somewhat unaligned with the KVM UAPI by injecting an > exception in this case? I know we've been doing it for a while, but it > flies in the face of the rules outlined in the > KVM_SET_USER_MEMORY_REGION documentation. That's an interesting point, and I certainly haven't considered that for faults introduced by page table walks. I'm not sure what userspace can do with that though. The problem is that this is a write for which we don't have useful data: although we know it is a page-table walker access, we don't know what it was about to write. The instruction that caused the write is meaningless (it could either be a load, a store, or an instruction fetch). How do you populate the data[] field then? If anything, this is closer to KVM_EXIT_ARM_NISV, for which we give userspace the full ESR and ask it to sort it out. I doubt it will be able to, but hey, maybe it is worth a shot. This would need to be a different exit reason though, as NISV is explicitly for non-memslot stuff. In any case, the documentation for KVM_SET_USER_MEMORY_REGION needs to reflect the fact that KVM_EXIT_MMIO cannot represent a fault due to a S1 PTW. > > > + * The drawback is that we end-up fauling twice if the > > typo: s/fauling/faulting/ > > > + * guest is using any of HW AF/DB: a translation fault > > + * to map the page containing the PT (read only at > > + * first), then a permission fault to allow the flags > > + * to be set. > > + */ > > + switch (kvm_vcpu_trap_get_fault_type(vcpu)) { > > + case ESR_ELx_FSC_PERM: > > + return true; > > + default: > > + return false; > > + } > > + } > > > > if (kvm_vcpu_trap_is_iabt(vcpu)) > > return false; > > -- > > 2.34.1 > > > > Besides the changelog/comment suggestions, the patch looks good to me. > > Reviewed-by: Oliver Upton <oliver.upton@xxxxxxxxx> Thanks for the quick review! I'll wait a bit before respinning the series, as I'd like to get closure on the UAPI point you have raised. Cheers, M. -- Without deviation from the norm, progress is not possible.