On Tue, Oct 18, 2022 at 03:05:14PM +0100, Will Deacon wrote:
> Hi Mark,
> 
> Cheers for having a look.
> 
> On Tue, Oct 18, 2022 at 12:06:14PM +0100, Mark Rutland wrote:
> > On Mon, Oct 17, 2022 at 12:51:58PM +0100, Will Deacon wrote:
> > > diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> > > index d3a3b47181de..b77215630d5c 100644
> > > --- a/arch/arm64/kvm/hyp/nvhe/mm.c
> > > +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> > > @@ -14,6 +14,7 @@
> > >  #include <nvhe/early_alloc.h>
> > >  #include <nvhe/gfp.h>
> > >  #include <nvhe/memory.h>
> > > +#include <nvhe/mem_protect.h>
> > >  #include <nvhe/mm.h>
> > >  #include <nvhe/spinlock.h>
> > >  
> > > @@ -25,6 +26,12 @@ unsigned int hyp_memblock_nr;
> > >  
> > >  static u64 __io_map_base;
> > >  
> > > +struct hyp_fixmap_slot {
> > > +	u64 addr;
> > > +	kvm_pte_t *ptep;
> > > +};
> > > +static DEFINE_PER_CPU(struct hyp_fixmap_slot, fixmap_slots);
> > > +
> > >  static int __pkvm_create_mappings(unsigned long start, unsigned long size,
> > >  				  unsigned long phys, enum kvm_pgtable_prot prot)
> > >  {
> > > @@ -212,6 +219,93 @@ int hyp_map_vectors(void)
> > >  	return 0;
> > >  }
> > >  
> > > +void *hyp_fixmap_map(phys_addr_t phys)
> > > +{
> > > +	struct hyp_fixmap_slot *slot = this_cpu_ptr(&fixmap_slots);
> > > +	kvm_pte_t pte, *ptep = slot->ptep;
> > > +
> > > +	pte = *ptep;
> > > +	pte &= ~kvm_phys_to_pte(KVM_PHYS_INVALID);
> > > +	pte |= kvm_phys_to_pte(phys) | KVM_PTE_VALID;
> > > +	WRITE_ONCE(*ptep, pte);
> > > +	dsb(nshst);
> > > +
> > > +	return (void *)slot->addr;
> > > +}
> > > +
> > > +static void fixmap_clear_slot(struct hyp_fixmap_slot *slot)
> > > +{
> > > +	kvm_pte_t *ptep = slot->ptep;
> > > +	u64 addr = slot->addr;
> > > +
> > > +	WRITE_ONCE(*ptep, *ptep & ~KVM_PTE_VALID);
> > > +	dsb(nshst);
> > > +	__tlbi_level(vale2, __TLBI_VADDR(addr, 0), (KVM_PGTABLE_MAX_LEVELS - 1));
> > > +	dsb(nsh);
> > > +	isb();
> > > +}
> > 
> > Does each CPU have independent Stage-1 tables at EL2? i.e. each has a
> > distinct root table?
> 
> No, the CPUs share the same stage-1 table at EL2.

Ah, then I think there's a problem here.

> > If the tables are shared, you need broadcast maintenance and ISH barriers
> > here, or you risk the usual issues with asynchronous MMU behaviour.
> 
> Can you elaborate a bit, please? What we're trying to do is reserve a page
> of VA space for each CPU, which is only ever accessed explicitly by that
> CPU using a normal memory mapping. The fixmap code therefore just updates
> the relevant leaf entry for the CPU on which we're running and the TLBI
> is there to ensure that the new mapping takes effect.
> 
> If another CPU speculatively walks another CPU's fixmap slot, then I agree
> that it could access that page after the slot had been cleared. Although
> I can see theoretical security arguments around avoiding that situation,
> there's a very real performance cost to broadcast invalidation that we
> were hoping to avoid on this fast path.

The issue is that any CPU could walk any of these entries at any time for
any reason, and without broadcast maintenance we'd be violating the
Break-Before-Make requirements.

That permits a number of things, including "amalgamation", which would
permit the CPU to consume some arbitrary function of the old+new entries.
Among other things, that can permit accesses to entirely bogus physical
addresses that weren't in either entry (e.g. making speculative accesses
to arbitrary device addresses).

For correctness, you need the maintenance to be broadcast to all PEs which
could observe the old and new entries.

> Of course, in the likely event that I've purged "the usual issues" from
> my head and we need broadcasting for _correctness_, then we'll just have
> to suck it up!

As above, I believe you need this for correctness.

I'm not sure if FEAT_BBM level 2 gives you the necessary properties to
relax this on some HW.

Thanks,
Mark.
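[Editorial sketch, not part of the thread: the broadcast variant Mark is asking for would look roughly like the following rework of the quoted fixmap_clear_slot(). This is an untested illustration against kernel-internal helpers, not a posted patch; the change is simply the Inner Shareable barrier domains and the broadcast VALE2IS TLBI in place of the CPU-local NSH/VALE2 ones.]

```c
/*
 * Hypothetical broadcast version of fixmap_clear_slot() from the quoted
 * patch. With a stage-1 table shared by all CPUs at EL2, Break-Before-Make
 * requires that every PE which could walk the old entry observes the
 * invalidation, so the barriers move from the non-shareable to the
 * Inner Shareable domain and the TLBI becomes the broadcast "is" variant.
 */
static void fixmap_clear_slot(struct hyp_fixmap_slot *slot)
{
	kvm_pte_t *ptep = slot->ptep;
	u64 addr = slot->addr;

	WRITE_ONCE(*ptep, *ptep & ~KVM_PTE_VALID);
	dsb(ishst);	/* make the now-invalid PTE visible to all walkers */
	__tlbi_level(vale2is, __TLBI_VADDR(addr, 0),
		     (KVM_PGTABLE_MAX_LEVELS - 1));
	dsb(ish);	/* wait for the broadcast invalidation to complete */
	isb();
}
```

This is exactly the performance trade-off Will raises: the broadcast TLBI and ISH barriers are what the original code was trying to avoid on the fast path.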