On Fri, 23 Jun 2023 16:24:20 +0100,
Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>
> (cc Marc and Quentin)
>
> On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
> <linux@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > Are there any comments on this?
> >
>
> Hi Russell,
>
> I think the proposed approach is sound, but it is rather intrusive, as
> you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
> work gets merged (which uses root level -1 when booted on LPA2 capable
> hardware, and level 0 otherwise), we'll have yet another combination
> that is either fully incompatible, or cumbersome to support at the
> very least.
>
> I wonder if it would be worthwhile to explore an alternative approach,
> using pKVM and the host stage2:
>
> - all stage1 kernel mappings remain as they are, and the kernel code
>   running at EL1 has no awareness of the replication beyond being
>   involved in allocating the memory;
> - host is booted in protected KVM mode, which means that the host
>   kernel executes under a stage 2 mapping;
> - each NUMA node has its own set of stage 2 page tables, and maps the
>   kernel's code/rodata IPA range to a NUMA local PA range
> - the kernel's code and rodata are mapped read-only in the primary
>   stage-2 mapping so updates trap to EL2, permitting the hypervisor to
>   replicate those update to all clones.
>
> Note that pKVM retains the capabilities of ordinary KVM, so as long as
> you boot at EL2, the only downside compared to your approach would be
> the increased TLB footprint due to the stage 2 mappings for the host
> kernel.
>
> Marc, Quentin, Will: any thoughts?

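For anyone following along, the scheme quoted above can be pictured with
a toy model. This is only a sketch under assumptions, not pKVM code: the
class, its methods, the page size and the addresses are all made up for
illustration. It shows one stage-2 table per NUMA node mapping the same
IPA range to node-local PAs, and a trapped write being replayed on every
clone.

```python
PAGE = 4096  # toy page size, chosen for illustration only


class HostStage2:
    """Hypothetical model: one stage-2 table per NUMA node."""

    def __init__(self, node_pa_base, text_ipa, text_pages):
        # Per-node table: IPA page -> node-local PA page.
        # Assumes page-aligned, non-overlapping PA ranges per node.
        self.tables = {
            node: {text_ipa + i * PAGE: base + i * PAGE
                   for i in range(text_pages)}
            for node, base in node_pa_base.items()
        }
        # One private backing copy of each page per node, keyed by PA.
        self.mem = {pa: bytearray(PAGE)
                    for table in self.tables.values()
                    for pa in table.values()}

    def translate(self, node, ipa):
        # Same IPA, different PA depending on which node's table is used.
        offset = ipa % PAGE
        return self.tables[node][ipa - offset] + offset

    def handle_write_trap(self, ipa, data):
        # Stand-in for the EL2 handler: the range is read-only for the
        # host, so the write is applied here to every node's clone.
        # Assumes the write does not cross a page boundary.
        for node in self.tables:
            pa = self.translate(node, ipa)
            offset = pa % PAGE
            self.mem[pa - offset][offset:offset + len(data)] = data

    def read(self, node, ipa, length):
        pa = self.translate(node, ipa)
        offset = pa % PAGE
        return bytes(self.mem[pa - offset][offset:offset + length])


# Two nodes, two pages of "kernel text" at a made-up IPA.
s2 = HostStage2({0: 0x1_0000_0000, 1: 0x9_0000_0000},
                text_ipa=0x8000_0000, text_pages=2)
# A 4-byte patch (here the AArch64 NOP encoding, little-endian).
s2.handle_write_trap(0x8000_0004, bytes.fromhex("1f2003d5"))
```

The point of the model is that EL1 only ever sees the IPA; which physical
copy backs it is decided by the table installed for the node.
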
I like the idea, though there are a couple of 'interesting' corner
cases:

- you have to give up VHE, which means that if your workload is to
  mainly run VMs, you pay an extra cost on each guest entry/exit

- the EL2 code doesn't have the luxury of a stage-2, meaning that
  either you accept the fact that this code is going to suffer from
  uneven performance, or you keep the complexity of the kernel-visible
  replication for the EL2 code only

- memory allocation for the stage-2 is tricky (Quentin can talk about
  that), and relies on being able to steal enough memory to cover the
  whole of the host's memory-map, including I/O. Having a set of S2
  PTs per node is going to increase that pressure/complexity

- I'm not too worried about the TLB aspect. Cores tend to cache VA/PA,
  not VA/IPA+IPA/PA. What is going to cost is the walk itself. This
  could be mitigated if S2 uses large mappings (possibly using 64k
  pages).

The last point makes me think that what this approach needs may not be
pKVM itself, but something that builds on top of what pKVM has (host
S2) and the nVHE/hVHE behaviour.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.