Re: [PATCH RFC 00/17] arm64 kernel text replication

Marc Zyngier <maz@xxxxxxxxxx> · Fri, 23 Jun 2023 17:37:07 +0100

On Fri, 23 Jun 2023 16:24:20 +0100,
Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
> 
> (cc Marc and Quentin)
> 
> On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
> <linux@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > Are there any comments on this?
> >
> 
> Hi Russell,
> 
> I think the proposed approach is sound, but it is rather intrusive, as
> you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
> work gets merged (which uses root level -1 when booted on LPA2 capable
> hardware, and level 0 otherwise), we'll have yet another combination
> that is either fully incompatible, or cumbersome to support at the
> very least.
> 
> I wonder if it would be worthwhile to explore an alternative approach,
> using pKVM and the host stage2:
> 
> - all stage1 kernel mappings remain as they are, and the kernel code
> running at EL1 has no awareness of the replication beyond being
> involved in allocating the memory;
> - host is booted in protected KVM mode, which means that the host
> kernel executes under a stage 2 mapping;
> - each NUMA node has its own set of stage 2 page tables, and maps the
> kernel's code/rodata IPA range to a NUMA local PA range
> - the kernel's code and rodata are mapped read-only in the primary
> stage-2 mapping so updates trap to EL2, permitting the hypervisor to
> replicate those updates to all clones.
> 
> Note that pKVM retains the capabilities of ordinary KVM, so as long as
> you boot at EL2, the only downside compared to your approach would be
> the increased TLB footprint due to the stage 2 mappings for the host
> kernel.
> 
> Marc, Quentin, Will: any thoughts?

I like the idea, though there are a couple of 'interesting' corner
cases:

- you have to give up VHE, which means that if your workload is to
  mainly run VMs, you pay an extra cost on each guest entry/exit

- the EL2 code doesn't have the luxury of a stage-2, meaning that
  either you accept the fact that this code is going to suffer from
  uneven performance, or you keep the complexity of the kernel-visible
  replication for the EL2 code only

- memory allocation for the stage-2 is tricky (Quentin can talk about
  that), and relies on being able to steal enough memory to cover the
  whole of the host's memory-map, including I/O. Having a set of S2
  PTs per node is going to increase that pressure/complexity

- I'm not too worried about the TLB aspect. Cores tend to cache VA/PA,
  not VA/IPA+IPA/PA. What is going to cost is the walk itself. This
  could be mitigated if S2 uses large mappings (possibly using 64k
  pages).
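
To put a rough number on the walk cost, here is a back-of-the-envelope
sketch. It assumes the worst case, with no walk caches and nothing
cached in the TLB, and the level counts are illustrative rather than
taken from any particular configuration. The function name is made up
for this example. Each stage-1 table fetch needs its own stage-2 walk
to translate the table's IPA, and the final output IPA needs one more:

```python
def nested_walk_accesses(s1_levels: int, s2_levels: int) -> int:
    """Worst-case memory accesses for one two-stage translation.

    Assumes no walk caches and no TLB hits (a deliberate pessimisation).
    """
    # Each stage-1 descriptor fetch costs a full stage-2 walk plus the
    # fetch of the descriptor itself.
    per_s1_level = s2_levels + 1
    # The IPA produced by stage 1 needs one last stage-2 walk.
    return s1_levels * per_s1_level + s2_levels


if __name__ == "__main__":
    # Illustrative level counts only.
    for s1, s2 in [(4, 0), (4, 4), (4, 3), (4, 2)]:
        print(f"S1 levels={s1} S2 levels={s2}: "
              f"{nested_walk_accesses(s1, s2)} accesses")
```

Fewer S2 levels, which is what larger S2 mappings buy you, shrink the
total quickly, since the S2 depth is paid once per S1 level. Real
hardware will do much better than these figures thanks to walk caches.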

The last point makes me think that what this approach needs may not be pKVM
itself, but something that builds on top of what pKVM has (host S2)
and the nVHE/hVHE behaviour.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.