On Fri Sep 27, 2024 at 8:39 PM EEST, Lorenzo Stoakes wrote:
> Jumping into this thread mid-way to give my point of view.
>
> On Thu, Sep 26, 2024 at 12:07:02PM GMT, Huang, Kai wrote:
> >
> >
> > On 23/09/2024 7:48 pm, Jarkko Sakkinen wrote:
> > > On Sun Sep 22, 2024 at 7:57 PM EEST, Jarkko Sakkinen wrote:
> > > > > On Sun Sep 22, 2024 at 7:27 PM EEST, Jarkko Sakkinen wrote:
> > > > > > Hi
> > > > > >
> > > > > > I started to look into this old issue with the mm subsystem and SGX, i.e.
> > > > > > can we make SGX VMAs merge together?
> > > > > >
> > > > > > This demonstrates the problem pretty well:
> > > > > >
> > > > > > https://lore.kernel.org/linux-sgx/884c7ea454cf2eb0ba2e95f7c25bd42018824f97.camel@xxxxxxxxxx/
> > > > > >
> > > > > > It was the result of the brk() syscall being applied a few times.
> > > > Bringing some context here. This can be fixed in the run-time by book
> > > > keeping the ranges and doing unmapping/mapping. I guess this goes
> > > > beyond what mm should support?
>
> The reason you're seeing this is that the ranges are VM_PFNMAP (as well as
> VM_IO, VM_DONTEXPAND), which are part of the VM_SPECIAL bitmask and
> therefore explicitly not permitted to merge.
>
> VMA merging occurs not when VMAs are merely adjacent to one another, but
> when they are adjacent to one another AND share the same attributes.
>
> Obviously for most VMA attributes this is critically important - you can't
> have a read-only range merge with a r/w range, it is the VMAs that are
> telling us this, equally so if different files/non-adjacent
> offsets/etc. etc.
>
> For these 'special' mappings each individual VMA may have distinct private
> state and may be tracked/managed by the driver, and indeed I see that
> vma->vm_private_data is used.
>
> Also SGX utilises custom VMA handling operations for fault handling and
> obviously does its own thing.
>
> Because of this, there's just no way mm can know whether it's ok to merge
> or not. Some of the attributes become in effect 'hidden'. You could map
> anything in these ranges and track that with any kind of state.
>
> So it's absolutely correct that we do not merge in these cases, as things
> currently stand.
>
> > >
> > > > I thought to plain check this as it has been two years since my last
> > > > query on the topic (if we could improve either the driver or mm somehow).
> > >
> > > In the past I've substituted the kernel's mm merge code with a user space
> > > replacement:
> > >
> > > https://github.com/enarx/mmledger/blob/main/src/lib.rs
> > >
> > > It's essentially a reimplementation of all the stuff that goes into
> > > mm/mmap.c's vma_merge(). I cannot recall anymore whether merges
> > > which map over existing ranges were working correctly, i.e. was
> > > the issue only concerning adjacent VMAs.
>
> mm/vma.c's vma_merge_existing_range() and vma_merge_new_range() now :) I
> have completely rewritten this code from 6.12 onwards.
>
> > >
> > > What I'm looking at here is whether we can make some constraints that,
> > > if satisfied by the pfnmap code, would let it leverage the code from
> > > vma_merge(). Perhaps by making an explicit call to vma_merge()?
> > > I get that implicit use moves too much responsibility to the mm
> > > subsystem.
>
> Merging/splitting behaviour is an implementation detail and absolutely
> cannot be exposed like this (and would be dangerous to do so).
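Thanks, this is a useful summary. As a quick sanity check on my side that
the non-merging really is particular to the VM_SPECIAL/pfnmap path, and not
something odd in Enarx's mmap pattern, something like the following little
demo (plain anonymous mappings, nothing SGX in it) illustrates the rule you
describe: adjacent pieces with identical attributes collapse into a single
VMA in /proc/self/maps.

  /* Nothing SGX-specific: reserve a window, then map three adjacent
   * anonymous r/w pieces into it and look at the process' maps. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t page = (size_t)sysconf(_SC_PAGESIZE);

          /* Reserve a window so the pieces are guaranteed to be adjacent. */
          char *base = mmap(NULL, 4 * page, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (base == MAP_FAILED)
                  return 1;

          /* Three adjacent pieces with identical attributes: one VMA results. */
          for (int i = 0; i < 3; i++)
                  if (mmap(base + i * page, page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
                          return 1;

          char cmd[64];
          snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", (int)getpid());
          printf("window: %p-%p\n", (void *)base, (void *)(base + 3 * page));
          return system(cmd);
  }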
I can believe that (not unexpected). I've been doing some initial research
into the possibility of integrating Enarx (an LF-governed confidential
computing run-time) with https://github.com/rust-vmm/vm-memory, which is a
pretty solid mmap abstraction with at least Paolo Bonzini as an active
contributor. Enarx is multi-backend, supporting VM-based private memory in
addition to SGX. It has some degraded parts, at least the SNP interfacing
code, that need to be fixed, so I'm now just scoping what needs to be done
to make it fresh again. It is a leisure-time sudoku: recovering a rusty
codebase whenever I have some idle time.

>
> However, I think the only sensible way we could proceed with something like
> this is to add a new vma operation explicitly for pfnmap mappings which are
> _otherwise mergeable_ like:
>
> vm_ops->may_merge_pfnmap(struct vm_area_struct *,
>                          struct vm_area_struct *)
>
> Which the driver could implement to check internal state matches between
> the two VMAs.
>
> It's nicely opt-in as we'd not merge if this were not set, and it defers
> the decision as to 'hidden attributes' to the driver. Also we could ensure
> all other merge characteristics were satisfied first to avoid any invalid
> merging between VMAs - in fact this function would only be called if:
>
> 1. Both VMAs are VM_PFNMAP.
> 2. Both VMAs are otherwise compatible.
> 3. Both VMAs implement vm_ops->may_merge_pfnmap() (should be implied by 2,
>    as vm_ops equality implies file equality).
> 4. vm_ops->may_merge_pfnmap(vma1, vma2) returns true.

OK, strong maybe; see what I wrote below.

>
> At which point we could merge. Note that the existence of .close() could
> cause issues here, though the existing merging rules should handle this
> correctly.
>
> I _suspect_ from a brief reading of the SGX stuff you only really need the
> enclave value to be the same?
>
> So it could be something like:
>
> static bool sgx_may_merge_pfnmap(struct vm_area_struct *vma1,
>                                  struct vm_area_struct *vma2)
> {
>         /* Merge if enclaves match. */
>         return vma1->vm_private_data == vma2->vm_private_data;
> }
>
> Since you seem to be doing mappings explicitly based on virtual addresses
> in these ranges?

Yep, so this is how it works, now that I revisited the code a bit:

1. An enclave is a naturally aligned, static address range of pow2 size.
   That is mapped first and everything happens inside it. It is mapped with
   no permissions. With the opcode ENCLS[ECREATE] this range is connected
   to a single enclave.

2. The enclave, or actually even an SNP VM, then internally has a small
   engine and database that implements the mmap() call and calls back to
   the outside world with a MAP_FIXED range.

So my mmledger book keeper is needed in all cases, and it is needed to
define what is expected from the run-time. It must happen this way because
of the distrust model. Without going into too much detail, enclaves can set
requirements for page permissions cryptographically and have a structure
(the enclave page cache map) for this.

Thus the enclave (or SNP VM) leads, and the run-time follows by delegating
(fixed) mmap() calls. In practice the enclave will #GP if a memory access
is made with an unexpected permission, even if the page tables would allow
it. In theory you could even just plain rwx map the range and rely only on
the EPCM, but in practice matching the page tables is useful and adds a bit
of defense in depth.

With SNP things obviously work correctly because it is just a fancy VM with
weird extra shenanigans.
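To make the "run-time follows" part a bit more concrete, here is a minimal
userspace sketch of what the untrusted side ends up doing. The names
(reserve_elrange, follow_enclave_mmap) are made up for illustration, and
anonymous memory stands in for the real backing, which of course goes
through the SGX device fd:

  #include <stdint.h>
  #include <stddef.h>
  #include <sys/mman.h>

  /* Reserve a naturally aligned, pow2-sized range with no permissions. */
  static void *reserve_elrange(size_t size)
  {
          /* Over-allocate so that an aligned start is guaranteed to exist. */
          uint8_t *raw = mmap(NULL, 2 * size, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
          if (raw == MAP_FAILED)
                  return NULL;

          uintptr_t base = ((uintptr_t)raw + size - 1) & ~(size - 1);

          /* Trim the slack before and after the aligned window. */
          if (base > (uintptr_t)raw)
                  munmap(raw, base - (uintptr_t)raw);
          if (base + size < (uintptr_t)raw + 2 * size)
                  munmap((void *)(base + size),
                         (uintptr_t)raw + 2 * size - (base + size));

          return (void *)base;
  }

  /* The enclave asked for [offset, offset + len) with @prot; just follow it. */
  static int follow_enclave_mmap(void *elrange, size_t offset, size_t len, int prot)
  {
          void *p = mmap((uint8_t *)elrange + offset, len, prot,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
          return p == MAP_FAILED ? -1 : 0;
  }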
With SGX things are fixed up by allocating a static heap [1] and never
translating brk/sbrk to any syscall. Then for regular mmap()s it, I guess,
just hopes that user space software uses it wisely :-)

>
> I could look into RFC'ing something for you guys to test when I get a
> chance?

So I got some ideas I could try out after reading your proposal and
restudying the work I did over two years ago :-) I think it makes sense for
me to try to see how much I can improve first.

With pfnmap, what happens when you have a fixed map, let's say from A to B,
and you then map from A to C where C > B? I cannot recall whether this
succeeds.

Right, and as said, vm_close() does not exist. That could be used, with
mmledger's book keeping, to overwrite the old VMA. Or I could do pro-active
unmapping. mmledger could tell "what changed", i.e. provide syscall
parameters for 0-2 munmaps plus an mmap(MAP_FIXED), and ask the run-time to
execute those instead of just the mmap(MAP_FIXED).

I need to improve that crate anyway because I realized that you need
something like it for any trusted execution environment, since they reason
about what they want from memory :-) So I'll get on this first and come
back to this lore thread later when I have some more experience on the
topic.

> I'm guessing the problem is a concern that a map limit will be hit or VMA
> memory usage will be a problem? If not it may simply seem ugly...

Yea, so Enarx can take any software compiled into a static wasm program and
put it into a protected environment. It has a syscall shim, a wasm run-time
and a loader when it launches. If the application uses mm syscalls like a
good citizen it should not blow up the VMA list, but that ofc is based on
good faith ;-)

[1] Only for completeness:
https://github.com/enarx/enarx/blob/main/crates/shim-sgx/src/heap.rs

BR, Jarkko
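PS: to make the "what changed" idea above a bit more concrete, this is
roughly the shape of what the book keeper could hand back to the run-time.
The types are hypothetical, not the current mmledger API, and anonymous
memory again stands in for the real enclave backing:

  #include <stdint.h>
  #include <stddef.h>
  #include <sys/mman.h>

  /* Hypothetical "what changed" result: at most two unmaps plus one fixed map. */
  struct map_delta {
          struct { uintptr_t addr; size_t len; } unmap[2];
          int nr_unmap;
          uintptr_t map_addr;
          size_t map_len;
          int map_prot;
  };

  /* The run-time executes the delta instead of a bare mmap(MAP_FIXED). */
  static int apply_map_delta(const struct map_delta *d)
  {
          for (int i = 0; i < d->nr_unmap; i++)
                  if (munmap((void *)d->unmap[i].addr, d->unmap[i].len))
                          return -1;

          void *p = mmap((void *)d->map_addr, d->map_len, d->map_prot,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
          return p == MAP_FAILED ? -1 : 0;
  }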