Jumping into this thread mid-way to give my point of view.

On Thu, Sep 26, 2024 at 12:07:02PM GMT, Huang, Kai wrote:
> On 23/09/2024 7:48 pm, Jarkko Sakkinen wrote:
> > On Sun Sep 22, 2024 at 7:57 PM EEST, Jarkko Sakkinen wrote:
> > > On Sun Sep 22, 2024 at 7:27 PM EEST, Jarkko Sakkinen wrote:
> > > > Hi
> > > >
> > > > I started to look into this old issue with the mm subsystem and
> > > > SGX, i.e. can we make SGX VMAs merge together?
> > > >
> > > > This demonstrates the problem pretty well:
> > > >
> > > > https://lore.kernel.org/linux-sgx/884c7ea454cf2eb0ba2e95f7c25bd42018824f97.camel@xxxxxxxxxx/
> > > >
> > > > It was the result of the brk() syscall being applied a few times.
> > >
> > > Bringing some context here. This can be fixed in the runtime by
> > > bookkeeping the ranges and doing unmapping/mapping. I guess this
> > > goes beyond what mm should support?

The reason you're seeing this is that the ranges are VM_PFNMAP (as well
as VM_IO and VM_DONTEXPAND), which are part of the VM_SPECIAL bitmask and
therefore explicitly not permitted to merge.

VMA merging occurs not when VMAs are merely adjacent to one another, but
when they are adjacent to one another AND share the same attributes.

Obviously for most VMA attributes this is critically important - you
can't have a read-only range merge with a r/w range; it is the VMAs that
are telling us this, and equally so if they reference different files,
non-adjacent offsets, etc.

For these 'special' mappings each individual VMA may have distinct
private state and may be tracked/managed by the driver, and indeed I see
that vma->vm_private_data is used. SGX also utilises custom VMA
operations for fault handling and obviously does its own thing.

Because of this, there's just no way mm can know whether it's OK to merge
or not. Some of the attributes become, in effect, 'hidden' - you could
map anything in these ranges and track that with any kind of state.
So it's absolutely correct that we do not merge in these cases, as things
currently stand.

> > > > I thought to plainly check this as it has been two years since my
> > > > last query on the topic (if we could improve either the driver or
> > > > mm somehow).
> > >
> > > In the past I've substituted the kernel's mm merge code with a user
> > > space replacement:
> > >
> > > https://github.com/enarx/mmledger/blob/main/src/lib.rs
> > >
> > > It's essentially a reimplementation of all the stuff that goes into
> > > mm/mmap.c's vma_merge(). I cannot recall anymore whether merges
> > > which map over existing ranges were working correctly, i.e. was the
> > > issue only concerning adjacent VMAs.

mm/vma.c's vma_merge_existing_range() and vma_merge_new_range() now :) I
have completely rewritten this code from 6.12 onwards.

> > What I'm looking at here is whether we can come up with some
> > constraints that, if satisfied by the pfnmap code, would let it
> > leverage the code from vma_merge(). Perhaps by making an explicit
> > call to vma_merge()? I get that implicit use moves too much
> > responsibility to the mm subsystem.

Merging/splitting behaviour is an implementation detail and absolutely
cannot be exposed like this (and it would be dangerous to do so).

However, I think the only sensible way we could proceed with something
like this is to add a new VMA operation explicitly for pfnmap mappings
which are _otherwise mergeable_, like:

  vm_ops->may_merge_pfnmap(struct vm_area_struct *, struct vm_area_struct *)

Which the driver could implement to check that internal state matches
between the two VMAs.

It's nicely opt-in, as we'd not merge if this were not set, and it
defers the decision as to 'hidden attributes' to the driver.

We could also ensure all other merge characteristics were satisfied
first to avoid any invalid merging between VMAs - in fact this function
would only be called if:

1. Both VMAs are VM_PFNMAP.
2. Both VMAs are otherwise compatible.
3. Both VMAs implement vm_ops->may_merge_pfnmap() (should be implied by
   2, as vm_ops equality implies file equality).
4. vm_ops->may_merge_pfnmap(vma1, vma2) returns true.

At which point we could merge.

Note that the existence of .close() could cause issues here, though the
existing merging rules should handle this correctly.

I _suspect_ from a brief reading of the SGX stuff you only really need
the enclave value to be the same? So it could be something like:

static bool sgx_may_merge_pfnmap(struct vm_area_struct *vma1,
				 struct vm_area_struct *vma2)
{
	/* Merge if enclaves match. */
	return vma1->vm_private_data == vma2->vm_private_data;
}

Since you seem to be doing mappings explicitly based on virtual
addresses in these ranges?

I could look into RFC'ing something for you guys to test when I get a
chance.

> Hi Jarkko,
>
> Just want to understand more on the background:
>
> Are you seeing any real problem due to needing a lot of mmap()s to the
> same enclave, or is it just a problem that doesn't look nice and you
> want to resolve?
>
> I mean, this problem doesn't seem to be SGX-specific but a common one
> for VMAs with VM_PFNMAP (any bit in VM_SPECIAL), e.g. from random
> device drivers with mmap() support. We will need a good justification
> if we want to make any core-mm change, if any, for this.

I'm guessing the problem is a concern that a map limit will be hit or
VMA memory usage will be a problem? If not it may simply seem ugly...