On Mon, Sep 30, 2024 at 01:36:06AM GMT, Jarkko Sakkinen wrote: > On Fri Sep 27, 2024 at 8:39 PM EEST, Lorenzo Stoakes wrote: > > Jumping into this thread mid-way to give my point of view. > > > > On Thu, Sep 26, 2024 at 12:07:02PM GMT, Huang, Kai wrote: > > > > > > > > > On 23/09/2024 7:48 pm, Jarkko Sakkinen wrote: > > > > On Sun Sep 22, 2024 at 7:57 PM EEST, Jarkko Sakkinen wrote: > > > > > > On Sun Sep 22, 2024 at 7:27 PM EEST, Jarkko Sakkinen wrote: > > > > > > > Hi > > > > > > > > > > > > > > I started to look into this old issue with mm subsystem and SGX, i.e. > > > > > > > can we make SGX VMA's to merge together? > > > > > > > > > > > > > > This demonstrates the problem pretty well: > > > > > > > > > > > > > > https://lore.kernel.org/linux-sgx/884c7ea454cf2eb0ba2e95f7c25bd42018824f97.camel@xxxxxxxxxx/ > > > > > > > > > > > > > > It was result of brk() syscall being applied a few times. > > > > > > > > > > Briging some context here. This can be fixed in the run-time by book > > > > > keeping the ranges and doing unmapping/mapping. I guess this goes > > > > > beyond what mm should support? > > > > The reason you're seeing this is that the ranges are VM_PFNMAP (as well as > > VM_IO, VM_DONTEXPAND) , which are part of the VM_SPECIAL bitmask and > > therefore explicitly not permitted to merge. > > > > VMA merging occurs not when VMAs are merely adjacent to one another, but > > when they are adjacent to one another AND share the same attributes. > > > > Obviously for most VMA attributes this is critically important - you can't > > have a read-only range merge with a r/w range, it is the VMAs that are > > telling us this, equally so if different files/non-adjacent > > offsets/etc. etc. > > > > For these 'special' mappings each individual VMA may have distinct private > > state and may be tracked/managed by the driver, and indeed I see that > > vma->vm_private_data is used. > > > > Also SGX utilises custom VMA handling operations for fault handling and > > obviously does its own thing. > > > > Because of this, there's just no way mm can know whether it's ok to merge > > or not. Some of the attributes become in effect 'hidden'. You could map > > anything in these ranges and track that with any kind of state. > > > > So it's absolutely correct that we do not merge in these cases, as things > > currently stand. > > > > > > > > > > > > I thought to plain check this as it has been two years since my last > > > > > query on topic (if we could improve either the driver or mm somehow). > > > > > > > > In the past I've substituted kernel's mm merge code with user space > > > > replacement: > > > > > > > > https://github.com/enarx/mmledger/blob/main/src/lib.rs > > > > > > > > It's essentially a reimplementation of al stuff that goes into > > > > mm/mmap.c's vma_merge(). I cannot recall anymore whether merges > > > > which map over existing ranges were working correctly, i.e. was > > > > the issue only concerning adjacent VMA's. > > > > mm/vma.c's vma_merge_existing_range() and vma_merge_new_range() now :) I > > have completely rewritten this code from 6.12 onwards. > > > > > > > > > > What I'm looking here is that can we make some cosntraints that > > > > if satisfied by the pfnmap code, it could leverage the code from > > > > vma_merge(). Perhaps by making explicit call to vma_merge()? > > > > I get that implicit use moves too much responsibility to the mm > > > > subsystem. > > > > Merging/splitting behaviour is an implementation detail and absolutely > > cannot be exposed like this (and would be dangerous to do so). > > I can believe that (not unexpected) > > I've been doing some initial research for possibility to integrate Enarx > (LF governed confidential computing run-time) with > https://github.com/rust-vmm/vm-memory, which is pretty solid mmap > abstraction, for which at least Paolo Bonzini is active contributor. > > Enarx is multi-backend supporting also VM based private memory in > addition SGX. It has some a bit degraded parts at least SNP interfacing > code that need to be fixed. So I'm now just scoping what needs to be > done make it fresh again. It is just a leisure time sudoku just to > recover a rusty codebase whenever have some idle time. > > > > > However, I think the only sensible way we could proceed with something like > > this is to add a new vma operation explicitly for pfnmap mappings which are > > _otherwise mergeable_ like: > > > > vm_ops->may_merge_pfnmap(struct vm_area_struct *, > > struct vm_area_struct *) > > > > Which the driver could implement to check internal state matches between > > the two VMAs. > > > > It's nicely opt-in as we'd not merge if this were not set, and it defers > > the decision as to 'hidden attributes' to the driver. Also we could ensure > > all other merge characteristics were satisfied first to avoid any invalid > > merging between VMAs - in fact this function would only be called if: > > > > 1. Both VMAs are VM_PFNMAP. > > 2. Both VMAs are otherwise compatible. > > 3. Both VMAs implement vm_ops->may_merge_pfnmap() (should be implied by 2, > > as vm_ops equality implies file equality). > > 4. vm_ops->may_merge_pfnmap(vma1, vma2) returns true. > > OK strong maybe see what I wrote below. > > > > > At which point we could merge. Note that the existence of .close() could > > cause issues here, though the existing merging rules should handle this > > correctly. > > > > I _suspect_ from a brief reading of the SGX stuff you only really need the > > enclave value to be the same? > > > > So it could be something like: > > > > static bool sgx_may_merge_pfnmap(struct vma_area_struct *vma1, > > struct vma_area_struct *vma2) > > { > > /* Merge if enclaves match. */ > > return vma1->vm_private_data == vma2->vm_private_data; > > } > > > > Since you seem to be doing mappings explicitly based on virtual addresses > > in these ranges? > > Yep, so this is how it works now that I revisited the cod ea bit: > > 1. Enclave is a naturally aligned static address range of pow2 size. > That is mapped first and everything happens inside. It is mapped > with none permissions. With opcode ENCLS[ECREATE] this range is > connected to a singe enclave. > 2. With enclave or actually even SNP VM then internally has small > engine and database that implements mmap() call and calls back > to the outside world with a MAP_FIXED range. So actually in all > cases my mmledger book keeper is in all cases needed. And it > is needed to define what is expected from run-time. > > It must happen this way because of a distrust model. Without > going too much details enclaves can set requirements for page > permissions cryptographically and have structure (enclave > page cache map) for this. Thus enclave (or SNP VM) leads, > run-time follows that by delegating (fixed) mmap() calls, > > In practice the enclave will #GP if memory access is made > with unexpected permission, even if page tables would allow > it. In theory you could even just plain rwx map the range > and base only on EPCM but in practice matching the page > tables is useful and adds a bit defense in depth. > > With SNP things obviously work correctly because it is just a > fancy VM with weird extra shenanigans. > > With SGX things are fixed up by allocating a static heap and > never translating brk/sbrk to any syscall. Then for regular > mmaps() it just I guess hopes that user space software uses > it wisely :-) > > > > > I could look into RFC'ing something for you guys to test when I get a > > chance? > > So I got some ideas I could try out after reading your proposal and > restudying work I did over two years ago :-) I think it makes sense > for me to try to see how much I can improve first. > > With pfnmap what did happen when you have a fixed map let's say > from A to B and you map from A to C where C > B? Cannot recall does > this success. Right, and as said vm_close() does not exist. > > That could be used based mmledger to overwrite the old VMA. Or I > could do pro-active unmapping. > > mmledger could tell "what changed" i.e. provide syscall parameters for > 0-2 munmaps and mmap(MAP_FIXED) and ask run-time to execute those > instead of just mmap(MAP_FIXED) > > I need to improve that crate anyway because I realized that you need > something like that to any trusted execution environment because they > reason what they want from memory :-) > > So I get on this first and come back to this lore thread later when > I have some more experience on the topic. OK cool let's revisit this when you've done that. I think if you did want assistance from VMA merging what I delineated above is the only sensible way forward, afaict. > > > I'm guessing the problem is a concern that a map limit will be hit or VMA > > memory usage will be a problem? If not it may simply seem ugly... > > Yea, so Enarx can take any software compiled to static wasm program and > put into a protected environment. It has a syscall shim and wasm > run-time and loader when it launches. If the application uses mm > syscalls like a good citizen it should not blow up VMA list but > that ofc is based on good faith ;-) > > [1] Only for completeness: > https://github.com/enarx/enarx/blob/main/crates/shim-sgx/src/heap.rs > > BR, Jarkko