On Fri Sep 27, 2024 at 8:39 PM EEST, Lorenzo Stoakes wrote:
> Jumping into this thread mid-way to give my point of view.
>
> On Thu, Sep 26, 2024 at 12:07:02PM GMT, Huang, Kai wrote:
> >
> >
> > On 23/09/2024 7:48 pm, Jarkko Sakkinen wrote:
> > > On Sun Sep 22, 2024 at 7:57 PM EEST, Jarkko Sakkinen wrote:
> > > > > On Sun Sep 22, 2024 at 7:27 PM EEST, Jarkko Sakkinen wrote:
> > > > > > Hi
> > > > > >
> > > > > > I started to look into this old issue with the mm subsystem and SGX, i.e.
> > > > > > can we make SGX VMAs merge together?
> > > > > >
> > > > > > This demonstrates the problem pretty well:
> > > > > >
> > > > > > https://lore.kernel.org/linux-sgx/884c7ea454cf2eb0ba2e95f7c25bd42018824f97.camel@xxxxxxxxxx/
> > > > > >
> > > > > > It was the result of the brk() syscall being applied a few times.
> > > > Bringing some context here. This can be fixed in the run-time by book
> > > > keeping the ranges and doing unmapping/mapping. I guess this goes
> > > > beyond what mm should support?
>
> The reason you're seeing this is that the ranges are VM_PFNMAP (as well as
> VM_IO, VM_DONTEXPAND), which are part of the VM_SPECIAL bitmask and
> therefore explicitly not permitted to merge.
>
> VMA merging occurs not when VMAs are merely adjacent to one another, but
> when they are adjacent to one another AND share the same attributes.
>
> Obviously for most VMA attributes this is critically important - you can't
> have a read-only range merge with a r/w range, it is the VMAs that are
> telling us this, equally so if different files/non-adjacent
> offsets/etc. etc.
>
> For these 'special' mappings each individual VMA may have distinct private
> state and may be tracked/managed by the driver, and indeed I see that
> vma->vm_private_data is used.
>
> Also SGX utilises custom VMA handling operations for fault handling and
> obviously does its own thing.
>
> Because of this, there's just no way mm can know whether it's ok to merge
> or not. Some of the attributes become in effect 'hidden'. You could map
> anything in these ranges and track that with any kind of state.
>
> So it's absolutely correct that we do not merge in these cases, as things
> currently stand.
>
> > >
> > > > I thought to plain check this as it has been two years since my last
> > > > query on the topic (if we could improve either the driver or mm somehow).
> > >
> > > In the past I've substituted the kernel's mm merge code with a user space
> > > replacement:
> > >
> > > https://github.com/enarx/mmledger/blob/main/src/lib.rs
> > >
> > > It's essentially a reimplementation of all the stuff that goes into
> > > mm/mmap.c's vma_merge(). I cannot recall anymore whether merges
> > > which map over existing ranges were working correctly, i.e. was
> > > the issue only concerning adjacent VMAs.
>
> mm/vma.c's vma_merge_existing_range() and vma_merge_new_range() now :) I
> have completely rewritten this code from 6.12 onwards.
>
> > >
> > > What I'm looking at here is whether we can make some constraints that,
> > > if satisfied by the pfnmap code, would let it leverage the code from
> > > vma_merge(). Perhaps by making an explicit call to vma_merge()?
> > > I get that implicit use moves too much responsibility to the mm
> > > subsystem.
>
> Merging/splitting behaviour is an implementation detail and absolutely
> cannot be exposed like this (and would be dangerous to do so).
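Thanks, this is a useful summary. As a quick sanity check on my side that
the non-merging really is particular to the VM_SPECIAL/pfnmap path, and not
something odd in Enarx's mmap pattern, something like the following little
demo (plain anonymous mappings, nothing SGX in it) illustrates the rule you
describe: adjacent pieces with identical attributes collapse into a single
VMA in /proc/self/maps.

  /* Nothing SGX-specific: reserve a window, then map three adjacent
   * anonymous r/w pieces into it and look at the process' maps. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t page = (size_t)sysconf(_SC_PAGESIZE);

          /* Reserve a window so the pieces are guaranteed to be adjacent. */
          char *base = mmap(NULL, 4 * page, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (base == MAP_FAILED)
                  return 1;

          /* Three adjacent pieces with identical attributes: one VMA results. */
          for (int i = 0; i < 3; i++)
                  if (mmap(base + i * page, page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
                          return 1;

          char cmd[64];
          snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", (int)getpid());
          printf("window: %p-%p\n", (void *)base, (void *)(base + 3 * page));
          return system(cmd);
  }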
I can believe that (not unexpected). I've been doing some initial research
into the possibility of integrating Enarx (an LF-governed confidential
computing run-time) with https://github.com/rust-vmm/vm-memory, which is a
pretty solid mmap abstraction with at least Paolo Bonzini as an active
contributor. Enarx is multi-backend, supporting VM-based private memory in
addition to SGX. It has some degraded parts, at least the SNP interfacing
code, that need to be fixed, so I'm now just scoping what needs to be done
to make it fresh again. It is a leisure-time sudoku: recovering a rusty
codebase whenever I have some idle time.

>
> However, I think the only sensible way we could proceed with something like
> this is to add a new vma operation explicitly for pfnmap mappings which are
> _otherwise mergeable_ like:
>
> vm_ops->may_merge_pfnmap(struct vm_area_struct *,
>                          struct vm_area_struct *)
>
> Which the driver could implement to check internal state matches between
> the two VMAs.
>
> It's nicely opt-in as we'd not merge if this were not set, and it defers
> the decision as to 'hidden attributes' to the driver. Also we could ensure
> all other merge characteristics were satisfied first to avoid any invalid
> merging between VMAs - in fact this function would only be called if:
>
> 1. Both VMAs are VM_PFNMAP.
> 2. Both VMAs are otherwise compatible.
> 3. Both VMAs implement vm_ops->may_merge_pfnmap() (should be implied by 2,
>    as vm_ops equality implies file equality).
> 4. vm_ops->may_merge_pfnmap(vma1, vma2) returns true.

OK, strong maybe; see what I wrote below.

>
> At which point we could merge. Note that the existence of .close() could
> cause issues here, though the existing merging rules should handle this
> correctly.
>
> I _suspect_ from a brief reading of the SGX stuff you only really need the
> enclave value to be the same?
>
> So it could be something like:
>
> static bool sgx_may_merge_pfnmap(struct vm_area_struct *vma1,
>                                  struct vm_area_struct *vma2)
> {
>         /* Merge if enclaves match. */
>         return vma1->vm_private_data == vma2->vm_private_data;
> }
>
> Since you seem to be doing mappings explicitly based on virtual addresses
> in these ranges?

Yep, so this is how it works, now that I revisited the code a bit:

1. An enclave is a naturally aligned, static address range of pow2 size.
   That is mapped first and everything happens inside it. It is mapped with
   no permissions. With the opcode ENCLS[ECREATE] this range is connected
   to a single enclave.

2. The enclave, or actually even an SNP VM, then internally has a small
   engine and database that implements the mmap() call and calls back to
   the outside world with a MAP_FIXED range.

So my mmledger book keeper is needed in all cases, and it is needed to
define what is expected from the run-time. It must happen this way because
of the distrust model. Without going into too much detail, enclaves can set
requirements for page permissions cryptographically and have a structure
(the enclave page cache map) for this.

Thus the enclave (or SNP VM) leads, and the run-time follows by delegating
(fixed) mmap() calls. In practice the enclave will #GP if a memory access
is made with an unexpected permission, even if the page tables would allow
it. In theory you could even just plain rwx map the range and rely only on
the EPCM, but in practice matching the page tables is useful and adds a bit
of defense in depth.

With SNP things obviously work correctly because it is just a fancy VM with
weird extra shenanigans.
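To make the "run-time follows" part a bit more concrete, here is a minimal
userspace sketch of what the untrusted side ends up doing. The names
(reserve_elrange, follow_enclave_mmap) are made up for illustration, and
anonymous memory stands in for the real backing, which of course goes
through the SGX device fd:

  #include <stdint.h>
  #include <stddef.h>
  #include <sys/mman.h>

  /* Reserve a naturally aligned, pow2-sized range with no permissions. */
  static void *reserve_elrange(size_t size)
  {
          /* Over-allocate so that an aligned start is guaranteed to exist. */
          uint8_t *raw = mmap(NULL, 2 * size, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
          if (raw == MAP_FAILED)
                  return NULL;

          uintptr_t base = ((uintptr_t)raw + size - 1) & ~(size - 1);

          /* Trim the slack before and after the aligned window. */
          if (base > (uintptr_t)raw)
                  munmap(raw, base - (uintptr_t)raw);
          if (base + size < (uintptr_t)raw + 2 * size)
                  munmap((void *)(base + size),
                         (uintptr_t)raw + 2 * size - (base + size));

          return (void *)base;
  }

  /* The enclave asked for [offset, offset + len) with @prot; just follow it. */
  static int follow_enclave_mmap(void *elrange, size_t offset, size_t len, int prot)
  {
          void *p = mmap((uint8_t *)elrange + offset, len, prot,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
          return p == MAP_FAILED ? -1 : 0;
  }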
With SGX things are fixed up by allocating a static heap [1] and never
translating brk/sbrk to any syscall. Then for regular mmap()s it, I guess,
just hopes that user space software uses it wisely :-)

>
> I could look into RFC'ing something for you guys to test when I get a
> chance?

So I got some ideas I could try out after reading your proposal and
restudying the work I did over two years ago :-) I think it makes sense for
me to try to see how much I can improve first.

With pfnmap, what happens when you have a fixed map, let's say from A to B,
and you then map from A to C where C > B? I cannot recall whether this
succeeds.

Right, and as said, vm_close() does not exist. That could be used, with
mmledger's book keeping, to overwrite the old VMA. Or I could do pro-active
unmapping. mmledger could tell "what changed", i.e. provide syscall
parameters for 0-2 munmaps plus an mmap(MAP_FIXED), and ask the run-time to
execute those instead of just the mmap(MAP_FIXED).

I need to improve that crate anyway because I realized that you need
something like it for any trusted execution environment, since they reason
about what they want from memory :-) So I'll get on this first and come
back to this lore thread later when I have some more experience on the
topic.

> I'm guessing the problem is a concern that a map limit will be hit or VMA
> memory usage will be a problem? If not it may simply seem ugly...

Yea, so Enarx can take any software compiled into a static wasm program and
put it into a protected environment. It has a syscall shim, a wasm run-time
and a loader when it launches. If the application uses mm syscalls like a
good citizen it should not blow up the VMA list, but that ofc is based on
good faith ;-)

[1] Only for completeness:
https://github.com/enarx/enarx/blob/main/crates/shim-sgx/src/heap.rs

BR, Jarkko
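PS: to make the "what changed" idea above a bit more concrete, this is
roughly the shape of what the book keeper could hand back to the run-time.
The types are hypothetical, not the current mmledger API, and anonymous
memory again stands in for the real enclave backing:

  #include <stdint.h>
  #include <stddef.h>
  #include <sys/mman.h>

  /* Hypothetical "what changed" result: at most two unmaps plus one fixed map. */
  struct map_delta {
          struct { uintptr_t addr; size_t len; } unmap[2];
          int nr_unmap;
          uintptr_t map_addr;
          size_t map_len;
          int map_prot;
  };

  /* The run-time executes the delta instead of a bare mmap(MAP_FIXED). */
  static int apply_map_delta(const struct map_delta *d)
  {
          for (int i = 0; i < d->nr_unmap; i++)
                  if (munmap((void *)d->unmap[i].addr, d->unmap[i].len))
                          return -1;

          void *p = mmap((void *)d->map_addr, d->map_len, d->map_prot,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
          return p == MAP_FAILED ? -1 : 0;
  }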