On Thu, Nov 28, 2024 at 06:45:46PM +0100, Jann Horn wrote: > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > call_mmap()") (and preceding changes in the same series) it became possible > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > This was previously unnecessarily disallowed, despite the man page > > documentation indicating that it would be, thereby limiting the usefulness > > of F_SEAL_WRITE logic. > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > seal (one which disallows future writes to the memfd) to also be used for > > F_SEAL_WRITE. > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > read-only mapping to disallow mprotect() from overriding the seal - an > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > f_op->mmap() hook used by shmem mappings. > > > > By extending this to F_SEAL_WRITE and critically - checking > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > shmem_mmap() - the desired logic becomes possible. This is because > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > have cleared. > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > behaviour") unintentionally undid this logic by moving the > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > thereby regressing this change. > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > instead performing it in do_mmap() at the point at which VMA flags are > > being determined, which seems in any case to be a more appropriate place in > > which to make this determination. > > > > In order to achieve this we rework memfd seal logic to allow us access to > > this information using existing logic and eliminate the clearing of > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > instead. > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > do_mmap(), without holding any kind of lock or counter on the file > yet, then this check is clearly racy somehow, right? I think we have a > race where we intermittently reject shared-readonly mmap() calls? > > Like: > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > process 2: adds a F_SEAL_WRITE seal > process 1: enters mmap_region(), is_shared_maywrite() is true, > mapping_map_writable() fails I don't think this matters? Firstly these would have to be threads unless you are going out of your way to transmit the memfd incompletely set up via a socket or something, and then you'd have to be doing it on the assumption that it wouldn't race? The whole purpose of this change is to _allow read-only mapping *at all*_. Not to avoid silly races that are the product of somebody doing stupid things. > > But even if we fix that, the same scenario would result in > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > like we can actually do anything particularly sensible if these two > operations race. Taking a step back, read-only shared mappings of > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > someone first creates a read-only shared mapping and *then* tries to > apply F_SEAL_WRITE, that won't work because the existing mapping will > be VM_MAYWRITE. I don't think so? If they try to do that, attempting to apply the seal will fail as write will be disallowed. So there's no risk of overriding the seal. The idea is you establish a buffer, write into it, unmap, write-seal, and now you can mmap() it PROT_READ. Obviously it's not sensible (or really probably sensibly feasible) to try to find every VMA that has it opened VM_READ | VM_MAYWRITE and clear the VM_MAYWRITE, so instead we simply disallow that scenario. But it's totally reasonable to be able to mmap() it readonly afterwards. > > And the manpage is just misleading on interaction with shared mappings > in general, it says "Using the F_ADD_SEALS operation to set the > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > exists" when actually, it more or less fails if any shared mapping > exists at all. No, it's when any writable mapping exists after my changes :) but people might not be quite aware of the distinction between VM_MAYWRITE and VM_WRITE. > > @Julian Orth: Did you report this regression because this change > caused issues with existing userspace code? > > > Reported-by: Julian Orth <ju.orth@xxxxxxxxx> > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@xxxxxxxxxxxxxx/ > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> In any case, we are not discussing my original patch in 6.6 that permitted this behaviour, whether you agree or disagree it was sensible, we have regressed user-visible behaviour, this change restores it.