On Thu, Nov 28, 2024 at 7:05 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote: > On Thu, Nov 28, 2024 at 06:45:46PM +0100, Jann Horn wrote: > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > call_mmap()") (and preceding changes in the same series) it became possible > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > This was previously unnecessarily disallowed, despite the man page > > > documentation indicating that it would be, thereby limiting the usefulness > > > of F_SEAL_WRITE logic. > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > seal (one which disallows future writes to the memfd) to also be used for > > > F_SEAL_WRITE. > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > f_op->mmap() hook used by shmem mappings. > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > shmem_mmap() - the desired logic becomes possible. This is because > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > have cleared. > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > behaviour") unintentionally undid this logic by moving the > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > thereby regressing this change. > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > instead performing it in do_mmap() at the point at which VMA flags are > > > being determined, which seems in any case to be a more appropriate place in > > > which to make this determination. > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > this information using existing logic and eliminate the clearing of > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > instead. > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > do_mmap(), without holding any kind of lock or counter on the file > > yet, then this check is clearly racy somehow, right? I think we have a > > race where we intermittently reject shared-readonly mmap() calls? > > > > Like: > > > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > > process 2: adds a F_SEAL_WRITE seal > > process 1: enters mmap_region(), is_shared_maywrite() is true, > > mapping_map_writable() fails > > I don't think this matters? Firstly these would have to be threads unless you > are going out of your way to transmit the memfd incompletely set up via a socket > or something, and then you'd have to be doing it on the assumption that it > wouldn't race? Ah, I guess that's true. > The whole purpose of this change is to _allow read-only mapping *at all*_. Not > to avoid silly races that are the product of somebody doing stupid things. > > > > > But even if we fix that, the same scenario would result in > > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > > like we can actually do anything particularly sensible if these two > > operations race. Taking a step back, read-only shared mappings of > > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > > someone first creates a read-only shared mapping and *then* tries to > > apply F_SEAL_WRITE, that won't work because the existing mapping will > > be VM_MAYWRITE. > > I don't think so? > > If they try to do that, attempting to apply the seal will fail as write will be > disallowed. So there's no risk of overriding the seal. > > The idea is you establish a buffer, write into it, unmap, write-seal, and now > you can mmap() it PROT_READ. > > Obviously it's not sensible (or really probably sensibly feasible) to try to > find every VMA that has it opened VM_READ | VM_MAYWRITE and clear the > VM_MAYWRITE, so instead we simply disallow that scenario. > > But it's totally reasonable to be able to mmap() it readonly afterwards. > > > > > And the manpage is just misleading on interaction with shared mappings > > in general, it says "Using the F_ADD_SEALS operation to set the > > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > > exists" when actually, it more or less fails if any shared mapping > > exists at all. > > No, it's when any writable mapping exists after my changes :) but people > might not be quite aware of the distinction between VM_MAYWRITE and > VM_WRITE. To clarify, do you read "writable" as "VM_MAYWRITE|VM_SHARED"? > > @Julian Orth: Did you report this regression because this change > > caused issues with existing userspace code? > > > > > Reported-by: Julian Orth <ju.orth@xxxxxxxxx> > > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@xxxxxxxxxxxxxx/ > > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > In any case, we are not discussing my original patch in 6.6 that permitted > this behaviour, whether you agree or disagree it was sensible, we have > regressed user-visible behaviour, this change restores it. Hm, yeah, you're right.