On Thu, Nov 28, 2024 at 7:21 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote: > On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote: > > (Re-sending the message below since I forgot to reply-all) > > > > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > > call_mmap()") (and preceding changes in the same series) it became possible > > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > > > This was previously unnecessarily disallowed, despite the man page > > > > documentation indicating that it would be, thereby limiting the usefulness > > > > of F_SEAL_WRITE logic. > > > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > > seal (one which disallows future writes to the memfd) to also be used for > > > > F_SEAL_WRITE. > > > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > > f_op->mmap() hook used by shmem mappings. > > > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > > shmem_mmap() - the desired logic becomes possible. This is because > > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > > have cleared. > > > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > > behaviour") unintentionally undid this logic by moving the > > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > > thereby regressing this change. > > > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > > instead performing it in do_mmap() at the point at which VMA flags are > > > > being determined, which seems in any case to be a more appropriate place in > > > > which to make this determination. > > > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > > this information using existing logic and eliminate the clearing of > > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > > instead. > > > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > > do_mmap(), without holding any kind of lock or counter on the file > > > yet, then this check is clearly racy somehow, right? I think we have a > > > race where we intermittently reject shared-readonly mmap() calls? > > > > Apropos race, some time ago I reported a way to get a mutable mapping > > for a write-sealed memfd via a race: > > > > https://bugzilla.kernel.org/show_bug.cgi?id=219106 > > Kind of hard to read rust code, but it looks like you're intentionally > trying to race sealing on the assumption it's atomic when it's not? That > doesn't seem like a bug? > > The intent of sealing memfds is you establish the memfd buffer, then seal > it and _only then_ expose it elsewhere. > > I may be missing something here, however. AFAIU memfds are supposed to also guarantee *to the recipient* of the shared memfd that the memory inside it won't change anymore, so that the recipient can parse data out of this shared memory buffer without having to worry about the data concurrently changing. udmabuf_create() looks like it indeed breaks that assumption by first calling check_memfd_seals() and then doing udmabuf_pin_folios() without any lock that prevents a seal being added in between those. That's also why we have memfd_wait_for_pins(), which ensures that folios in the memfd don't have elevated refcounts when a F_SEAL_WRITE seal is added. (I believe that's one of the major differences in usecases of F_SEAL_WRITE and F_SEAL_FUTURE_WRITE: F_SEAL_FUTURE_WRITE is enough for cases where only the creator of the memfd wants to prevent other tasks from writing into it, while F_SEAL_WRITE is suitable for cases where the creator and recipient of the memfd want mutual protection.)