On Thu, Nov 28, 2024 at 07:27:44PM +0100, Jann Horn wrote: > On Thu, Nov 28, 2024 at 7:21 PM Lorenzo Stoakes > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote: > > > (Re-sending the message below since I forgot to reply-all) > > > > > > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > > > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > > > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > > > call_mmap()") (and preceding changes in the same series) it became possible > > > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > > > > > This was previously unnecessarily disallowed, despite the man page > > > > > documentation indicating that it would be, thereby limiting the usefulness > > > > > of F_SEAL_WRITE logic. > > > > > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > > > seal (one which disallows future writes to the memfd) to also be used for > > > > > F_SEAL_WRITE. > > > > > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > > > f_op->mmap() hook used by shmem mappings. > > > > > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > > > shmem_mmap() - the desired logic becomes possible. This is because > > > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > > > have cleared. > > > > > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > > > behaviour") unintentionally undid this logic by moving the > > > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > > > thereby regressing this change. > > > > > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > > > instead performing it in do_mmap() at the point at which VMA flags are > > > > > being determined, which seems in any case to be a more appropriate place in > > > > > which to make this determination. > > > > > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > > > this information using existing logic and eliminate the clearing of > > > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > > > instead. > > > > > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > > > do_mmap(), without holding any kind of lock or counter on the file > > > > yet, then this check is clearly racy somehow, right? I think we have a > > > > race where we intermittently reject shared-readonly mmap() calls? > > > > > > Apropos race, some time ago I reported a way to get a mutable mapping > > > for a write-sealed memfd via a race: > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=219106 > > > > Kind of hard to read rust code, but it looks like you're intentionally > > trying to race sealing on the assumption it's atomic when it's not? That > > doesn't seem like a bug? > > > > The intent of sealing memfds is you establish the memfd buffer, then seal > > it and _only then_ expose it elsewhere. > > > > I may be missing something here, however. > > AFAIU memfds are supposed to also guarantee *to the recipient* of the > shared memfd that the memory inside it won't change anymore, so that > the recipient can parse data out of this shared memory buffer without > having to worry about the data concurrently changing. udmabuf_create() > looks like it indeed breaks that assumption by first calling > check_memfd_seals() and then doing udmabuf_pin_folios() without any > lock that prevents a seal being added in between those. > > That's also why we have memfd_wait_for_pins(), which ensures that > folios in the memfd don't have elevated refcounts when a F_SEAL_WRITE > seal is added. > > (I believe that's one of the major differences in usecases of > F_SEAL_WRITE and F_SEAL_FUTURE_WRITE: F_SEAL_FUTURE_WRITE is enough > for cases where only the creator of the memfd wants to prevent other > tasks from writing into it, while F_SEAL_WRITE is suitable for cases > where the creator and recipient of the memfd want mutual protection.) Those being stupid can also be in kernel code... :) I mean that just looks like udmabuf is doing something buggy and require a fix which is out of scope for mm but perhaps worth reporting direct to udmabuf maintainers. That makes sense re: F_SEAL_WRITE and further makes the case for making it possible to read-only mmap() these buffers for convenience. You'd also want to add other seals to ensure it's truly immutable.