On Thu, Nov 28, 2024 at 7:20 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote: > > On Thu, Nov 28, 2024 at 06:58:21PM +0100, Julian Orth wrote: > > (Re-sending the message below since I forgot to reply-all) > > > > On Thu, Nov 28, 2024 at 6:46 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > > > On Thu, Nov 28, 2024 at 4:06 PM Lorenzo Stoakes > > > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > > > In commit 158978945f31 ("mm: perform the mapping_map_writable() check after > > > > call_mmap()") (and preceding changes in the same series) it became possible > > > > to mmap() F_SEAL_WRITE sealed memfd mappings read-only. > > > > > > > > This was previously unnecessarily disallowed, despite the man page > > > > documentation indicating that it would be, thereby limiting the usefulness > > > > of F_SEAL_WRITE logic. > > > > > > > > We fixed this by adapting logic that existed for the F_SEAL_FUTURE_WRITE > > > > seal (one which disallows future writes to the memfd) to also be used for > > > > F_SEAL_WRITE. > > > > > > > > For background - the F_SEAL_FUTURE_WRITE seal clears VM_MAYWRITE for a > > > > read-only mapping to disallow mprotect() from overriding the seal - an > > > > operation performed by seal_check_write(), invoked from shmem_mmap(), the > > > > f_op->mmap() hook used by shmem mappings. > > > > > > > > By extending this to F_SEAL_WRITE and critically - checking > > > > mapping_map_writable() to determine if we may map the memfd AFTER we invoke > > > > shmem_mmap() - the desired logic becomes possible. This is because > > > > mapping_map_writable() explicitly checks for VM_MAYWRITE, which we will > > > > have cleared. > > > > > > > > Commit 5de195060b2e ("mm: resolve faulty mmap_region() error path > > > > behaviour") unintentionally undid this logic by moving the > > > > mapping_map_writable() check before the shmem_mmap() hook is invoked, > > > > thereby regressing this change. > > > > > > > > We reinstate this functionality by moving the check out of shmem_mmap() and > > > > instead performing it in do_mmap() at the point at which VMA flags are > > > > being determined, which seems in any case to be a more appropriate place in > > > > which to make this determination. > > > > > > > > In order to achieve this we rework memfd seal logic to allow us access to > > > > this information using existing logic and eliminate the clearing of > > > > VM_MAYWRITE from seal_check_write() which we are performing in do_mmap() > > > > instead. > > > > > > If we already check is_readonly_sealed() and strip VM_MAYWRITE in > > > do_mmap(), without holding any kind of lock or counter on the file > > > yet, then this check is clearly racy somehow, right? I think we have a > > > race where we intermittently reject shared-readonly mmap() calls? > > > > Apropos race, some time ago I reported a way to get a mutable mapping > > for a write-sealed memfd via a race: > > > > https://bugzilla.kernel.org/show_bug.cgi?id=219106 > > Kind of hard to read rust code, but it looks like you're intentionally > trying to race sealing on the assumption it's atomic when it's not? That > doesn't seem like a bug? > > The intent of sealing memfds is you establish the memfd buffer, then seal > it and _only then_ expose it elsewhere. > > I may be missing something here, however. systemd allows the client to send journal logs via a file descriptor. If this file descriptor is a write-sealed, truncate-sealed memfd, systemd uses an optimized code path that relies on the contents of the file being immutable. I could not find any actual way to exploit this nor could I find any other open-source code that relies on write-seals being accurate. But maybe there is some code out there that could be exploited this way. E.g. if it used the contents of the file as an array index, then you could easily get a TOC/TOU exploit with this. > > > > > > > > > Like: > > > > > > process 1: calls mmap(PROT_READ, MAP_PRIVATE), checks is_readonly_sealed() > > > process 2: adds a F_SEAL_WRITE seal > > > process 1: enters mmap_region(), is_shared_maywrite() is true, > > > mapping_map_writable() fails > > > > > > But even if we fix that, the same scenario would result in > > > F_SEAL_WRITE randomly failing depending on the ordering, so it's not > > > like we can actually do anything particularly sensible if these two > > > operations race. Taking a step back, read-only shared mappings of > > > F_SEAL_WRITE-sealed files are just kind of a bad idea because if > > > someone first creates a read-only shared mapping and *then* tries to > > > apply F_SEAL_WRITE, that won't work because the existing mapping will > > > be VM_MAYWRITE. > > > > > > And the manpage is just misleading on interaction with shared mappings > > > in general, it says "Using the F_ADD_SEALS operation to set the > > > F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping > > > exists" when actually, it more or less fails if any shared mapping > > > exists at all. > > > > > > @Julian Orth: Did you report this regression because this change > > > caused issues with existing userspace code? > > > > I noticed this because it broke one of my testcases. It would also > > affect production code but making that code work on pre-6.6 kernels is > > probably a good idea anyway. > > Thanks for having that test case! I have added a test here to ensure we do > not regress this again. > > This was a new feature introduced in 6.6, there is no reason to backport it > to any earlier kernels if this is what you mean :) > > It's more a convenience thing like 'hm I can read() this but I can > mmap-read this even though the man page says I can'. > > > > > > > > > > Reported-by: Julian Orth <ju.orth@xxxxxxxxx> > > > > Closes: https://lore.kernel.org/all/CAHijbEUMhvJTN9Xw1GmbM266FXXv=U7s4L_Jem5x3AaPZxrYpQ@xxxxxxxxxxxxxx/ > > > > Fixes: 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>