* jeffxu@xxxxxxxxxxxx <jeffxu@xxxxxxxxxxxx> [240415 12:35]:
> From: Jeff Xu <jeffxu@xxxxxxxxxxxx>
>
> The new mseal() is a syscall on 64 bit CPU, and with
> following signature:
>
> int mseal(void *addr, size_t len, unsigned long flags)
> addr/len: memory range.
> flags: reserved.
>
> mseal() blocks following operations for the given memory range.
>
> 1> Unmapping, moving to another location, and shrinking the size,
> via munmap() and mremap(), can leave an empty space, therefore can
> be replaced with a VMA with a new set of attributes.
>
> 2> Moving or expanding a different VMA into the current location,
> via mremap().
>
> 3> Modifying a VMA via mmap(MAP_FIXED).
>
> 4> Size expansion, via mremap(), does not appear to pose any specific
> risks to sealed VMAs. It is included anyway because the use case is
> unclear. In any case, users can rely on merging to expand a sealed VMA.
>
> 5> mprotect() and pkey_mprotect().
>
> 6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
> memory, when users don't have write permission to the memory. Those
> behaviors can alter region contents by discarding pages, effectively a
> memset(0) for anonymous memory.
>
> Following input during RFC are incorporated into this patch:
>
> Jann Horn: raising awareness and providing valuable insights on the
> destructive madvise operations.
> Linus Torvalds: assisting in defining system call signature and scope.
> Liam R. Howlett: perf optimization.
> Theo de Raadt: sharing the experiences and insight gained from
> implementing mimmutable() in OpenBSD.
>
> Finally, the idea that inspired this patch comes from Stephen Röttger’s
> work in Chrome V8 CFI.

No per-vma change is checked prior to entering a per-vma modification
loop today. This means that mseal() differs in behaviour: it fails
up-front, whereas every other function has "partial change failure".

I'm not saying it's wrong or that it's right - I'm just wondering what
the direction is here.

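To make the two behaviours concrete, here is a toy sketch. It is not kernel code: struct toy_vma and both function names are made up for illustration, and the "sealed" flag simply stands in for whatever per-vma check would reject the change. The first function validates the whole range before touching anything, the second checks inside the modification loop.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VMA. Hypothetical; not the kernel's vm_area_struct. */
struct toy_vma {
	bool sealed;	/* true if this region must not be modified */
	int prot;	/* current protection bits */
};

/*
 * "Up-front failure": walk the whole range once to validate it, and only
 * then modify. If any region is sealed, nothing has been changed.
 */
int set_prot_upfront(struct toy_vma *v, size_t n, int prot)
{
	assert(v != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (v[i].sealed)
			return -EPERM;

	for (size_t i = 0; i < n; i++)
		v[i].prot = prot;

	return 0;
}

/*
 * "Partial change failure": the check lives inside the modification loop.
 * If a sealed region is hit, the regions before it stay modified.
 */
int set_prot_partial(struct toy_vma *v, size_t n, int prot)
{
	assert(v != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (v[i].sealed)
			return -EPERM;
		v[i].prot = prot;
	}

	return 0;
}
```

With three regions where only the last is sealed, both variants return -EPERM, but the up-front one leaves all three untouched while the partial one has already changed the first two.
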
Either we should do as much up-front as possible or keep with tradition
and have (partial) success where possible.

If you look at do_mprotect_pkey(), you can even see
map_deny_write_exec() being checked in a loop during modifications.

I think we can all agree that having some up-front and some later
without any reason will lead to a higher probability of things getting
missed.

Thanks,
Liam