On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@xxxxxxxxxxxx wrote: > > Modern CPUs support memory permissions such as RW and NX bits. Linux has > > supported NX since the release of kernel version 2.6.8 in August 2004 [1]. > > This seems like a confusing way to introduce the subject. Here, you're > talking about page permissions, whereas (as far as I can tell), mseal() is > about making _virtual_ addresses immutable, for some value of immutable. > > > Memory sealing additionally protects the mapping itself against > > modifications. This is useful to mitigate memory corruption issues where > > a corrupted pointer is passed to a memory management syscall. For example, > > such an attacker primitive can break control-flow integrity guarantees > > since read-only memory that is supposed to be trusted can become writable > > or .text pages can get remapped. Memory sealing can automatically be > > applied by the runtime loader to seal .text and .rodata pages and > > applications can additionally seal security critical data at runtime. > > A similar feature already exists in the XNU kernel with the > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. > > Also, Chrome wants to adopt this feature for their CFI work [2] and this > > patchset has been designed to be compatible with the Chrome use case. > > This [2] seems very generic and wide-ranging, not helpful. [5] was more > useful to understand what you're trying to do. > > > The new mseal() is an architecture independent syscall, and with > > following signature: > > > > mseal(void addr, size_t len, unsigned int types, unsigned int flags) > > > > addr/len: memory range. Must be continuous/allocated memory, or else > > mseal() will fail and no VMA is updated. For details on acceptable > > arguments, please refer to comments in mseal.c. Those are also fully > > covered by the selftest. > > Mmm. So when you say "continuous/allocated" what you really mean is > "Must have contiguous VMAs" rather than "All pages in this range must > be populated", yes? > > > types: bit mask to specify which syscall to seal, currently they are: > > MM_SEAL_MSEAL 0x1 > > MM_SEAL_MPROTECT 0x2 > > MM_SEAL_MUNMAP 0x4 > > MM_SEAL_MMAP 0x8 > > MM_SEAL_MREMAP 0x10 > > I don't understand why we want this level of granularity. The OpenBSD > and XNU examples just say "This must be immutable*". For values of > immutable that allow downgrading access (eg RW to RO or RX to RO), > but not upgrading access (RW->RX, RO->*, RX->RW). > > > Each bit represents sealing for one specific syscall type, e.g. > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > > is that the API is extendable, i.e. when needed, the sealing can be > > extended to madvise, mlock, etc. Backward compatibility is also easy. > > Honestly, it feels too flexible. Why not just two flags to mprotect() > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- > maybe for some things we want to be able to downgrade and for other > things, we don't. I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns. -- Pedro