Hi Pedro On Sun, Sep 8, 2024 at 2:56 PM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote: > > On Sun, Sep 8, 2024 at 10:35 PM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote: > > Hi, > > > > I agree with most of the points. Sitting down here to write unofficial > > guidelines for mseal behavior. > > > > mseal should seal regions and mark them immutable, which means their protection > > and contents (ish?) (not _only_ backing mapping, but also contents in general > > (see madvise below)) should not be changed throughout the lifetime of the address space. > > > > For the general syscall interface, this means: > > 1) mprotect and munmap need to be blocked on mseal regions. > > 1a) munmap _cannot_ tolerate partial failure, per POSIX. > > 2b) mmap MAP_FIXED counts as an unmap operation and also needs to be blocked and return -EPERM. > > > > 2) Most madvise calls are allowed, except for destructive operations on > > read-only anonymous _pages_ (MADV_DONTNEED is destructive for anon, but we also don't care > > about blocking these ops if we can do it manually with e.g memset) > > 2a) The current implementation only blocks discard on anonymous _regions_, which is slightly > > different. We probably do want to block these on MAP_PRIVATE file mappings, as to block > > stuff like madvise MADV_DONTNEED on program rodata. > > 2b) We take into account pkeys when doing the permission checks. > > > > 3) mremap is not allowed as we'd change the "contents" of the old region. > > 3a) Should mremap expansion be allowed? aka only block moving and shrinking, but allow expansion. > > We already informally allow expansion if e.g mmapping after it + mseal. > > > > 4) mlock and msync are allowed. > > > > 5) mseal is blocked. > > > > 6) Other miscellaneous syscalls (mbind, etc) that do not change contents in any way, are allowed. > > 6a) This obviously means PTEs can change as long as the contents don't. Swapping is also ok. > > > > 7) FOLL_FORCE (kernel-internal speak, more commonly seen as ptrace and /proc/self/mem from userspace) > > should be disallowed (?) > > 7a) This currently does not happen, and seems like a large hole? But disallowing this > > would probably severely break ptrace if the ELF sealing plans come to fruition. > > > > When we say "disallowed", we usually (apart from munmap) allow for partial failure. This > > means getting an -EPERM while part of the call succeeded, if we e.g mprotect a region consisting > > of [NORMAL VMA][SEALED VMA]. We do not want to test for this, because we do not want to paint ourselves > > into a corner - this is strictly "undefined behavior". The msealed regions themselves > > will never be touched in such cases. (we do however want to test munmap operation atomicity, but this is > > also kind of a munmap-related test, and might not really be something we really want in the mseal tests) > > > > Kernel-internal wise: The VMA and PTE modifications resulting from the above operations are blocked. > > Sealed VMAs allow splitting and merging; there was contention about the splitting issue, but it truly > > does not make sense to block operations unless they affect a VMA entirely, and that would also force > > VMA merging to be ABI ("vma_merge isn't merging these two regions and now my madvise works/doesn't work :("). > > > > > > Do I have everything right? Am I missing anything? > > Small addendum: file write, truncate and hole punching might also > matter for sealed file-backed regions, as these change the region's > contents, but we probably > want to rely on file write permissions to protect against this (as we > already do). Any other solution is probably terrible and probably > endlessly NAK'd by fs folks, but it does > mean sealed regions aren't really immutable if you or the attacker can > write to the file. > Right. The mseal protects the control-data of VMA (e.g. prot), it doesn't do anything more than that. The file permission relies on dac/mac control. -Jeff > -- > Pedro