On Wed, Jan 24, 2024 at 12:06 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Jeff Xu <jeffxu@xxxxxxxxxxxx> [240124 12:50]:
> > On Tue, Jan 23, 2024 at 10:15 AM Liam R. Howlett
> > <Liam.Howlett@xxxxxxxxxx> wrote:
> > >
> > > * jeffxu@xxxxxxxxxxxx <jeffxu@xxxxxxxxxxxx> [240122 10:29]:
> > > > From: Jeff Xu <jeffxu@xxxxxxxxxxxx>
> > > >
> > > > The new mseal() is a syscall on 64-bit CPUs, with the
> > > > following signature:
> > > >
> > > > int mseal(void *addr, size_t len, unsigned long flags)
> > > > addr/len: memory range.
> > > > flags: reserved.
> > > >
> > > > mseal() blocks the following operations for the given memory range:
> > > >
> > > > 1> Unmapping, moving to another location, and shrinking the size,
> > > > via munmap() and mremap(). These can leave an empty space in the
> > > > address space, which can then be filled by a VMA with a new set of
> > > > attributes.
> > > >
> > > > 2> Moving or expanding a different VMA into the current location,
> > > > via mremap().
> > > >
> > > > 3> Modifying a VMA via mmap(MAP_FIXED).
> > > >
> > > > 4> Size expansion via mremap(). This does not appear to pose any
> > > > specific risk to sealed VMAs, but it is included anyway because the
> > > > use case is unclear. In any case, users can rely on merging to
> > > > expand a sealed VMA.
> > > >
> > > > 5> mprotect() and pkey_mprotect().
> > > >
> > > > 6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for
> > > > anonymous memory, when users don't have write permission to the
> > > > memory. Those behaviors can alter region contents by discarding
> > > > pages, effectively a memset(0) for anonymous memory.
> > > >
> > > > In addition, mmap() has two related changes.
> > > >
> > > > The PROT_SEAL bit in the prot field of mmap(): when present, it
> > > > marks the map as sealed since creation.
> > > >
> > > > The MAP_SEALABLE bit in the flags field of mmap(): when present, it
> > > > marks the map as sealable. A map created without MAP_SEALABLE will
> > > > not support sealing, i.e. mseal() will fail.
> > > >
> > > > Applications that don't care about sealing will see their behavior
> > > > unchanged. Those that need sealing support opt in by adding
> > > > MAP_SEALABLE to mmap().
> > > >
> > > > I would like to formally acknowledge the valuable contributions
> > > > received during the RFC process, which were instrumental
> > > > in shaping this patch:
> > > >
> > > > Jann Horn: raising awareness of, and providing valuable insights on,
> > > > the destructive madvise operations.
> > > > Linus Torvalds: assisting in defining the system call signature and
> > > > scope.
> > > > Pedro Falcato: suggesting sealing in mmap().
> > > > Theo de Raadt: sharing the experience and insights gained from
> > > > implementing mimmutable() in OpenBSD.
> > > >
> > > > Finally, the idea that inspired this patch comes from Stephen
> > > > Röttger's work on Chrome V8 CFI.
> > > >
> > > > Signed-off-by: Jeff Xu <jeffxu@xxxxxxxxxxxx>
> > > > ---
> > > >  include/linux/mm.h                     |  48 ++++
> > > >  include/linux/syscalls.h               |   1 +
> > > >  include/uapi/asm-generic/mman-common.h |   8 +
> > > >  mm/Makefile                            |   4 +
> > > >  mm/madvise.c                           |  12 +
> > > >  mm/mmap.c                              |  27 ++
> > > >  mm/mprotect.c                          |  10 +
> > > >  mm/mremap.c                            |  31 +++
> > > >  mm/mseal.c                             | 343 +++++++++++++++++++++++++
> > > >  9 files changed, 484 insertions(+)
> > > >  create mode 100644 mm/mseal.c
> > >
> > > ...
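For illustration, a minimal userspace sketch of the proposed API
follows. MAP_SEALABLE, PROT_SEAL, and the mseal() syscall number are
introduced by this series and are not in any released uapi headers, so
the constants defined below are placeholders, not values taken from the
patch:

----------------------------------------------------------------------------
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MAP_SEALABLE
#define MAP_SEALABLE	0x8000000	/* placeholder value */
#endif
#ifndef __NR_mseal
#define __NR_mseal	462		/* placeholder number */
#endif

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/*
	 * Opt in to sealing at map time with MAP_SEALABLE.
	 * (Alternatively, PROT_SEAL in prot would seal at creation.)
	 */
	void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_SEALABLE, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Seal the range; the flags argument is reserved and must be 0. */
	if (syscall(__NR_mseal, p, page, 0))
		printf("mseal: %s\n", strerror(errno));

	/* Per the cover letter, these now fail with EPERM. */
	if (mprotect(p, page, PROT_READ))
		printf("mprotect: %s\n", strerror(errno));
	if (munmap(p, page))
		printf("munmap: %s\n", strerror(errno));

	return 0;
}
----------------------------------------------------------------------------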
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index b78e83d351d2..32bc2179aed0 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1213,6 +1213,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > > >  {
> > > >  	struct mm_struct *mm = current->mm;
> > > >  	int pkey = 0;
> > > > +	unsigned long vm_seals;
> > > >
> > > >  	*populate = 0;
> > > >
> > > > @@ -1233,6 +1234,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > > >  	if (flags & MAP_FIXED_NOREPLACE)
> > > >  		flags |= MAP_FIXED;
> > > >
> > > > +	vm_seals = get_mmap_seals(prot, flags);
> > > > +
> > > >  	if (!(flags & MAP_FIXED))
> > > >  		addr = round_hint_to_min(addr);
> > > >
> > > > @@ -1261,6 +1264,13 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > > >  			return -EEXIST;
> > > >  	}
> > > >
> > > > +	/*
> > > > +	 * Check if the address range is sealed for do_mmap().
> > > > +	 * can_modify_mm assumes we have acquired the lock on MM.
> > > > +	 */
> > > > +	if (!can_modify_mm(mm, addr, addr + len))
> > > > +		return -EPERM;
> > > > +
> > >
> > > This is called after get_unmapped_area(), so this area is either going
> > > to be MAP_FIXED and return the "hint" addr, or it's going to be empty.
> > > You can probably avoid walking the VMAs in the non-FIXED case. This
> > > would remove the overhead of your check in the most common case.
> > >
> >
> > Thanks for flagging this!
> >
> > I wasn't entirely sure about get_unmapped_area() after reading the
> > code; it calls a few variants of the arch_get_unmapped_area_xxx()
> > functions.
> >
> > For example, generic_get_unmapped_area_topdown() seems to return a
> > non-null address even when MAP_FIXED is not set:
> >
> > ----------------------------------------------------------------------------
> > generic_get_unmapped_area_topdown (
> > ...
> > 	if (flags & MAP_FIXED)	<-- MAP_FIXED case.
> > 		return addr;
> >
> > 	/* requesting a specific address */
> > 	if (addr) {	<-- note: not MAP_FIXED
> > 		addr = PAGE_ALIGN(addr);
> > 		vma = find_vma_prev(mm, addr, &prev);
> > 		if (mmap_end - len >= addr && addr >= mmap_min_addr &&
> > 		    (!vma || addr + len <= vm_start_gap(vma)) &&
> > 		    (!prev || addr >= vm_end_gap(prev)))
> > 			return addr;	<-- note: returns a non-null addr here.
> > 	}
>
> Sorry, I was not clear. Either MAP_FIXED will just return the addr, or
> the addr that is returned has no VMA (the memory area is empty). This
> function finds a gap to place your data, and the gap is (at least) as big
> as you want (usually oversized, but that doesn't matter here). The
> mmap_lock is held, so we know it's going to remain empty.
>
> So there are two scenarios:
> 1. MAP_FIXED, which may or may not have a VMA over the range
> 2. An address which has no VMA over the range
>
> Anyways, this is probably not needed, because of what I say later.
>
> >
> > ----------------------------------------------------------------------------
> > I also thought about adding a check for addr != NULL instead, i.e.:
> >
> > 	if (addr && !can_modify_mm(mm, addr, addr + len))
> > 		return -EPERM;
> >
> > But using MAP_FIXED to allocate memory at address 0 is legit, e.g.
> > allocating a PROT_NONE | PROT_SEAL mapping at address 0.
> >
> > Another factor to consider: what would be the cost of passing a null
> > address into can_modify_mm()? The search would cover 0 to len.
>
> Almost always zero VMAs to check, it's not worth optimising. The maple
> tree will walk to the first range and it'll be 0 to some very large
> number, most likely.
>
Got you.
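For reference on what this cost discussion is about, here is a sketch of
a sealing check along the lines of the patch's can_modify_mm(); the
authoritative version lives in mm/mseal.c in the series, and VM_SEALED
is the per-VMA flag it adds:

----------------------------------------------------------------------------
#include <linux/mm.h>

/*
 * Sketch only. Returns false if any VMA in [start, end) is sealed.
 * Callers must hold the mmap lock. On an empty range (the non-FIXED
 * case above) the iterator finds no VMAs, so the walk costs almost
 * nothing.
 */
static bool can_modify_mm(struct mm_struct *mm, unsigned long start,
			  unsigned long end)
{
	struct vm_area_struct *vma;

	VMA_ITERATOR(vmi, mm, start);

	for_each_vma_range(vmi, vma, end) {
		if (vma->vm_flags & VM_SEALED)
			return false;
	}

	return true;
}
----------------------------------------------------------------------------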
> > > >
> > > >  	if (prot == PROT_EXEC) {
> > > >  		pkey = execute_only_pkey(mm);
> > > >  		if (pkey < 0)
> > > > @@ -1376,6 +1386,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > > >  			vm_flags |= VM_NORESERVE;
> > > >  	}
> > > >
> > > > +	vm_flags |= vm_seals;
> > > >  	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
> > > >  	if (!IS_ERR_VALUE(addr) &&
> > > >  	    ((vm_flags & VM_LOCKED) ||
> > > > @@ -2679,6 +2690,14 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> > > >  	if (end == start)
> > > >  		return -EINVAL;
> > > >
> > > > +	/*
> > > > +	 * Check if memory is sealed before arch_unmap.
> > > > +	 * Prevent unmapping a sealed VMA.
> > > > +	 * can_modify_mm assumes we have acquired the lock on MM.
> > > > +	 */
> > > > +	if (!can_modify_mm(mm, start, end))
> > > > +		return -EPERM;
> > > > +
> > >
> > > This function is currently called from mmap_region(), so we are going to
> > > run this check twice as you have it; once in do_mmap(), then again in
> > > mmap_region() -> do_vmi_munmap(). This effectively doubles your impact
> > > on MAP_FIXED calls.
> > >
> > Yes. Addressing this would require a new flag in do_vmi_munmap():
> > after passing the first check in mmap(), we could set the flag to
> > false so that do_vmi_munmap() would not check again.
> >
> > However, this approach was attempted in v1 and v2 of the patch [1] [2],
> > and was strongly opposed by Linus. It was considered too ad hoc, and it
> > hurt readability.
>
> Oh yes, I recall that now. He was not pleased.
>
> >
> > Below is my text in v2: [3]
> >
> > "When handling mmap/munmap/mremap/mmap, once the code has passed
> > can_modify_mm(), it means the memory area is not sealed; if the code
> > continues to call other utility functions, we don't need to check the
> > seal again. This is the case for mremap(): the seals of the src and
> > dest addresses (when applicable) are checked first, and later, when
> > the code calls do_vmi_munmap(), it no longer needs to check the seal
> > again."
> >
> > Considering this is the MAP_FIXED case, and maybe that is not used
> > that often in practice, I think this is acceptable performance-wise,
> > unless you know of another solution.
>
> Okay, sure, I haven't been yelled at on the ML for a few weeks. Here
> goes:
>
> do_mmap() will call get_unmapped_area(), which will return an empty area
> (no need to check mseal, I hope - or we have larger issues here) or a
> MAP_FIXED address.
>
> do_mmap() will pass the address along to mmap_region().
>
> mmap_region() will then call do_vmi_munmap() - which will either remove
> the VMA(s) in the way, or do nothing... or error.
>
> mmap_region() will return -ENOMEM in the case of an error returned from
> do_vmi_munmap() today. Change that to return the error code, and let
> do_vmi_munmap() do the mseal check. If the mseal check fails, then the
> error is propagated the same way -ENOMEM is propagated today.
>
> This relies on the fact that we only really need to check the mseal
> status of existing VMAs, and we can only really map over existing VMAs
> by first munmapping them.
>
> It does move your error return to much later in the call stack, but it
> removes duplicate work and means less code. Considering this should be
> a rare event, I don't think that's of concern.
>
I think that is a great idea. I will try to implement it and get back
to you.
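Expressed as a rough diff against mmap_region(), the suggestion might
look like the following. This is a sketch only: the surrounding context
is paraphrased rather than taken from the tree, and it assumes a local
"error" variable is available:

----------------------------------------------------------------------------
 	/* Unmap any existing mapping in the area */
-	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false))
-		return -ENOMEM;
+	/*
+	 * do_vmi_munmap() now performs the mseal check, so -EPERM
+	 * (or any other error) propagates instead of being flattened
+	 * to -ENOMEM.
+	 */
+	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
+	if (error)
+		return error;
----------------------------------------------------------------------------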
> >
> > [1] https://lore.kernel.org/lkml/20231016143828.647848-6-jeffxu@xxxxxxxxxxxx/
> > [2] https://lore.kernel.org/lkml/20231017090815.1067790-6-jeffxu@xxxxxxxxxxxx/
> > [3] https://lore.kernel.org/lkml/CALmYWFux2m=9189Gs0o8-xhPNW4dnFvtqj7ptcT5QvzxVgfvYQ@xxxxxxxxxxxxxx/
> >
> > > >
> > > >  	/* arch_unmap() might do unmaps itself.  */
> > > >  	arch_unmap(mm, start, end);
> > > >
> > > > @@ -3102,6 +3121,14 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > >  {
> > > >  	struct mm_struct *mm = vma->vm_mm;
> > > >
> > > > +	/*
> > > > +	 * Check if memory is sealed before arch_unmap.
> > > > +	 * Prevent unmapping a sealed VMA.
> > > > +	 * can_modify_mm assumes we have acquired the lock on MM.
> > > > +	 */
> > > > +	if (!can_modify_mm(mm, start, end))
> > > > +		return -EPERM;
> > > > +
> > >
> > > I am sure you've looked at the callers; from what I found there are two:
> > >
> > > The brk call uses this function, so it may check more than one VMA in
> > > that path. Will the brk VMAs potentially be msealed? I guess someone
> > > could do that?
> > >
> > > The other place this is used is in ipc/shm.c, where the start/end is
> > > just the vma start/end, so we only really need to check that one vma.
> > >
> > Yes. Those two cases were looked at, and they were the main reason that
> > MAP_SEALABLE was introduced as part of mmap().
> >
> > As in the open discussion of the V3/V4 patch: [4] [5]
> >
> > [4] https://lore.kernel.org/linux-mm/20231212231706.2680890-1-jeffxu@xxxxxxxxxxxx/T/
> > [5] https://lore.kernel.org/linux-mm/20240104185138.169307-3-jeffxu@xxxxxxxxxxxx/T/
> >
> > Copied here for ease of reading:
> > ---------------------------------------------------------------------------------------------
> >
> > During the development of V3, I had new questions and thoughts that I
> > wished to discuss.
> >
> > 1> shm/aio
> > From reading the code, it seems to me that aio/shm can mmap/munmap
> > maps on behalf of userspace, e.g. ksys_shmdt() in shm.c. The lifetime
> > of those mappings is not tied to the lifetime of the process. If that
> > memory is sealed from userspace, then unmap will fail. This isn't a
> > huge problem, since the memory will eventually be freed at exit or
> > exec. However, the solution feels incomplete, because of the leaked
> > VMA address space during the lifetime of the process.
> >
> > 2> brk (heap/stack)
> > Currently, userspace applications can seal parts of the heap by
> > calling malloc() and mseal(). This raises the question of what the
> > expected behavior is when sealing the heap is attempted.
> >
> > Let's assume the following calls from userspace:
> >
> > ptr = malloc(size);
> > mprotect(ptr, size, RO);
> > mseal(ptr, size, SEAL_PROT_PKEY);
> > free(ptr);
> >
> > Technically, before mseal() is added, the user can change the
> > protection of the heap by calling mprotect(RO). As long as the user
> > changes the protection back to RW before free(), the memory can be
> > reused.
> >
> > With mseal() in the picture, however, the heap is then partially
> > sealed: the user can still free it, but the memory remains RO, and
> > the result of a brk-shrink is nondeterministic, depending on whether
> > munmap() tries to free the sealed memory (brk uses munmap to shrink
> > the heap).
> >
> > 3> The above two cases led to the third topic:
> > there is one option to address the problems mentioned above.
> > Option 1: a "MAP_SEALABLE" flag in mmap().
> > If a map is created without this flag, the mseal() operation will
> > fail.
> > Applications that are not concerned with sealing will see their
> > behavior unchanged. For those that are concerned, adding a flag at
> > mmap time to opt in is not difficult. For the short term, this solves
> > problems 1 and 2 above. The memory in shm/aio/brk will not have the
> > MAP_SEALABLE flag at mmap(), and the same is true for the heap.
> >
> > If we choose not to go this path, all mappings will be sealable by
> > default. We could document the above-mentioned limitations so
> > developers are more careful when choosing what memory to seal. I
> > think denial of service through mseal() by an attacker is probably
> > not a concern: if attackers have access to mseal() and unsealed
> > memory, they can also do other harmful things to the memory, such as
> > munmap, etc.
> >
> > 4>
> > I think it might be possible to seal the stack or other special
> > mappings created at runtime (vdso, vsyscall, vvar). This means we can
> > enforce and seal W^X for certain types of application. For instance,
> > the stack is typically used in read-write mode, but in some cases it
> > can become executable. To defend against the unintended addition of
> > the executable bit to the stack, we could let the application seal it.
> >
> > Sealing the heap (for adding X) requires special handling, since the
> > heap can shrink, and shrink is implemented through munmap().
> >
> > Indeed, it might be possible that all virtual memory accessible to
> > userspace, regardless of its usage pattern, could be sealed. However,
> > this would require additional research and development work.
> >
> > -----------------------------------------------------------------------------------------------------
> >
> >
> > > Is there a way to avoid walking the tree for the single known VMA?
> > Are you thinking of a hash table to record brk VMAs, or a dedicated
> > tree for sealed VMAs? Possible, but that would be a lot more code.
>
> No, instead of calling a loop to walk the tree to find the same VMA,
> just check the single VMA.
>
> ipc/shm.c: do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end...
>
> So if you just check the single VMA then we don't need to worry about
> re-walking.
>
If you mean having a new function, do_single_vma_munmap(), which checks
the sealing flag and munmaps a single VMA, for use by ipc/shm.c: yes,
we can have that (a rough sketch follows below).

> I think this is a moot point if my outline above works.
>
Yes, I agree. That has a performance impact only for shm. We can do
this optimization as a follow-up patch set.

> >
> > > Does
> > > it make sense to deny mseal writing to brk VMAs?
> > >
> > Yes, it makes sense. Since brk memory doesn't have MAP_SEALABLE at
> > this moment, mseal() will fail even if someone tries to seal it.
> > Sealing brk memory would require more research and design.
> >
> > >
> > > >  	arch_unmap(mm, start, end);
> > > >  	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
> > > >  }
> > >
> > > ...
> > >
> > >
> > > Ah, I see them now. Yes, this is what I expected to see. Does this not
> > > have any impact on mmap/munmap benchmarks?
> > >
> > Thanks for bringing up this topic! I'm kind of hoping for
> > performance-related questions.
> >
> > I haven't done any benchmarks, due to lack of knowledge of how those
> > tests are usually performed.
> >
> > For mseal(), since it will be called in only a few places (libc/elf
> > loading), I'm expecting no real-world impact, and that can be
> > measured when we have implementations in place in libc and
> > elf-loading.
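Returning to the do_single_vma_munmap() idea above, a rough sketch of
such a helper follows. The name, signature, and placement are
illustrative only; the body mirrors do_vma_munmap() as quoted earlier,
including the arch_unmap() call:

----------------------------------------------------------------------------
/*
 * Illustrative only: unmap a single, known VMA for callers such as
 * ksys_shmdt() in ipc/shm.c, checking the sealing flag directly
 * instead of re-walking the maple tree over [vm_start, vm_end).
 */
static int do_single_vma_munmap(struct vma_iterator *vmi,
				struct vm_area_struct *vma,
				struct list_head *uf, bool unlock)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long start = vma->vm_start;
	unsigned long end = vma->vm_end;

	/* One flag test replaces the can_modify_mm() range walk. */
	if (vma->vm_flags & VM_SEALED)
		return -EPERM;

	arch_unmap(mm, start, end);
	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
}
----------------------------------------------------------------------------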
> >
> > The hot path could be on mmap() and munmap(), as you pointed out.
> >
> > mmap() was discussed above (adding a check for MAP_FIXED).
>
> That can probably be dropped as discussed above.
>
Ok.

> >
> > For munmap(), there is a cost in calling can_modify_mm(). I thought
> > about calling can_modify_vma() in do_vmi_align_munmap() instead, but
> > there are two reasons not to:
> >
> > a. It skips arch_unmap, and arch_unmap can unmap the memory.
> > b. The current logic of checking sealing is: if one of the VMAs
> > between start and end is sealed, mprotect/mmap/munmap fails without
> > any VMA being modified. This requires an additional walk over the
> > VMA tree.
>
> Certainly, but it comes at a cost. I was just surprised by the
> statement from the previous discussion that there is no negative, as
> I replied to the cover letter.
>
Ah, the context of my "no downside" comment was specifically "having
the PROT_SEAL flag in mmap()", i.e. combining mmap() and mseal() in
one call.

> > > > +/*
> > > > + * Apply sealing.
> > > > + */
> > > > +static int apply_mm_seal(unsigned long start, unsigned long end)
> > > > +{
> > > > +	unsigned long nstart;
> > > > +	struct vm_area_struct *vma, *prev;
> > > > +
> > > > +	VMA_ITERATOR(vmi, current->mm, start);
> > > > +
> > > > +	vma = vma_iter_load(&vmi);
> > > > +	/*
> > > > +	 * Note: check_mm_seal should have already checked the ENOMEM case,
> > > > +	 * so vma should not be null; same for the other ENOMEM cases.
> > >
> > > The start to end is contiguous, right?
> >
> > Yes. check_mm_seal makes sure that start to end is contiguous.
> >
> > >
> > > > +	 */
> > > > +	prev = vma_prev(&vmi);
> > > > +	if (start > vma->vm_start)
> > > > +		prev = vma;
> > > > +
> > > > +	nstart = start;
> > > > +	for_each_vma_range(vmi, vma, end) {
> > > > +		int error;
> > > > +		unsigned long tmp;
> > > > +		vm_flags_t newflags;
> > > > +
> > > > +		newflags = vma->vm_flags | VM_SEALED;
> > > > +		tmp = vma->vm_end;
> > > > +		if (tmp > end)
> > > > +			tmp = end;
> > > > +		error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
> > > > +		if (error)
> > > > +			return error;
> > >
> > > > +		tmp = vma_iter_end(&vmi);
> > > > +		nstart = tmp;
> > >
> > > You set tmp to vma->vm_end above before using it unconditionally, so
> > > you could set nstart = vma_iter_end(&vmi) here. But we also know the
> > > VMAs are contiguous from your check_mm_seal() call, so we know nstart
> > > == vma->vm_start on the next loop.
> >
> > The code is almost the same as in mlock.c, except that we know the
> > VMAs are contiguous, so we don't check for some of the ENOMEM cases.
> > There might be ways to improve this code. For ease of code review, I
> > chose consistency with mlock for now.
>
> Yes, I thought that was the case. tmp is updated in that code to ensure
> we have reached the end of the range without a gap at the end. Since
> you already checked that the VMAs are contiguous, the last tmp update in
> your loop is not needed.
>
> Thanks,
> Liam
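For what it's worth, the simplification Liam describes might look
roughly like this. It is a sketch only, and it assumes check_mm_seal()
has already verified that [start, end) is covered by contiguous VMAs:

----------------------------------------------------------------------------
	nstart = start;
	for_each_vma_range(vmi, vma, end) {
		int error;
		unsigned long tmp;
		vm_flags_t newflags;

		newflags = vma->vm_flags | VM_SEALED;
		tmp = vma->vm_end;
		if (tmp > end)
			tmp = end;
		error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
		if (error)
			return error;
		/*
		 * The VMAs are contiguous, so the clamped vm_end is the
		 * next vma->vm_start; no vma_iter_end() re-read needed.
		 */
		nstart = tmp;
	}
----------------------------------------------------------------------------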