On Wed, Oct 18, 2023 at 8:08 AM Jeff Xu <jeffxu@xxxxxxxxxx> wrote:
>
> On Tue, Oct 17, 2023 at 9:54 AM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, 17 Oct 2023 at 02:08, <jeffxu@xxxxxxxxxxxx> wrote:
> > >
> > > Of all the call paths that call into do_vmi_munmap(),
> > > this is the only place where checkSeals = MM_SEAL_MUNMAP.
> > > The rest has checkSeals = 0.
> >
> > Why?
> >
> > None of this makes sense.
> >
> > So you say "we can't munmap in this *one* place, but all others ignore
> > the sealing".
> >
> I apologize that previously I described what this code does, and not
> the reasoning.
>
> In our threat model, as Stephen Röttger points out in [1], and I quote:
>
> V8 exploits typically follow a similar pattern: an initial bug leads
> to memory corruption but often the initial corruption is limited and
> the attacker has to find a way to arbitrarily read/write in the whole
> address space.
>
> The memory corruption is in the user space process, e.g. Chrome.
> Attackers will try to modify the permissions of the memory, by calling
> mprotect, or munmap then mmap to the same address but with different
> permissions, etc.
>
> Sealing blocks mprotect/munmap/mremap/mmap calls from the user space
> process, e.g. Chrome.
>
> When handling those 4 syscalls, we need to check the seal
> (can_modify_mm). This requires locking the VMA
> (mmap_write_lock_killable), and ideally happens after validating the
> syscall input. The reasonable place for can_modify_mm() is in utility
> functions, such as do_mmap(), do_vmi_munmap(), etc.
>
> However, there is no guarantee that do_mmap() and do_vmi_munmap() are
> only reachable from the mprotect/munmap/mremap/mmap syscall entry
> points (SYSCALL_DEFINE_XX). In theory, the kernel can call those in
> other scenarios, and some of them can be perfectly legit. Those other
> scenarios are not covered by our threat model at this time. Therefore,
> we need a flag, passed from the SYSCALL_DEFINE_XX entry down to
> can_modify_mm(), to differentiate those other scenarios.
>
> Now, back to the code: it does some optimization, i.e. it doesn't pass
> the flag from SYSCALL_DEFINE_XX in all cases. If SYSCALL_DEFINE_XX
> calls do_a, and do_a has only one caller, I set the flag in do_a
> instead of in SYSCALL_DEFINE_XX. Doing this reduces the size of the
> patchset, but it also makes the code less readable indeed. I could
> remove this optimization in V3. I welcome suggestions to improve
> readability on this.
>
> When handling mprotect/munmap/mremap/mmap, once the code has passed
> can_modify_mm(), it means the memory area is not sealed. If the code
> continues to call the other utility functions, we don't need to check
> the seal again. This is the case for mremap(): the seals of the src
> address and dest address (when applicable) are checked first, so later,
> when the code calls do_vmi_munmap(), it no longer needs to check the
> seal again.
>
> [1] https://v8.dev/blog/control-flow-integrity
>
> -Jeff

There is also an alternative approach:

For all the places that call do_vmi_munmap(), find out which cases can
legitimately ignore the seal, set an ignore_seal flag there, and pass
it down into do_vmi_munmap(). All remaining cases use the default
behavior, which is to check the seal.

The upside: all future APIs will automatically be covered by sealing,
because checking is the default.

The risky side: if I miss a case that requires setting ignore_seal,
there will be a bug. Also, if a driver calls the utility functions to
unmap memory, the seal will be checked as well. (Drivers are not in
our threat model, but Chrome probably doesn't mind.)

Which of these two approaches is better? I would appreciate direction
on this.

Thanks!
-Jeff
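
To make the comparison concrete, here is a minimal userspace sketch of
the two approaches. It is a toy model under stated assumptions, not the
patch: struct toy_vma, munmap_opt_in(), munmap_opt_out() and demo() are
hypothetical names, the value of MM_SEAL_MUNMAP is made up, and
can_modify_mm() here takes a simplified argument list. Only the names
can_modify_mm, MM_SEAL_MUNMAP, checkSeals (check_seals below) and
ignore_seal come from this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of the discussion; this is NOT kernel code. */

#define MM_SEAL_MUNMAP 0x1UL	/* name from the thread, value assumed */

struct toy_vma {		/* hypothetical stand-in for a mapping */
	unsigned long seals;	/* seal bits set on this mapping */
	bool mapped;
};

/* Simplified: true if none of the requested seals is set. */
static bool can_modify_mm(const struct toy_vma *vma, unsigned long check_seals)
{
	return (vma->seals & check_seals) == 0;
}

/* Approach 1 (opt-in): the caller decides which seals get checked. */
int munmap_opt_in(struct toy_vma *vma, unsigned long check_seals)
{
	if (!can_modify_mm(vma, check_seals))
		return -1;	/* blocked; the real errno is not modeled */
	vma->mapped = false;
	return 0;
}

/* Approach 2 (opt-out): always check unless the caller sets ignore_seal. */
int munmap_opt_out(struct toy_vma *vma, bool ignore_seal)
{
	if (!ignore_seal && !can_modify_mm(vma, MM_SEAL_MUNMAP))
		return -1;
	vma->mapped = false;
	return 0;
}

int demo(void)
{
	struct toy_vma sealed = { .seals = MM_SEAL_MUNMAP, .mapped = true };

	/* Opt-in: the munmap syscall path passes the flag and is blocked. */
	assert(munmap_opt_in(&sealed, MM_SEAL_MUNMAP) == -1);
	assert(sealed.mapped);
	/* Opt-in: a caller passing 0, or a new caller that forgets, succeeds. */
	assert(munmap_opt_in(&sealed, 0) == 0);
	assert(!sealed.mapped);

	sealed.mapped = true;
	/* Opt-out: any caller using the default is blocked. */
	assert(munmap_opt_out(&sealed, false) == -1);
	assert(sealed.mapped);
	/* Opt-out: only an audited caller that sets ignore_seal succeeds. */
	assert(munmap_opt_out(&sealed, true) == 0);
	assert(!sealed.mapped);
	return 0;
}
```

In this model the difference is only in what a forgotten or new caller
gets: with opt-in it silently bypasses the seal, with opt-out it is
checked, and a missed ignore_seal shows up as an unexpected failure
rather than as a hole.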