On Mon, Sep 30, 2024 at 05:24:39PM -0700, Jeff Xu wrote:
> Hi Pedro
>
> On Sat, Sep 28, 2024 at 6:43 AM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote:
> >
> > On Fri, Sep 27, 2024 at 06:29:30PM GMT, Jeff Xu wrote:
> > > Hi Pedro,
> > >
> > > On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote:
> > <snip>
> > > > > +
> > > > > + Blocked mm syscall:
> > > > > + - munmap
> > > > > + - mmap
> > > > > + - mremap
> > > > > + - mprotect and pkey_mprotect
> > > > > + - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > > > > + MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > > > > +
> > > > > + The first set of syscall to block is munmap, mremap, mmap. They can
> > > > > + either leave an empty space in the address space, therefore allow
> > > > > + replacement with a new mapping with new set of attributes, or can
> > > > > + overwrite the existing mapping with another mapping.
> > > > > +
> > > > > + mprotect and pkey_mprotect are blocked because they changes the
> > > > change
> > > > > + protection bits (rwx) of the mapping.
> > > > > +
> > > > > + Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> > > > > + MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > > > > + for anonymous memory, when users don't have write permission to the
> > > > > + memory. Those behaviors can alter region contents by discarding pages,
> > > > > + effectively a memset(0) for anonymous memory.
> > > >
> > > > What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
> > > >
> > > MAP_FILE seems not used ?
> > > anonymous mapping is the mapping that is not backed by a file.
> >
> > MAP_FILE is actually defined as 0 usually :) But I meant file-backed private mappings.
> >
> OK, we are on the same page for this.
>
> > > > The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
> > > > on a read-only file mapping. e.g .text. This is obviously wrong?
> > > >
> > > When a MADV_DONTNEED is called, pages will be freed, on file-backed
> > > mapping, if the process reads from the mapping again, the content
> > > will be retrieved from the file.
> > >
> > Sorry, it was late and I gave you a crap example. Consider this:
> > a file-backed MAP_PRIVATE vma is marked RW. I write to it, then RO-it + mseal.
> >
> > The attacker later gets me to MADV_DONTNEED that VMA. You've just lost data.
> >
> > The big problem here is with anon _pages_, not anon vmas.
> >
> That depends on the app's threat-model. What you described seems to be
> a case below
> 1. The file is rw
> 2. The process opens the file as rw
> 3. the process mmap the fd as rw
> 4 The process writes the memory, and the change isn't flushed to the
> file on disk.
> 5 The process changes the mapping to RO
> 6. The process seals the mapping
> 7. The process is called MADV_DONTNEED , and because the change isn't
> flush to file on disk, so it loses the change, (retrieve the old data
> from disk when read from the mapped address later)
>
> I'm not sure this is a valid use case, the problem here seems to be
> that the app needs to flush the change from memory to disk if the
> expectation is writing is permanent.
>

MAP_PRIVATE never does writeback. That's not what this is about.
I can trivially discard anonymous pages for private "file VMAs", which
aren't refilled with the exact same contents. This is a problem.

> In any case, the mseal currently just blocks a subset of madvise, those
> we know with a security implication. If there is something mseal needs
> to block additionally, one can always extend it by using the "flags" field.
> I do think the bar is high though, e.g. a valid use case to support that.

No, this has nothing to do with a flag. It's about providing sane semantics.
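
The data-loss scenario above can be sketched in a few lines of Python. This is only an illustration, not code from the thread: the function name `private_mapping_dontneed_demo` is made up here, the mprotect and mseal steps are left out (Python's `mmap` module has no wrapper for either), and it assumes Linux with Python 3.8 or later for `mmap.madvise`. It shows that a write to a MAP_PRIVATE file mapping lands in a private copy-on-write page that is never written back, and that MADV_DONTNEED drops that page so the next read is refilled from the file.

```python
import mmap
import os
import tempfile


def private_mapping_dontneed_demo():
    """Return (after_write, after_dontneed, on_disk) for a MAP_PRIVATE file mapping.

    Hypothetical helper for illustration; assumes Linux and Python >= 3.8.
    """
    original = b"A" * mmap.PAGESIZE
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, original)
        # File-backed private mapping, initially readable and writable.
        m = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            # The write triggers copy-on-write: the data now lives in an
            # anonymous page that belongs only to this mapping.
            m[:4] = b"BBBB"
            after_write = m[:4]
            # Discard the private page; the file on disk is never touched.
            m.madvise(mmap.MADV_DONTNEED)
            # The next access faults the page back in from the file.
            after_dontneed = m[:4]
        finally:
            m.close()
        with open(path, "rb") as f:
            on_disk = f.read(4)
    finally:
        os.close(fd)
        os.unlink(path)
    return after_write, after_dontneed, on_disk
```

On a typical Linux system the mapping reads `b"BBBB"` after the write and `b"AAAA"` after the madvise call, while the file itself holds `b"AAAA"` throughout. Flushing would not have helped, because a private mapping has nothing to flush.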
>
> > > For anonymous mapping, since there is no file backup, if process
> > > reads from the mapping, 0 is filled, hence equivalent to memset(0)
> > >
> > > > > +
> > > > > + Kernel will return -EPERM for blocked syscalls.
> > > > > +
> > > > > + When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > > > > + - munmap: munmap is atomic. If one of VMAs in the given range is
> > > > > + sealed, none of VMAs are updated.
> > > > > + - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > > > > + when mprotect over multiple VMAs, mprotect might update the beginning
> > > > > + VMAs before reaching the sealed VMA and return -EPERM.
> > > > > + - mmap and mremap: undefined behavior.
> > > >
> > > > mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
> > > > Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)
> > > >
> > > I'm not sure if I can declare mmap/mremap as atomic.
> > >
> > > Although, it might be possible to achieve this due to munmap being
> > > atomic. I'm not sure as I didn't test this. Would you like to find
> > > out ?
> >
> > I just told you they use munmap under the hood. It's just that the requirement isn't actually
> > written down anywhere.
> >
> I knew about mmap/mremap calling munmap. I don't know what exactly you
> are asking though. In your patch and its discussion, you did not mention
> the mmap/mremap (for sealing) is or should be atomic.
>
> My point is: since there isn't a clear statement from your patch description
> or POSIX, that mremap/mmap is atomic, and I haven't tested it myself with
> regards to sealing, let's leave them as "undefined" for now.
> (I could get back to this later after the merging window)
>
> > > >
> > > >
> > > > > Use cases:
> > > > > ==========
> > > > > - glibc:
> > > > > The dynamic linker, during loading ELF executables, can apply sealing to
> > > > > - non-writable memory segments.
> > > > > + mapping segments.
> > > > >
> > > > > - Chrome browser: protect some security sensitive data-structures.
> > > > >
> > > > > -Notes on which memory to seal:
> > > > > -==============================
> > > > > -
> > > > > -It might be important to note that sealing changes the lifetime of a mapping,
> > > > > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > > > > -exec system call is invoked. Applications can apply sealing to any virtual
> > > > > -memory region from userspace, but it is crucial to thoroughly analyze the
> > > > > -mapping's lifetime prior to apply the sealing.
> > > > > +Don't use mseal on:
> > > > > +===================
> > > > > +Applications can apply sealing to any virtual memory region from userspace,
> > > > > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > > > > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > > > > +till the process terminates or the exec system call is invoked.
> > > >
> > > > There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
> > > > At least in its current form.
> > > >
> > > Ya, the mseal is not for most apps. I mention the malloc example to stress that.
> > >
> > > > <snip>
> > > > > -
> > > > > -
> > > > > -Additional notes:
> > > > > -=================
> > > > > As Jann Horn pointed out in [3], there are still a few ways to write
> > > > > -to RO memory, which is, in a way, by design. Those cases are not covered
> > > > > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > > > > -seccomp, LSM, etc) might be considered.
> > > > > +to RO memory, which is, in a way, by design. And those could be blocked
> > > > > +by different security measures.
> > > > >
> > > > > Those cases are:
> > > > > -
> > > > > -- Write to read-only memory through /proc/self/mem interface.
> > > > > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > > -- userfaultfd.
> > > > > + - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > > > > + - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > > + - userfaultfd.
> > > >
> > > > I don't understand how this is not a problem, but MADV_DONTNEED is.
> > > > To me it seems that what we have now is completely useless, because you can trivially
> > > > bypass it using /proc/self/mem, which is enabled on most Linux systems.
> > > >
> > > > Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
> > > > for Chrome. They need to work with every other distro and application as well.
> > > >
> > > > It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
> > > > /proc/<pid>/mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
> > > > as the traditional UNIX permission model allows.
> > > >
> > > IMO, it is a matter of Divide and Conquer. In a nutshell, mseal only
> > > prevents VMA's certain attributes (such as prot bits) from changing.
> > > It doesn't mean to say that sealed RO memory is immutable. To achieve
> > > that, the system needs to apply multiple security measures.
> >
> > No, it's a matter of providing a sane API without tons of edgecases. Making a VMA immutable should make a VMA
> > immutable, and not require you to provide a crap ton of other mechanisms in order to truly make it immutable.
> > If I call mseal, I expect it to be sealed, not "sealed except when it's not, lol".
> >
> > You haven't been able to quite specify what semantics are desirable out of this whole thing. Making
> > prot flags "immutable" is completely worthless if you can simply write to a random pseudofile and
> > have it bypass the whole thing (where a write to /proc/self/mem is semantically equivalent to
> > mprotect RW + write + mprotect RO). Making the vma immutable is completely worthless
> > if I can simply wipe anon pages. There has to be some end goal here (make contents immutable?
> > make sure VMA protection can't be changed? both?) which seems to be unclear from the kernel mmap-side.
> >
> > If you insist on providing half-baked APIs (and waving off any concerns), I'm sure this would've been better
> > implemented as a random bpf program for chrome. Maybe we could revert this whole thing and give eBPF one
> > or two bits of vma flags for their own uses :)
> >

Please reply to the above. We're struggling to understand exactly what semantics you want from this.

*That* is what we want to document and get set in stone, and we'll move from there.

> > >
> > > For writing to /proc/pid/mem, it can be disabled via [1]. SELINUX and
> > > Landlock can achieve the same protection too.
> >
> > I'm not blocking /proc/pid/mem, and my distro doesn't run any of those security modules :/
> >
> It is a choice you can make :-)

Your feature needs to work without "extra choices".

-- 
Pedro
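
For readers who want to try the /proc/self/mem bypass discussed above, here is a rough Python sketch. It is an illustration under stated assumptions, not code from the thread: the helper name `poke_readonly_page` is hypothetical, it assumes Linux with a glibc-style libc reachable through `ctypes.CDLL(None)`, and it does not call mseal at all. It only shows that a page made read-only with mprotect may still be written through the pseudofile. Whether the write goes through depends on kernel configuration, any LSM policy and any sandbox in use, so the function simply reports what the page holds afterwards.

```python
import ctypes
import mmap
import os


def poke_readonly_page():
    """Try to overwrite a read-only page through /proc/self/mem.

    Hypothetical helper for illustration. Returns the first byte of the page
    after the attempt: b"B" if the write went through, b"A" if it was refused.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    libc.mprotect.restype = ctypes.c_int

    # Anonymous private page, writable at first so it can be initialised.
    m = mmap.mmap(-1, mmap.PAGESIZE,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)
    m[:1] = b"A"

    # Obtain the mapping's address, then drop write permission.
    ref = ctypes.c_char.from_buffer(m)
    addr = ctypes.addressof(ref)
    if libc.mprotect(addr, mmap.PAGESIZE, mmap.PROT_READ) != 0:
        raise OSError(ctypes.get_errno(), "mprotect failed")

    try:
        fd = os.open("/proc/self/mem", os.O_RDWR)
        try:
            # The kernel services this write with FOLL_FORCE semantics.
            os.pwrite(fd, b"B", addr)
        finally:
            os.close(fd)
    except OSError:
        # Refused or unavailable (restricted kernel, LSM policy, sandbox).
        pass

    result = m[:1]
    del ref  # release the exported buffer so the mapping can be closed
    m.close()
    return result
```

If the call returns `b"B"`, the read-only protection was bypassed without any mprotect call to restore write access, which is the equivalence with "mprotect RW + write + mprotect RO" described above.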