On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> >
> > Introduce "memfd_secret" system call with the ability to create memory
> > areas visible only in the context of the owning process and not mapped not
> > only to other processes but in the kernel page tables as well.
> >
> > The secretmem feature is off by default and the user must explicitly enable
> > it at the boot time.
> >
> > Once secretmem is enabled, the user will be able to create a file
> > descriptor using the memfd_secret() system call. The memory areas created
> > by mmap() calls from this file descriptor will be unmapped from the kernel
> > direct map and they will be only mapped in the page table of the owning mm.
>
> Is this really true? I guess you meant to say that the memory will
> visible only via page tables to anybody who can mmap the respective file
> descriptor. There is nothing like an owning mm as the fd is inherently a
> shareable resource and the ownership becomes a very vague and hard to
> define term.

Hmm, it seems I've been dragging this paragraph from the very first
mmap(MAP_EXCLUSIVE) rfc and nobody (including myself) noticed the
inconsistency.

> > The file descriptor based memory has several advantages over the
> > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > paves the way for VMMs to remove the secret memory range from the process;
>
> I do not understand how it helps to remove the memory from the process
> as the interface explicitly allows to add a memory that is removed from
> all other processes via direct map.

The current implementation does not help to remove the memory from the
process, but using fd-backed memory seems a better interface to remove
guest memory from host mappings than mmap.
As Andy nicely put it:

"Getting fd-backed memory into a guest will take some possibly major work in
the kernel, but getting vma-backed memory into a guest without mapping it
in the host user address space seems much, much worse."

> > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > usage of a dedicated system call rather than hooking new functionality into
> > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > allows better upwards compatibility.
>
> What is this supposed to mean? What are differences?

Well, the phrasing could indeed be better. It is supposed to mean that the
two calls differ in the semantics behind the file descriptor: memfd_create
implements sealing for shmem and hugetlbfs, while memfd_secret implements
memory hidden from the kernel.

> > The secretmem mappings are locked in memory so they cannot exceed
> > RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
> > mlock() secretmem range would fail and mlockall() will ignore secretmem
> > mappings.
>
> What about munlock?

Isn't this implied? ;-)
I'll add a sentence about it.

--
Sincerely yours,
Mike.
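
For readers following the thread, the userspace flow from the quoted
description (memfd_secret(), then mmap() on the returned fd) might look
roughly like the sketch below. It is only an illustration under assumptions
that are not part of this mail: the syscall number 447 and the need to size
the file with ftruncate() follow the interface as later merged in mainline,
flags are passed as 0, and the helper name secretmem_roundtrip() is
hypothetical. On a kernel without secretmem, or with the feature not
enabled, the helper simply reports the error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: 447 is the memfd_secret number in mainline kernels. */
#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447
#endif

/*
 * Hypothetical helper: create a secretmem fd, map len bytes from it,
 * then write and read back one byte.
 * Returns 0 on success or a negative errno value, e.g. -ENOSYS when the
 * kernel has no secretmem support or the feature is not enabled.
 */
int secretmem_roundtrip(size_t len)
{
	assert(len > 0);

	int fd = (int)syscall(__NR_memfd_secret, 0U);
	if (fd < 0)
		return -errno;

	/* The new file starts empty, so size it before mapping. */
	if (ftruncate(fd, (off_t)len) < 0) {
		int err = errno;
		close(fd);
		return -err;
	}

	/*
	 * Use a shared mapping of the fd. The pages count against
	 * RLIMIT_MEMLOCK, so mmap() can fail if the limit is too small.
	 */
	volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
					 MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		int err = errno;
		close(fd);
		return -err;
	}

	/* The first touch faults the page in; then check it reads back. */
	p[0] = 0x5a;
	int ok = (p[0] == 0x5a);

	munmap((void *)p, len);
	close(fd);
	return ok ? 0 : -EIO;
}
```

A caller would invoke secretmem_roundtrip(4096) and treat any negative
return as "secretmem unavailable or over the locked-memory limit". Anyone
who receives the fd can mmap() it as well, which is the point raised above
about "owning mm" being hard to define.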