On 13.05.21 20:47, Mike Rapoport wrote:
From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
Introduce "memfd_secret" system call with the ability to create
memory areas visible only in the context of the owning process and
mapped neither into other processes nor into the kernel page
tables.
The secretmem feature is off by default and the user must explicitly
enable it at boot time.
Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas
created by mmap() calls from this file descriptor will be unmapped
from the kernel direct map and they will be only mapped in the page
table of the processes that have access to the file descriptor.
The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise().
File descriptor approach allows explict and controlled sharing of the
memory
s/explict/explicit/
areas, and it allows the operations to be sealed. Besides, file descriptor
based memory paves the way for VMMs to remove the secret memory range
from the userpace hipervisor process, for instance QEMU. Andy
Lutomirski says:
s/userpace hipervisor/userspace hypervisor/
"Getting fd-backed memory into a guest will take some possibly major
work in the kernel, but getting vma-backed memory into a guest
without mapping it in the host user address space seems much, much
worse."
memfd_secret() is made a dedicated system call rather than an
extention to
s/extention/extension/
memfd_create() because its purpose is to allow the user to create
more secure memory mappings rather than to simply allow file based
access to the memory. Nowadays the cost of a new system call is
negligible, while it is far simpler for userspace to deal with a
clear-cut system call than with a multiplexer or an overloaded
syscall. Moreover, the initial implementation of memfd_secret() is
completely distinct from memfd_create(), so there is not much sense
in overloading memfd_create() to begin with. If code sharing between
these implementations is ever needed, it can be easily achieved
without adjusting user visible APIs.
The secret memory remains accessible in the process context using
uaccess primitives, but it is not exposed to the kernel otherwise;
secret memory areas are removed from the direct map and functions in
the follow_page()/get_user_page() family will refuse to return a page
that belongs to the secret memory area.
Once there is a use case that requires exposing secretmem to the
kernel, it will be an opt-in request in the system call flags so
that the user would have to decide what data can be exposed to the
kernel.
Maybe spell out an example: like page migration.
Removing pages from the direct map may cause its fragmentation
on architectures that use large pages to map the physical memory,
which affects the system performance. However, the original Kconfig
text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct
map "... can improve the kernel's performance a tiny bit ..." (commit
00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1]
showed that "... although 1G mappings are a good default choice,
there is no compelling evidence that it must be the only choice".
Hence, it is sufficient to have secretmem disabled by default with
the ability of a system administrator to enable it at boot time.
Maybe add a link to the Intel performance evaluation.
Pages in the secretmem regions are unevictable and unmovable to
avoid accidental exposure of the sensitive data via swap or during
page migration.
Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked independently
from mlock(), an attempt to mlock()/munlock() secretmem range would
fail and mlockall()/munlockall() will ignore secretmem mappings.
Maybe add something like "similar to pages pinned by VFIO".
However, unlike mlock()ed memory, secretmem currently behaves more
like long-term GUP: secretmem mappings are unmovable mappings
directly consumed by user space. With default limits, there is no
excessive use of secretmem and it poses no real problem in
combination with ZONE_MOVABLE/CMA, but in the future this should be
addressed to allow balanced use of large amounts of secretmem along
with ZONE_MOVABLE/CMA.
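To illustrate the RLIMIT_MEMLOCK constraint discussed above, here is a
rough sketch of how user space could check a requested size against the
soft limit before creating a secretmem mapping. fits_memlock_limit() is
a hypothetical helper, not something from the patch, and it ignores
memory that is already locked or accounted elsewhere, so it is only a
coarse pre-check.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of the patch): returns 1 if `size`
 * bytes fit within the soft RLIMIT_MEMLOCK, 0 if they do not, and
 * -1 if the limit cannot be queried. Memory that is already locked
 * is not taken into account.
 */
static int fits_memlock_limit(size_t size)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) < 0)
		return -1;

	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;

	return (rlim_t)size <= rl.rlim_cur;
}
```

A caller would typically run this before mmap() on the secretmem file
descriptor and still handle an mmap() failure, since the kernel's own
accounting is what actually decides.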
A page that was a part of the secret memory area is cleared when it
is freed to ensure the data is not exposed to the next user of that
page.
You could skip that with init_on_free (and eventually also with
init_on_alloc) set to avoid double clearing.
The following example demonstrates creation of a secret mapping
(error handling is omitted):
	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
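For readers who want to try this, the same sequence with error handling
might look roughly like the sketch below. It assumes there is no libc
wrapper and goes through syscall(2). The fallback syscall number 447 is
an assumption taken from the x86-64 and generic syscall tables, and
secret_map() is a hypothetical helper name, not part of the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: older headers may lack the definition; 447 is the number
 * in the x86-64 and generic syscall tables. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/*
 * Hypothetical helper: create a secret mapping of `size` bytes.
 * Returns the mapping and stores the file descriptor in *fd_out,
 * or returns NULL with errno set on failure.
 */
static void *secret_map(size_t size, int *fd_out)
{
	void *ptr;
	int saved_errno;
	int fd = (int)syscall(SYS_memfd_secret, 0);

	if (fd < 0)
		return NULL;	/* e.g. ENOSYS on kernels without memfd_secret */

	if (ftruncate(fd, (off_t)size) < 0)
		goto fail;

	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED)
		goto fail;

	*fd_out = fd;
	return ptr;

fail:
	/* Preserve the original error across close(). */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return NULL;
}
```

On success the caller owns both the mapping and the descriptor and
should munmap() and close() them when done.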
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@xxxxxxxxxxxxxxx/
[my mail client messed up the remainder of the mail for whatever reason,
will comment in a separate mail if there is anything to comment :) ]
--
Thanks,
David / dhildenb