Normally, a write to unallocated space in a file, or to a hole in a
sparse file, automatically allocates space; for a memfd, this means
allocating memory. This new seal prevents such automatic allocation,
whether it comes from a direct write() or from a write to a previously
mmap-ed area. The seal does not prevent fallocate(), so an explicit
fallocate() can still allocate and can be used to reserve memory.
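As a rough illustration of the intended userspace flow, the sketch below
creates a sealable memfd, applies the proposed seal when the installed
headers define it, and then reserves memory explicitly with fallocate().
The helper name memfd_reserve() and the memfd name "guest-mem" are
hypothetical. F_SEAL_AUTO_ALLOCATE exists only with this patch applied, so
the sketch deliberately does not hard-code 0x0020: a kernel without this
patch may give that bit a different meaning.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a memfd of `size` bytes and reserve its
 * memory up front. Returns the fd, or -1 on error.
 */
int memfd_reserve(off_t size)
{
    int fd = memfd_create("guest-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);

    if (fd < 0)
        return -1;

    /* Set the file size; the file is sparse and nothing is allocated yet. */
    if (ftruncate(fd, size) < 0)
        goto fail;

#ifdef F_SEAL_AUTO_ALLOCATE
    /*
     * Only compiled against headers that carry this patch. After this,
     * a stray write to a hole should no longer allocate memory.
     */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_AUTO_ALLOCATE) < 0)
        goto fail;
#endif

    /* Intentional allocation: the seal is not meant to block fallocate(). */
    if (fallocate(fd, 0, 0, size) < 0)
        goto fail;

    return fd;

fail:
    close(fd);
    return -1;
}
```

A caller would use it as `int fd = memfd_reserve(1 << 20);` and mmap the
fd afterwards; with the seal in place, only the range reserved this way
would be backed by memory.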
This prevents unintentional allocation from userspace on a stray or
careless write; any intentional allocation should use an explicit
fallocate(). One of the main use cases is avoiding double memory
allocation in confidential computing, where two memfds back guest
memory, only one of them is live at any given time, and we want to
prevent memory allocation for the other memfd, which may have been
mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
Signed-off-by: Chao Peng <chao.p.peng@xxxxxxxxxxxxxxx>
---
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 3 ++-
mm/shmem.c | 16 ++++++++++++++--
3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..98bdabc8e309 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
+#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads"? IIRC, shmem doesn't support the
shared zeropage, so you'll simply allocate a new page via read() or on
read faults.
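One way to check the read-fault part of this concern is a small probe that
touches a hole through a read-only shared mapping and compares st_blocks
before and after. This is only a sketch: the function names are
hypothetical, it runs on a memfd without the proposed seal, and it assumes
st_blocks reflects the pages shmem has allocated. A positive return value
would be consistent with a read fault allocating a page.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* 512-byte blocks currently allocated to fd, or -1 on error. */
static long allocated_blocks(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? (long)st.st_blocks : -1;
}

/*
 * Hypothetical probe: create a one-page sparse memfd, read from it
 * through a shared mapping, and return how many 512-byte blocks the
 * file gained. Returns -1 if the setup fails.
 */
long blocks_gained_by_read_fault(void)
{
    long page = sysconf(_SC_PAGESIZE);
    long before, after;
    void *map;
    int fd = memfd_create("read-fault-probe", MFD_CLOEXEC);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) < 0) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    before = allocated_blocks(fd);
    (void)*(volatile char *)map;    /* read fault on a hole */
    after = allocated_blocks(fd);

    munmap(map, page);
    close(fd);

    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```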
Also, I *think* you can place pages via userfaultfd into shmem. Not sure
if that would count as "auto alloc", but it would certainly bypass fallocate().
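To make the userfaultfd path concrete, here is a minimal sketch that
registers a shared memfd mapping with userfaultfd and places one page into
it with UFFDIO_COPY, without any write() or fallocate() on the file. The
function name and the 0x5a fill pattern are hypothetical, the memfd carries
no seal, and userfaultfd may be unavailable to unprivileged processes on
some systems, in which case the probe simply reports that.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical probe. Returns 1 if the page was placed and holds the
 * payload, 0 if the kernel refused the copy, and -1 if the memfd or
 * userfaultfd could not be set up.
 */
int uffd_place_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg = { 0 };
    struct uffdio_copy copy = { 0 };
    char *dst = MAP_FAILED, *src = MAP_FAILED;
    int ret = -1, uffd = -1;
    int memfd = memfd_create("uffd-probe", MFD_CLOEXEC);

    if (memfd < 0 || ftruncate(memfd, page) < 0)
        goto out;

    dst = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
    src = mmap(NULL, page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED || src == MAP_FAILED)
        goto out;
    memset(src, 0x5a, page);        /* stand-in for a payload */

#ifdef UFFD_USER_MODE_ONLY
    /* Prefer the mode that unprivileged processes are allowed to use. */
    uffd = syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
#endif
    if (uffd < 0)
        uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    if (uffd < 0)
        goto out;

    reg.range.start = (unsigned long)dst;
    reg.range.len = page;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        goto out;

    copy.dst = (unsigned long)dst;
    copy.src = (unsigned long)src;
    copy.len = page;
    ret = 0;
    /* dst is only read after a successful copy, so no fault is left pending. */
    if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
        ret = memcmp(dst, src, page) == 0;

out:
    if (uffd >= 0)
        close(uffd);
    if (dst != MAP_FAILED)
        munmap(dst, page);
    if (src != MAP_FAILED)
        munmap(src, page);
    if (memfd >= 0)
        close(memfd);
    return ret;
}
```

Whether a sealed memfd should reject this kind of placement is exactly the
open question; the sketch only shows that the path does not go through
fallocate().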
I was also thinking this at the same time, but for a different reason:
"Want to populate private preboot memory with firmware payload", so I was
thinking userfaultfd could be an option, as direct writes are restricted?
If that can be a side effect, I'm definitely glad to see it, though I'm
still not clear how userfaultfd can be particularly helpful for that.
I was wondering whether we can use userfaultfd to monitor page faults on
the virtual firmware memory range and use them to populate the private
memory. I'm not sure whether it is a side effect; I was only thinking
theoretically (I've set the idea aside for now, as these enhancements can
be worked on later).
Thanks,
Pankaj