On Thu, Jul 21, 2022, David Hildenbrand wrote:
> On 21.07.22 11:44, David Hildenbrand wrote:
> > On 06.07.22 10:20, Chao Peng wrote:
> >> Normally, a write to unallocated space of a file or the hole of a sparse
> >> file automatically causes space allocation, for memfd, this equals to
> >> memory allocation. This new seal prevents such automatically allocating,
> >> either this is from a direct write() or a write on the previously
> >> mmap-ed area. The seal does not prevent fallocate() so an explicit
> >> fallocate() can still cause allocating and can be used to reserve
> >> memory.
> >>
> >> This is used to prevent unintentional allocation from userspace on a
> >> stray or careless write and any intentional allocation should use an
> >> explicit fallocate(). One of the main usecases is to avoid memory double
> >> allocation for confidential computing usage where we use two memfds to
> >> back guest memory and at a single point only one memfd is alive and we
> >> want to prevent memory allocation for the other memfd which may have
> >> been mmap-ed previously. More discussion can be found at:
> >>
> >>   https://lkml.org/lkml/2022/6/14/1255
> >>
> >> Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> >> Signed-off-by: Chao Peng <chao.p.peng@xxxxxxxxxxxxxxx>
> >> ---
> >>  include/uapi/linux/fcntl.h |  1 +
> >>  mm/memfd.c                 |  3 ++-
> >>  mm/shmem.c                 | 16 ++++++++++++++--
> >>  3 files changed, 17 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> >> index 2f86b2ad6d7e..98bdabc8e309 100644
> >> --- a/include/uapi/linux/fcntl.h
> >> +++ b/include/uapi/linux/fcntl.h
> >> @@ -43,6 +43,7 @@
> >>  #define F_SEAL_GROW          0x0004  /* prevent file from growing */
> >>  #define F_SEAL_WRITE         0x0008  /* prevent writes */
> >>  #define F_SEAL_FUTURE_WRITE  0x0010  /* prevent future writes while mapped */
> >> +#define F_SEAL_AUTO_ALLOCATE 0x0020  /* prevent allocation for writes */
> >
> > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> > shared zeropage, so you'll simply allocate a new page via read() or on
> > read faults.
>
> Correction: on read() we don't allocate a fresh page. But on read faults
> we would. So this comment here needs clarification.

Not just the comment, the code too.  The intent of F_SEAL_AUTO_ALLOCATE is
very much to block _all_ implicit allocations (or maybe just fault-based
allocations if "implicit" is too broad of a description).
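
For anyone who wants to check the read() vs. read-fault distinction above
from userspace, here is a rough sketch. It does not use the proposed seal
at all; it only compares st_blocks on an unsealed memfd hole after a read()
and after a read fault through a shared mapping. The helper names
(allocated_blocks, blocks_after_access) and the "hole-demo" memfd name are
made up for illustration, and the raw syscall() is used only to avoid
depending on a libc memfd_create() wrapper. It assumes Linux with memfd
support and that st_blocks tracks shmem page allocation.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Blocks (512-byte units) currently charged to fd, or -1 on error. */
static long allocated_blocks(int fd)
{
	struct stat st;

	return fstat(fd, &st) == 0 ? (long)st.st_blocks : -1;
}

/*
 * Hypothetical helper: create a one-page memfd hole and touch it once.
 * use_mmap == 0: read() one byte from the hole.
 * use_mmap != 0: take a read fault through a MAP_SHARED mapping.
 * Returns the block count after the access, or -1 on any error.
 */
static long blocks_after_access(int use_mmap)
{
	long page = sysconf(_SC_PAGESIZE);
	long blocks = -1;
	int fd = (int)syscall(SYS_memfd_create, "hole-demo", 0U);

	if (fd < 0)
		return -1;

	/* Extending the file creates a hole; nothing should be allocated yet. */
	if (ftruncate(fd, page) != 0 || allocated_blocks(fd) != 0)
		goto out;

	if (use_mmap) {
		volatile unsigned char *p =
			mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);

		if (p == MAP_FAILED)
			goto out;
		(void)p[0];	/* read fault on the hole */
		munmap((void *)p, (size_t)page);
	} else {
		unsigned char byte;

		if (read(fd, &byte, 1) != 1)
			goto out;
	}

	blocks = allocated_blocks(fd);
out:
	close(fd);
	return blocks;
}
```

Calling blocks_after_access(0) and blocks_after_access(1) and printing the
two results should show whether the read fault left a page behind while the
read() did not, which is the case the seal's comment and code would need to
cover.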