The man page for fcntl() describing memfd file seals states the following about F_SEAL_WRITE:- Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM. With emphasis on _writable_. In turns out in fact that currently the kernel simply disallows _all_ new shared memory mappings for a memfd with F_SEAL_WRITE applied, rendering this documentation inaccurate. This matters because users are therefore unable to obtain a shared mapping to a memfd after write sealing altogether, which limits their usefulness. This was reported in the discussion thread [1] originating from a bug report [2]. This is a product of both using the struct address_space->i_mmap_writable atomic counter to determine whether writing may be permitted, and the kernel adjusting this counter when _any_ VM_SHARED mapping is performed. It seems sensible that we should only update this mapping if VM_MAYWRITE is specified, i.e. whether it is possible that this mapping could at any point be written to. If we do so then all we need to do to permit write seals to function as documented is to clear VM_MAYWRITE when mapping read-only. It turns out this functionality already exists for F_SEAL_FUTURE_WRITE - we can therefore simply adapt this logic to do the same for F_SEAL_WRITE. The final change required is to invoke call_mmap() before mapping_map_writable() - doing so ensures that the memfd-relevant shmem_mmap() or hugetlbfs_file_mmap() custom mmap handlers will be called before the writable test, enabling us to clear VM_MAYWRITE first. Thanks to Andy Lutomirski for the suggestion! [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@xxxxxxxxxxxxxxxxxxxx/ [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238 v2: - Removed RFC tag. - Correct incorrect goto pointed out by Jan. - Reworded cover letter as suggested by Jan. v1: https://lore.kernel.org/all/cover.1680560277.git.lstoakes@xxxxxxxxx/ Lorenzo Stoakes (3): mm: drop the assumption that VM_SHARED always implies writable mm: update seal_check_[future_]write() to include F_SEAL_WRITE as well mm: perform the mapping_map_writable() check after call_mmap() fs/hugetlbfs/inode.c | 2 +- include/linux/fs.h | 4 ++-- include/linux/mm.h | 24 ++++++++++++++++++------ kernel/fork.c | 2 +- mm/filemap.c | 2 +- mm/madvise.c | 2 +- mm/mmap.c | 22 +++++++++++----------- mm/shmem.c | 2 +- 8 files changed, 36 insertions(+), 24 deletions(-) -- 2.40.1