Hi everybody, as discussed in the linux-mm alignment session on Wednesday, this is part 1 of the COW fixes: fix the COW security issue using GUP-triggered unsharing of shared anonymous pages (ordinary, THP, hugetlb). In the meeting slides, this approach was referred to as "Copy On Read". If anybody wants to have access to the slides, please feel free to reach out. The patches are based on v5.16-rc5 and available at: https://github.com/davidhildenbrand/linux/pull/new/unshare_v1 It is currently again possible for a child process to observe modifications of anonymous pages performed by the parent process after fork() in some cases, which is not only a violation of the POSIX semantics of MAP_PRIVATE, but more importantly a real security issue. This issue, including other related COW issues, has been summarized at [1]: " 1. Observing Memory Modifications of Private Pages From A Child Process Long story short: process-private memory might not be as private as you think once you fork(): successive modifications of private memory regions in the parent process can still be observed by the child process, for example, by smart use of vmsplice()+munmap(). The core problem is that pinning pages readable in a child process, such as done via the vmsplice system call, can result in a child process observing memory modifications done in the parent process the child is not supposed to observe. [1] contains an excellent summary and [2] contains further details. This issue was assigned CVE-2020-29374 [9]. For this to trigger, it's required to use a fork() without subsequent exec(), for example, as used under Android zygote. Without further details about an application that forks less-privileged child processes, one cannot really say what's actually affected and what's not -- see the details section the end of this mail for a short sshd/openssh analysis. While commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") fixed this issue and resulted in other problems (e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page() simplification") re-introduced part of the problem unfortunately. The original reproducer can be modified quite easily to use THP [3] and make the issue appear again on upstream kernels. I modified it to use hugetlb [4] and it triggers as well. The problem is certainly less severe with hugetlb than with THP; it merely highlights that we still have plenty of open holes we should be closing/fixing. Regarding vmsplice(), the only known workaround is to disallow the vmsplice() system call ... or disable THP and hugetlb. But who knows what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in the end, it's a more generic issue. " This security issue was first reported by Jann Horn on 27 May 2020 and it currently affects anonymous THP and hugetlb again. The "security issue" part for hugetlb might be less important than for THP. However, with this approach it's just easy to get the MAP_PRIVATE semantics of any anonymous pages in that regard and avoid any such information leaks without much added complexity. Ordinary anonymous pages are currently not affected, because the COW logic was changed in commit 09854ba94c6a ("mm: do_wp_page() simplification") for them to COW on "page_count() != 1" instead of "mapcount > 1", which unfortunately results in other COW issues, some of them documented in [1] as well. To fix this COW issue once and for all, introduce GUP-triggered unsharing that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to traditional COW, unsharing will leave the copied page mapped write-protected in the page table, not having the semantics of a write fault. Logically, unsharing is triggered "early", as soon as GUP performs the action that could result in a COW getting missed later and the security issue triggering: however, unsharing is not triggered as before via a write fault with undesired side effects. Long story short, GUP triggers unsharing if all of the following conditions are met: * The page is mapped R/O * We have an anonymous page, excluding KSM * We want to read (!FOLL_WRITE) * Unsharing is not disabled (!FOLL_NOUNSHARE) * We want to take a reference (FOLL_GET or FOLL_PIN) * The page is a shared anonymous page: mapcount > 1 To reliably detect shared anonymous THP without heavy locking, introduce a mapcount_seqcount seqlock that protects the mapcount of a THP and can be used to read an atomic mapcount value. The mapcount_seqlock is stored inside the memmap of the compound page -- to keep it simple, factor out a raw_seqlock_t from the seqlock_t. As this patch series introduces the same unsharing logic for any anonymous pages, it also paves the way to fix other COW issues, e.g., documented in [1], without reintroducing the security issue or reintroducing other issues we observed in the past (e.g., broken ptrace on pmem). All reproducers for this COW issue have been consolidated in the selftest included in this series. Hopefully we'll get this fixed for good. Future work: * get_user_pages_fast_only() can currently spin on the mapcount_seqcount when reading the mapcount, which might be a rare event. While this is fine even when done from get_user_pages_fast_only() in IRQ context, we might want to just fail fast in get_user_pages_fast_only(). We already have patches prepared that add page_anon_maybe_shared() and page_trans_huge_anon_maybe_shared() that will return "true" in case spinning would be required and make get_user_pages_fast_only() fail fast. I'm excluding them for simplicity. ... even better would be finding a way to just not need the mapcount_seqcount, but THP splitting and PageDoubleMap() gives us a hard time -- but maybe we'll eventually find a way someday :) * Part 2 will tackle the other user-space visible breakages / COW issues raised in [1]. This series is the basis for adjusting the COW logic once again without re-introducing the COW issue fixed in this series and without reintroducing the issues we saw with the original CVE fix (e.g., breaking ptrace on pmem). There might be further parts to improve the GUP long-term <-> MM synchronicity and to optimize some things around that. The idea is by Andrea and some patches are rewritten versions of prototype patches by Andrea. I cross-compiled and tested as good as possible. I'll CC locking+selftest folks only on the relevant patch and the cover letter to minimze the noise. I'll put everyone on CC who was either involved with the COW issues in the past or attended the linux-mm alignment session on Wednesday. Appologies if I forget anyone :) [1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@xxxxxxxxxx David Hildenbrand (11): seqlock: provide lockdep-free raw_seqcount_t variant mm: thp: consolidate mapcount logic on THP split mm: simplify hugetlb and file-THP handling in __page_mapcount() mm: thp: simlify total_mapcount() mm: thp: allow for reading the THP mapcount atomically via a raw_seqlock_t mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb) mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (!hugetlb) mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (hugetlb) mm: thp: introduce and use page_trans_huge_anon_shared() selftests/vm: add tests for the known COW security issues Documentation/locking/seqlock.rst | 50 ++++ include/linux/huge_mm.h | 72 +++++ include/linux/mm.h | 14 + include/linux/mm_types.h | 9 + include/linux/seqlock.h | 145 +++++++--- mm/gup.c | 89 +++++- mm/huge_memory.c | 120 +++++++-- mm/hugetlb.c | 129 +++++++-- mm/memory.c | 136 ++++++++-- mm/rmap.c | 40 +-- mm/swapfile.c | 35 ++- mm/util.c | 24 +- tools/testing/selftests/vm/Makefile | 1 + tools/testing/selftests/vm/gup_cow.c | 312 ++++++++++++++++++++++ tools/testing/selftests/vm/run_vmtests.sh | 16 ++ 15 files changed, 1044 insertions(+), 148 deletions(-) create mode 100644 tools/testing/selftests/vm/gup_cow.c -- 2.31.1