On 17.12.21 20:04, Linus Torvalds wrote: > On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand <david@xxxxxxxxxx> wrote: >> >> + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not >> + * set) and then unmaps the target page, we have: >> + * >> + * * page has mapcount == 1 and refcount > 1 > Hi Linus, > All these games with mapcount makes me think this is still broken. > > mapcount has been a horribly broken thing in the past, and I'm not > convinced it's not a broken thing now. It all started when Jann detected the security issue in GUP – and this patch set is fixing the problem exactly there, in GUP itself. Are you aware of COW issues regarding the mapcount if we would remove GUP from the equation? My point is that COW without GUP works just perfectly fine, but I’ll be happy to learn about other cases I was ignoring so far. Unfortunately, page_count() is even more unreliable, and the issues we're just detecting (see the link in the cover letter: memory corruptions inside user space -- e.g., lost DMA writes) are even worse than what we had before -- IMHO of course. > >> + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte); >> + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) && >> + page_mapcount(vmf->page) > 1) { > > What keeps the mapcount stable in here? So, we're reading an atomic value here. It’s read via atomic_read for regular pages, and the THP mapcount case has also been made atomic (as lockless as page_count) in patch #5. If a page is mapped exactly once, page_mapcount(page) == 1 and there is nothing to do. If the page is mapped more than once, page_mapcount(page) > 1 and we would have to trigger unsharing. And it’s true that the value is unstable in this case, but we really only care about page_mapcount(page) > 1 vs. page_mapcount(page) == 1. In this respect, there is no difference from the instability of the page_count and the mapcount – we still only care if it’s >1 or == 1. So the only case we could care about is concurrent additional mappings that can increment the mapcount -- which can only happen due to concurrent fork. So if we're reading page_mapcount(page) == 1 the only way we can get page_mapcount(page) > 1 is due to fork(). But we're holding the mmap_lock in read mode during faults and fork requires the mmap_lock in write mode. > > And I still believe that the whole notion that "COW should use > mapcount" is pure and utter garbage. > > If we are doing a COW, we need an *exclusive* access to the page. That > is not mapcount, that is the page ref. I thought about this a lot, because initially I had the same opinion. But really, we don't care about any speculative references (pagecache, migration, daemon, pagevec, ...) or any short-term "I just want to grab this reference real quick such that the page can't get freed" references. All we care about are GUP references, and we attack that problem at the root by triggering unsharing exactly at the point where GUP comes into play. So IMHO GUP is the problem and needs unsharing either: * On write access to a shared anonymous page, which is just COW as we know it. * On read access to a shared anonymous page, which is what we’re proposing in this patch set. So as soon as GUP comes into play, even if only pinning R/O, we have to trigger unsharing . Doing so enforces the invariant that it is impossible to take a GUP pin on an anonymous page with a mapcount > 1. In turn, the COW does not need to worry about the GUP after fork() security issue anymore and it can focus on doing optimally the COW faults as if GUP just wouldn’t exist. -- Thanks, David / dhildenb