On Mon 11-10-21 18:52:44, Suren Baghdasaryan wrote:
> From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>
> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
>
> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> ambiguous: the page can be COW'ed at any time afterwards, and the
> direction of a COW event isn't defined.
>
> Yes, whoever writes to it will generally do the COW, but if the thread
> that did the get_user_pages() unmapped the page before the write (and
> that could happen due to memory pressure in addition to any outright
> action), the writer could also just take over the old page instead.
>
> End result: the get_user_pages() call might result in a page pointer
> that is no longer associated with the original VM, and is associated
> with - and controlled by - another VM having taken it over instead.
>
> So when doing a get_user_pages() on a COW mapping, the only really safe
> thing to do would be to break the COW when getting the page, even when
> only getting it for reading.
>
> At the same time, some users simply don't even care.
>
> For example, the perf code wants to look up the page not because it
> cares about the page, but because the code simply wants to look up the
> physical address of the access for informational purposes, and doesn't
> really care about races when a page might be unmapped and remapped
> elsewhere.
>
> This adds logic to force a COW event by setting FOLL_WRITE on any
> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> pointer as a result.
>
> The current semantics end up being:
>
>  - __get_user_pages_fast(): no change. If you don't ask for a write,
>    you won't break COW. You'd better know what you're doing.
>
>  - get_user_pages_fast(): the fast-case "look it up in the page tables
>    without anything getting mmap_sem" now refuses to follow a read-only
>    page, since it might need COW breaking. Which happens in the slow
>    path - the fast path doesn't know if the memory might be COW or not.
>
>  - get_user_pages() (including the slow-path fallback for gup_fast()):
>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
>    very similar semantics to FOLL_FORCE.
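For anyone reviewing the backport, the rule above boils down to roughly
the sketch below. This is only an illustration, not the literal 4.4
code; is_cow_mapping() is spelled out here because the patch itself
does not show its definition:

	/* A private mapping that may be written to is subject to COW. */
	static inline bool is_cow_mapping(vm_flags_t flags)
	{
		return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	}

	/* Any ref-taking lookup (FOLL_GET) on such a mapping... */
	static inline bool should_force_cow_break(struct vm_area_struct *vma,
						  unsigned int flags)
	{
		return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
	}

	/*
	 * ...is upgraded to a write, so the COW is broken toward the
	 * GUP caller before the page pointer is returned:
	 */
	if (should_force_cow_break(vma, foll_flags))
		foll_flags |= FOLL_WRITE;
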
> If it turns out that we want finer granularity (ie "only break COW when
> it might actually matter" - things like the zero page are special and
> don't need to be broken) we might need to push these semantics deeper
> into the lookup fault path. So if people care enough, it's possible
> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> with the internal FOLL_COW flag we already have for tracking "I had a
> COW".
>
> Alternatively, if it turns out that different callers might want to
> explicitly control the forced COW break behavior, we might even want to
> make such a flag visible to the users of get_user_pages() instead of
> using the above default semantics.
>
> But for now, this is mostly commentary on the issue (this commit message
> being a lot bigger than the patch, and that patch in turn is almost all
> comments), with that minimal "enable COW breaking early" logic using the
> existing FOLL_WRITE behavior.
>
> [ It might be worth noting that we've always had this ambiguity, and it
>   could arguably be seen as a user-space issue.
>
>   You only get private COW mappings that could break either way in
>   situations where user space is doing cooperative things (ie fork()
>   before an execve() etc), but it _is_ surprising and very subtle, and
>   fork() is supposed to give you independent address spaces.
>
>   So let's treat this as a kernel issue and make the semantics of
>   get_user_pages() easier to understand. Note that obviously a true
>   shared mapping will still get a page that can change under us, so this
>   does _not_ mean that get_user_pages() somehow returns any "stable"
>   page ]
>
> [surenb: backport notes
>   Since gup_pgd_range does not exist, made appropriate changes to the
>   gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
>   Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
>   gup_huge_pd and gup_pud_range.
>   Removed FOLL_PIN usage in should_force_cow_break since it's missing in
>   the earlier kernels.]

I'd be really careful with backporting this to stable. If I remember
right, this change caused a lot of userspace breakage which needed to be
fixed up later. There is a nice summary at
https://lwn.net/Articles/849638/ and https://lwn.net/Articles/849876/,
and some problems are still being found...

								Honza

> Reported-by: Jann Horn <jannh@xxxxxxxxxx>
> Tested-by: Christoph Hellwig <hch@xxxxxx>
> Acked-by: Oleg Nesterov <oleg@xxxxxxxxxx>
> Acked-by: Kirill Shutemov <kirill@xxxxxxxxxxxxx>
> Acked-by: Jan Kara <jack@xxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> [surenb: backport to 4.4 kernel]
> Cc: stable@xxxxxxxxxxxxxxx # 4.4.x
> Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> ---
>  mm/gup.c         | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  mm/huge_memory.c |  7 +++----
>  2 files changed, 43 insertions(+), 12 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 4c5857889e9d..c80cdc408228 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -59,13 +59,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  }
>
>  /*
> - * FOLL_FORCE can write to even unwritable pte's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>  {
> -	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> +}
> +
> +/*
> + * A (separate) COW fault might break the page the other way and
> + * get_user_pages() would return the page from what is now the wrong
> + * VM. So we need to force a COW break at GUP time even for reads.
> + */
> +static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> +{
> +	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
>  }
>
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
> @@ -509,12 +518,18 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		if (!vma || check_vma_flags(vma, gup_flags))
>  			return i ? : -EFAULT;
>  		if (is_vm_hugetlb_page(vma)) {
> +			if (should_force_cow_break(vma, foll_flags))
> +				foll_flags |= FOLL_WRITE;
>  			i = follow_hugetlb_page(mm, vma, pages, vmas,
>  					&start, &nr_pages, i,
> -					gup_flags);
> +					foll_flags);
>  			continue;
>  		}
>  	}
> +
> +	if (should_force_cow_break(vma, foll_flags))
> +		foll_flags |= FOLL_WRITE;
> +
>  retry:
>  	/*
>  	 * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -1346,6 +1361,10 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  /*
>   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
>   * the regular GUP. It will only return non-negative values.
> + *
> + * Careful, careful! COW breaking can go either way, so a non-write
> + * access can get ambiguous page results. If you call this function without
> + * 'write' set, you'd better be sure that you're ok with that ambiguity.
>   */
>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			  struct page **pages)
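To make the fast-path semantics concrete, a hedged caller-side sketch
(not code from the patch; 'addr' is a made-up example variable): a
read-only get_user_pages_fast() can no longer complete the lockless
walk on a read-only pte and instead falls back to the slow path, which
applies should_force_cow_break() before returning the page:

	struct page *page;
	/* write == 0: a pure read of a single user page */
	int ret = get_user_pages_fast(addr, 1, 0, &page);

	if (ret == 1) {
		/*
		 * On a private COW mapping the slow path has already
		 * broken the COW toward us, so this page belongs to the
		 * calling mm and cannot end up owned by the other side
		 * of a racing fork() + write.
		 */
		put_page(page);
	}
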
> @@ -1375,6 +1394,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	 *
>  	 * We do not adopt an rcu_read_lock(.) here as we also want to
>  	 * block IPIs that come from THPs splitting.
> +	 *
> +	 * NOTE! We allow read-only gup_fast() here, but you'd better be
> +	 * careful about possible COW pages. You'll get _a_ COW page, but
> +	 * not necessarily the one you intended to get depending on what
> +	 * COW event happens after this. COW may break the page copy in a
> +	 * random direction.
>  	 */
>
>  	local_irq_save(flags);
> @@ -1385,15 +1410,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
>  			break;
> +		/*
> +		 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +		 * because get_user_pages() may need to cause an early COW in
> +		 * order to avoid confusing the normal COW routines. So only
> +		 * targets that are already writable are safe to do by just
> +		 * looking at the page tables.
> +		 */
>  		if (unlikely(pgd_huge(pgd))) {
> -			if (!gup_huge_pgd(pgd, pgdp, addr, next, write,
> +			if (!gup_huge_pgd(pgd, pgdp, addr, next, 1,
>  					  pages, &nr))
>  				break;
>  		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
>  			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
> -					 PGDIR_SHIFT, next, write, pages, &nr))
> +					 PGDIR_SHIFT, next, 1, pages, &nr))
>  				break;
> -		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> +		} else if (!gup_pud_range(pgd, addr, next, 1, pages, &nr))
>  			break;
>  	} while (pgdp++, addr = next, addr != end);
>  	local_irq_restore(flags);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6404e4fcb4ed..fae45c56e2ee 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1268,13 +1268,12 @@ out_unlock:
>  }
>
>  /*
> - * FOLL_FORCE can write to even unwritable pmd's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
>  {
> -	return pmd_write(pmd) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> +	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
>  }
>
>  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> --
> 2.33.0.882.g93a45727a2-goog

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR