On Tue, May 02, 2023 at 02:46:28PM +0200, Christian Borntraeger wrote:
> Am 02.05.23 um 01:11 schrieb Lorenzo Stoakes:
> > Writing to file-backed dirty-tracked mappings via GUP is inherently broken
> > as we cannot rule out folios being cleaned and then a GUP user writing to
> > them again and possibly marking them dirty unexpectedly.
> >
> > This is especially egregious for long-term mappings (as indicated by the
> > use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
> > we have already done in the slow path.
>
> Hmm, does this interfere with KVM on s390 and PCI interpretation of
> interrupt delivery? It would no longer work with file-backed memory,
> correct?
>
> See
> arch/s390/kvm/pci.c
>
> kvm_s390_pci_aif_enable
> which does have
> FOLL_WRITE | FOLL_LONGTERM
> to

Does this memory map a dirty-tracked file? It's hard to tell where the
address originates without digging through a ton of code. In the worst
case, if the fast path can't whitelist the folio, it'll fall back to the
slow path, which explicitly checks for a dirty-tracked filesystem.

We can reintroduce a flag to permit exceptions if this is really broken -
are you able to test? I don't have an s390 sitting around :)

> > We have access to less information in the fast path as we cannot examine
> > the VMA containing the mapping; however, we can determine whether the
> > folio is anonymous and then whitelist known-good mappings - specifically
> > hugetlb and shmem mappings.
> >
> > While we obtain a stable folio for this check, the mapping might not be
> > stable, as a truncate could nullify it at any time. Since doing so
> > requires mappings to be zapped, we can synchronise against a TLB
> > shootdown operation.
> >
> > For some architectures, TLB shootdown is synchronised by IPI, against
> > which we are protected as the GUP-fast operation is performed with
> > interrupts disabled. However, other architectures which specify
> > CONFIG_MMU_GATHER_RCU_TABLE_FREE use an RCU lock for this operation.
> >
> > In these instances, we acquire an RCU lock while performing our checks.
> > If we cannot get a stable mapping, we fall back to the slow path, as
> > otherwise we'd have to walk the page tables again and it's simpler and
> > more effective to just fall back.
> >
> > It's important to note that there are no APIs allowing users to specify
> > FOLL_FAST_ONLY for PUP-fast, let alone combined with FOLL_LONGTERM, so
> > we can always rely on the fact that if we fail to pin on the fast path,
> > the code will fall back to the slow path, which can perform the more
> > thorough check.
> >
> > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > Suggested-by: Kirill A. Shutemov <kirill@xxxxxxxxxxxxx>
> > Signed-off-by: Lorenzo Stoakes <lstoakes@xxxxxxxxx>
> > ---
> >  mm/gup.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 85 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 0f09dec0906c..431618048a03 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/migrate.h>
> >  #include <linux/mm_inline.h>
> >  #include <linux/sched/mm.h>
> > +#include <linux/shmem_fs.h>
> >
> >  #include <asm/mmu_context.h>
> >  #include <asm/tlbflush.h>
> > @@ -95,6 +96,77 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >  	return folio;
> >  }
> >
> > +#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > +	struct address_space *mapping = READ_ONCE(folio->mapping);
> > +
> > +	rcu_read_lock();
> > +
> > +	return mapping == READ_ONCE(folio->mapping);
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +	rcu_read_unlock();
> > +}
> > +#else
> > +static bool stabilise_mapping_rcu(struct folio *folio)
> > +{
> > +	return true;
> > +}
> > +
> > +static void unlock_rcu(void)
> > +{
> > +}
> > +#endif
> > +
> > +/*
> > + * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
> > + * FOLL_WRITE pin is permitted for a specific folio.
> > + *
> > + * This assumes the folio is stable and pinned.
> > + *
> > + * Writing to pinned file-backed dirty-tracked folios is inherently
> > + * problematic (see the comment describing the
> > + * writeable_file_mapping_allowed() function). We therefore try to avoid the
> > + * most egregious case of a long-term mapping doing so.
> > + *
> > + * This function cannot be as thorough as that one, as the VMA is not
> > + * available in the fast path, so instead we whitelist known good cases.
> > + *
> > + * The folio is stable, but the mapping might not be. When truncating, for
> > + * instance, a zap is performed which triggers TLB shootdown. IRQs are
> > + * disabled so we are safe from an IPI, but some architectures use an RCU
> > + * lock for this operation, so we acquire an RCU lock to ensure the mapping
> > + * is stable.
> > + */
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > +	bool ret;
> > +
> > +	/* hugetlb mappings do not require dirty tracking. */
> > +	if (folio_test_hugetlb(folio))
> > +		return true;
> > +
> > +	if (stabilise_mapping_rcu(folio)) {
> > +		struct address_space *mapping = folio_mapping(folio);
> > +
> > +		/*
> > +		 * Neither anonymous nor shmem-backed folios require
> > +		 * dirty tracking.
> > +		 */
> > +		ret = folio_test_anon(folio) ||
> > +			(mapping && shmem_mapping(mapping));
> > +	} else {
> > +		/* If the mapping is unstable, fall back to the slow path. */
> > +		ret = false;
> > +	}
> > +
> > +	unlock_rcu();
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * try_grab_folio() - Attempt to get or pin a folio.
> >   * @page:    pointer to page to be grabbed
> > @@ -123,6 +195,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >   */
> >  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  {
> > +	bool is_longterm = flags & FOLL_LONGTERM;
> > +
> >  	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> >  		return NULL;
> >
> > @@ -136,8 +210,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  	 * right zone, so fail and let the caller fall back to the slow
> >  	 * path.
> >  	 */
> > -	if (unlikely((flags & FOLL_LONGTERM) &&
> > -		     !is_longterm_pinnable_page(page)))
> > +	if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> >  		return NULL;
> >
> >  	/*
> > @@ -148,6 +221,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >  		if (!folio)
> >  			return NULL;
> >
> > +		/*
> > +		 * Can this folio be safely pinned? We need to perform this
> > +		 * check after the folio is stabilised.
> > +		 */
> > +		if ((flags & FOLL_WRITE) && is_longterm &&
> > +		    !folio_longterm_write_pin_allowed(folio)) {
> > +			folio_put_refs(folio, refs);
> > +			return NULL;
> > +		}
> > +
> >  		/*
> >  		 * When pinning a large folio, use an exact count to track it.
> >  		 *
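
As a concrete way to poke at this from userspace - an untested sketch on my
part, not part of the patch - io_uring's fixed-buffer registration pins the
registered range with FOLL_WRITE | FOLL_LONGTERM, so with this patch applied
registering a buffer backed by a shared mapping of an ordinary file should be
refused, while anonymous or shmem backing should still succeed. The file name
here is arbitrary:

/*
 * Untested sketch: io_uring fixed-buffer registration pins pages with
 * FOLL_WRITE | FOLL_LONGTERM, so it can be used to observe the new
 * restriction without writing kernel code.
 */
#include <fcntl.h>
#include <linux/io_uring.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct io_uring_params p;
	struct iovec iov;
	void *buf;
	int ring, fd;
	long ret;

	memset(&p, 0, sizeof(p));
	ring = syscall(__NR_io_uring_setup, 1, &p);
	if (ring < 0) {
		perror("io_uring_setup");
		return 1;
	}

	/* A shared, writable mapping of an ordinary (dirty-tracked) file. */
	fd = open("pin-test", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, 4096) < 0) {
		perror("open/ftruncate");
		return 1;
	}
	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	iov.iov_base = buf;
	iov.iov_len = 4096;

	/*
	 * Registration takes a long-term writable pin; with this patch the
	 * slow path should reject it for a dirty-tracked file mapping.
	 */
	ret = syscall(__NR_io_uring_register, ring, IORING_REGISTER_BUFFERS,
		      &iov, 1);
	printf("file-backed fixed buffer: %s\n",
	       ret < 0 ? "rejected (expected with this patch)" : "allowed");

	/* Swap MAP_SHARED + fd for MAP_ANONYMOUS and it should succeed. */
	return 0;
}

If kvm_s390_pci_aif_enable() can ever be handed an address backed by such a
file-backed MAP_SHARED mapping, it would hit the same rejection; if the
memory is always anonymous, shmem or hugetlb backed, the whitelist covers it.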