On Tue, Jun 8, 2021 at 9:12 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote: > > > > ---------- Forwarded message ---------- > Date: Tue, 8 Jun 2021 21:10:19 -0700 (PDT) > From: Hugh Dickins <hughd@xxxxxxxxxx> > To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>, > Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>, > Yang Shi <shy828301@xxxxxxxxx>, Wang Yugui <wangyugui@xxxxxxxxxxxx>, > Matthew Wilcox <willy@xxxxxxxxxxxxx>, > Naoya Horiguchi <naoya.horiguchi@xxxxxxx>, > Alistair Popple <apopple@xxxxxxxxxx>, Ralph Campbell <rcampbell@xxxxxxxxxx>, > Zi Yan <ziy@xxxxxxxxxx>, Miaohe Lin <linmiaohe@xxxxxxxxxx>, > Minchan Kim <minchan@xxxxxxxxxx>, Jue Wang <juew@xxxxxxxxxx>, > Peter Xu <peterx@xxxxxxxxxx>, Jan Kara <jack@xxxxxxx>, > Shakeel Butt <shakeelb@xxxxxxxxxx>, Oscar Salvador <osalvador@xxxxxxx> > Subject: [PATCH v2 03/10] mm/thp: try_to_unmap() use TTU_SYNC for safe splitting > > Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE > (!unmap_success): with dump_page() showing mapcount:1, but then its > raw struct page output showing _mapcount ffffffff i.e. mapcount 0. > > And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed, > it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)), > and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG(): > all indicative of some mapcount difficulty in development here perhaps. > But the !CONFIG_DEBUG_VM path handles the failures correctly and silently. > > I believe the problem is that once a racing unmap has cleared pte or pmd, > try_to_unmap_one() may skip taking the page table lock, and emerge from > try_to_unmap() before the racing task has reached decrementing mapcount. > > Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that > follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC > to the options, and passing that from unmap_page(). > > When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same > for both: the slight overhead added should rarely matter, except perhaps > if splitting sparsely-populated multiply-mapped shmem. Once confident > that bugs are fixed, TTU_SYNC here can be removed, and the race tolerated. > > Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers") > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx> > Cc: <stable@xxxxxxxxxxxxxxx> > --- > v2: moved TTU_SYNC definition up, to avoid conflict with other patchset > use TTU_SYNC even when non-debug, per Peter Xu and Yang Shi > expanded PVMW_SYNC's spin_unlock(pmd_lock()), per Kirill and Peter Reviewed-by: Yang Shi <shy828301@xxxxxxxxx> > > include/linux/rmap.h | 1 + > mm/huge_memory.c | 2 +- > mm/page_vma_mapped.c | 11 +++++++++++ > mm/rmap.c | 17 ++++++++++++++++- > 4 files changed, 29 insertions(+), 2 deletions(-) > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index def5c62c93b3..8d04e7deedc6 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -91,6 +91,7 @@ enum ttu_flags { > > TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */ > TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */ > + TTU_SYNC = 0x10, /* avoid racy checks with PVMW_SYNC */ > TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */ > TTU_BATCH_FLUSH = 0x40, /* Batch TLB flushes where possible > * and caller guarantees they will > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 5885c5f5836f..84ab735139dc 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2350,7 +2350,7 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, > > static void unmap_page(struct page *page) > { > - enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | > + enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC | > TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD; > bool unmap_success; > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c > index 2cf01d933f13..5b559967410e 100644 > --- a/mm/page_vma_mapped.c > +++ b/mm/page_vma_mapped.c > @@ -212,6 +212,17 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) > pvmw->ptl = NULL; > } > } else if (!pmd_present(pmde)) { > + /* > + * If PVMW_SYNC, take and drop THP pmd lock so that we > + * cannot return prematurely, while zap_huge_pmd() has > + * cleared *pmd but not decremented compound_mapcount(). > + */ > + if ((pvmw->flags & PVMW_SYNC) && > + PageTransCompound(pvmw->page)) { > + spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); > + > + spin_unlock(ptl); > + } > return false; > } > if (!map_pte(pvmw)) > diff --git a/mm/rmap.c b/mm/rmap.c > index 693a610e181d..07811b4ae793 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1405,6 +1405,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > struct mmu_notifier_range range; > enum ttu_flags flags = (enum ttu_flags)(long)arg; > > + /* > + * When racing against e.g. zap_pte_range() on another cpu, > + * in between its ptep_get_and_clear_full() and page_remove_rmap(), > + * try_to_unmap() may return false when it is about to become true, > + * if page table locking is skipped: use TTU_SYNC to wait for that. > + */ > + if (flags & TTU_SYNC) > + pvmw.flags = PVMW_SYNC; > + > /* munlock has nothing to gain from examining un-locked vmas */ > if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED)) > return true; > @@ -1777,7 +1786,13 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags) > else > rmap_walk(page, &rwc); > > - return !page_mapcount(page) ? true : false; > + /* > + * When racing against e.g. zap_pte_range() on another cpu, > + * in between its ptep_get_and_clear_full() and page_remove_rmap(), > + * try_to_unmap() may return false when it is about to become true, > + * if page table locking is skipped: use TTU_SYNC to wait for that. > + */ > + return !page_mapcount(page); > } > > /** > -- > 2.26.2 > >