On 11/8/19 8:47 PM, Mike Kravetz wrote:
> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>
>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>> held just in read mode. It is OK to do increments in parallel without
>>>> synchronization. However, we don't want anyone else changing the count
>>>> while that check in huge_pmd_unshare is happening. Hence, the need for
>>>> taking the semaphore in write mode.
>>>
>>> This would be a nice addition to the changelog methinks.
>>
>> Last night I remembered there is one place where we currently take
>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That
>> is in try_to_unmap_one. Yes, there is a potential race here today.
>
> Actually there is no race there today. Callers to huge_pmd_unshare
> hold the page table lock. So, this synchronizes those unshare calls
> from page migration and page poisoning.
>
>> But that race is somewhat contained as you need two threads doing some
>> combination of page migration and page poisoning to race. This change
>> now allows migration or poisoning to race with page fault. I would
>> really prefer if we do not open up the race window in this manner.
>
> But, we do open a race window by changing huge_pmd_share to take the
> i_mmap_rwsem in read mode as in the original patch.
>
> Here is the additional code needed to take the semaphore in write mode
> for the huge_pmd_unshare calls via try_to_unmap_one. We would need to
> combine this with Longman's patch. Please take a look and provide feedback.
> Some of the changes are subtle, especially the exception for MAP_PRIVATE
> mappings, but I tried to add sufficient comments.
>
> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Date: Fri, 8 Nov 2019 17:25:37 -0800
> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>  call to huge_pmd_unshare in try_to_unmap_one.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
>  mm/hugetlb.c        |  9 ++++++++-
>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>  3 files changed, 60 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f78891f92765..73d9136549a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>   * decrementing the ref count. If count == 1, the pte page is not shared.
>   *
> - * called with page table lock held.
> + * Must be called while holding page table lock.
> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
> + * This is to prevent races with page faults calling huge_pmd_share which
> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
> + * in read mode. It is possible to call without holding i_mmap_rwsem in
> + * write mode if the caller KNOWS the page table is associated with a private
> + * mapping. This is because private mappings can not share PMDs and can
> + * not race with huge_pmd_share calls during page faults.

So the page table lock here is the huge_pte_lock(), right? In
huge_pmd_share(), the pte lock has to be taken before one can share the
pte page.

So would you mind explaining where exactly the race is?
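To make the question concrete, below is the ordering I have in mind on the
fault path. This is only a much simplified sketch from my reading of
huge_pmd_share() (the interval tree walk, error handling and most locals are
elided), so please treat it as an illustration rather than the exact code:

	i_mmap_lock_read(mapping);	/* read mode with this series */

	/* a pte page already populated by another vma of the same file */
	spte = huge_pte_offset(svma->vm_mm, saddr, sz);
	get_page(virt_to_page(spte));	/* bump the share count */

	ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
	if (pud_none(*pud))
		pud_populate(mm, pud,
			     (pmd_t *)((unsigned long)spte & PAGE_MASK));
	else
		put_page(virt_to_page(spte));	/* lost the race to populate */
	spin_unlock(ptl);

	i_mmap_unlock_read(mapping);

Since callers of huge_pmd_unshare() hold the page table lock as well, the
pud_populate() above and the pud_clear() in huge_pmd_unshare() look
serialized to me, hence the question.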
Thanks,
Longman

>   *
>   * returns: 1 successfully unmapped a shared pte page
>   *	    0 the underlying pte page is not shared, or it is the last user
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3151c87dff73..8f52b22cf71b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1030,7 +1030,33 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	if (kill)
>  		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
> 
> -	unmap_success = try_to_unmap(hpage, ttu);
> +	if (!PageHuge(hpage)) {
> +		unmap_success = try_to_unmap(hpage, ttu);
> +	} else {
> +		mapping = page_mapping(hpage);
> +		if (mapping) {
> +			/*
> +			 * For hugetlb pages, try_to_unmap could potentially
> +			 * call huge_pmd_unshare. Because of this, take
> +			 * semaphore in write mode here and set TTU_RMAP_LOCKED
> +			 * to indicate we have taken the lock at this higher
> +			 * level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			unmap_success = try_to_unmap(hpage,
> +						     ttu|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			unmap_success = try_to_unmap(hpage, ttu);
> +		}
> +	}
> +
>  	if (!unmap_success)
>  		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
>  		       pfn, page_mapcount(hpage));
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4fe45d1428c8..9cae5a4f1e48 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1333,8 +1333,31 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  		goto put_anon;
> 
>  	if (page_mapped(hpage)) {
> +		struct address_space *mapping = page_mapping(hpage);
> +
> -		try_to_unmap(hpage,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +		if (mapping) {
> +			/*
> +			 * try_to_unmap could potentially call huge_pmd_unshare.
> +			 * Because of this, take semaphore in write mode here
> +			 * and set TTU_RMAP_LOCKED to indicate we have taken
> +			 * the lock at this higher level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS);
> +		}
>  		page_was_mapped = 1;
>  	}
> 