On 11/12/19 6:11 PM, Mike Kravetz wrote: > On 11/12/19 9:27 AM, Waiman Long wrote: >> On 11/8/19 8:47 PM, Mike Kravetz wrote: >>> On 11/8/19 11:10 AM, Mike Kravetz wrote: >>>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote: >>>>> On Thu, 07 Nov 2019, Mike Kravetz wrote: >>>>> >>>>>> Note that huge_pmd_share now increments the page count with the semaphore >>>>>> held just in read mode. It is OK to do increments in parallel without >>>>>> synchronization. However, we don't want anyone else changing the count >>>>>> while that check in huge_pmd_unshare is happening. Hence, the need for >>>>>> taking the semaphore in write mode. >>>>> This would be a nice addition to the changelog methinks. >>>> Last night I remembered there is one place where we currently take >>>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That >>>> is in try_to_unmap_one. Yes, there is a potential race here today. >>> Actually there is no race there today. Callers to huge_pmd_unshare >>> hold the page table lock. So, this synchronizes those unshare calls >>> from page migration and page poisoning. >>> >>>> But that race is somewhat contained as you need two threads doing some >>>> combination of page migration and page poisoning to race. This change >>>> now allows migration or poisoning to race with page fault. I would >>>> really prefer if we do not open up the race window in this manner. >>> But, we do open a race window by changing huge_pmd_share to take the >>> i_mmap_rwsem in read mode as in the original patch. >>> >>> Here is the additional code needed to take the semaphore in write mode >>> for the huge_pmd_unshare calls via try_to_unmap_one. We would need to >>> combine this with Longman's patch. Please take a look and provide feedback. >>> Some of the changes are subtle, especially the exception for MAP_PRIVATE >>> mappings, but I tried to add sufficient comments. >>> >>> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001 >>> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx> >>> Date: Fri, 8 Nov 2019 17:25:37 -0800 >>> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before >>> call to huge_pmd_unshare in try_to_unmap_one. >>> >>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx> >>> --- >>> mm/hugetlb.c | 9 ++++++++- >>> mm/memory-failure.c | 28 +++++++++++++++++++++++++++- >>> mm/migrate.c | 27 +++++++++++++++++++++++++-- >>> 3 files changed, 60 insertions(+), 4 deletions(-) >>> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>> index f78891f92765..73d9136549a5 100644 >>> --- a/mm/hugetlb.c >>> +++ b/mm/hugetlb.c >>> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) >>> * indicated by page_count > 1, unmap is achieved by clearing pud and >>> * decrementing the ref count. If count == 1, the pte page is not shared. >>> * >>> - * called with page table lock held. >>> + * Must be called while holding page table lock. >>> + * In general, the caller should also hold the i_mmap_rwsem in write mode. >>> + * This is to prevent races with page faults calling huge_pmd_share which >>> + * will not be holding the page table lock, but will be holding i_mmap_rwsem >>> + * in read mode. It is possible to call without holding i_mmap_rwsem in >>> + * write mode if the caller KNOWS the page table is associated with a private >>> + * mapping. This is because private mappings can not share PMDs and can >>> + * not race with huge_pmd_share calls during page faults. >> So the page table lock here is the huge_pte_lock(). Right? In >> huge_pmd_share(), the pte lock has to be taken before one can share it. >> So would you mind explaining where exactly is the race? > My concern was that this code in huge_pmd_share: > > saddr = page_table_shareable(svma, vma, addr, idx); > if (saddr) { > spte = huge_pte_offset(svma->vm_mm, saddr, > vma_mmu_pagesize(svma)); > if (spte) { > get_page(virt_to_page(spte)); > break; > } > } > > and this code in huge_pmd_unshare: > > BUG_ON(page_count(virt_to_page(ptep)) == 0); > if (page_count(virt_to_page(ptep)) == 1) > return 0; > > could run concurrently. Note that return code for huge_pmd_unshare is > specified as, > > * returns: 1 successfully unmapped a shared pte page > * 0 the underlying pte page is not shared, or it is the last user > > And, callers take different action depending on the return value. > > Now, since the code blocks above can run concurrently it is possible that: > - huge_pmd_unshare sees page_count == 1 > - huge_pmd_share increments page_count in anticipation of sharing > - huge_pmd_unshare returns 0 indicating there is no pmd sharing > - huge_pmd_share waits for page table lock to insert pmd page before it > sharts sharing > > My concern was with what might happen if a huge_pmd_unshare caller received > a 0 return code and thought there were no other users of the pmd. In the > specific case mentioned here (try_to_unmap_one) I now do not believe there > are any issues. The caller is simply going to clear one entry on the pmd > page. I can not think of a way this could impact the other thread waiting > to share the pmd. It will simply start sharing the pmd after it gets the > page table lock and the entry has been removed. > > As the result of your question, I went back and took a really close look > at the code in question. While that huge_pmd_share/huge_pmd_unshare code > running concurrently does make me a bit nervous, I can not find any specific > problem. In addition, I ran a test causes this race for several hours > without issue. > > Sorry for the churn/second thoughts. This code is subtle, and sometimes hard > to understand. IMO, Longman's patch (V2) can go forward, but please delete > the following blob in the commit message from me: > > "Note that huge_pmd_share now increments the page count with the > semaphore held just in read mode. It is OK to do increments in > parallel without synchronization. However, we don't want anyone else > changing the count while that check in huge_pmd_unshare is happening. > Hence, the need for taking the semaphore in write mode." > Thanks for the explanation. I understand that it is often time hard to figure out if a race exists or not. I got confused myself many times. Anyway, as long as any destructive action is only taken while holding the page table lock, it should be safe. Cheers, Longman