On Mon, Feb 06, 2017 at 07:02:41AM -0600, Zi Yan wrote:
> On 6 Feb 2017, at 1:43, Naoya Horiguchi wrote:
> 
> > On Sun, Feb 05, 2017 at 11:12:41AM -0500, Zi Yan wrote:
> >> From: Zi Yan <ziy@xxxxxxxxxx>
> >>
> >> Originally, zap_pmd_range() checks the pmd value without taking the pmd lock.
> >> This can cause a pmd_protnone entry to not be freed.
> >>
> >> Changing a pmd entry into a pmd_protnone entry takes two steps:
> >> first, the pmd entry is cleared to a pmd_none entry; then, the
> >> pmd_none entry is changed into a pmd_protnone entry.
> >> The racy check, even with a barrier, might only see the pmd_none entry
> >> in zap_pmd_range(), so the mapping is neither split nor zapped.
> >>
> >> Later, in free_pmd_range(), pmd_none_or_clear_bad() will see the
> >> pmd_protnone entry and clear it as a pmd_bad entry. Furthermore,
> >> since the pmd_protnone entry is not properly freed, the corresponding
> >> deposited pte page table is not freed either.
> >>
> >> This causes a memory leak, or a kernel crash if VM_BUG_ON() is enabled.
> >>
> >> This patch relies on __split_huge_pmd_locked() and
> >> __zap_huge_pmd_locked().
> >>
> >> Signed-off-by: Zi Yan <zi.yan@xxxxxxxxxxxxxx>
> >> ---
> >>  mm/memory.c | 24 +++++++++++-------------
> >>  1 file changed, 11 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 3929b015faf7..7cfdd5208ef5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1233,33 +1233,31 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> >>  			struct zap_details *details)
> >>  {
> >>  	pmd_t *pmd;
> >> +	spinlock_t *ptl;
> >>  	unsigned long next;
> >>  
> >>  	pmd = pmd_offset(pud, addr);
> >> +	ptl = pmd_lock(vma->vm_mm, pmd);
> >
> > If USE_SPLIT_PMD_PTLOCKS is true, pmd_lock() returns a different ptl for
> > each pmd. The following code runs over the pmds within [addr, end) with
> > a single ptl (that of the first pmd), so I doubt this locking really works.
> > Maybe pmd_lock() should be called inside the while loop?
> 
> According to include/linux/mm.h, pmd_lockptr() first gets the page the pmd is in,
> using mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1) = 0xfffffffffffff000 and virt_to_page().
> Then, ptlock_ptr() gets the spinlock_t either from page->ptl (split case) or
> mm->page_table_lock (non-split case).
> 
> It seems to me that all PMDs in one page table page share a single spinlock. Let me know
> if I misunderstand any code.

Thanks for the clarification, it was my misunderstanding.

Naoya

> 
> But your suggestion can avoid holding the pmd lock for long stretches without cond_resched(),
> so I can move the spinlock inside the loop.
> 
> Thanks.
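For reference, the include/linux/mm.h helpers under discussion look roughly like the sketch below (paraphrased from the kernel sources of that era for the USE_SPLIT_PMD_PTLOCKS case; exact definitions may differ slightly):

static struct page *pmd_to_page(pmd_t *pmd)
{
	/* round the pmd pointer down to the start of its page-table page */
	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);

	return virt_to_page((void *)((unsigned long) pmd & mask));
}

static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	/* split case: the lock lives in the page-table page (page->ptl);
	 * without split ptlocks this would be &mm->page_table_lock instead */
	return ptlock_ptr(pmd_to_page(pmd));
}

static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
	spinlock_t *ptl = pmd_lockptr(mm, pmd);

	spin_lock(ptl);
	return ptl;
}

Because the mask rounds every pmd pointer in a page-table page down to the same page, all PMDs in that page share one spinlock, which is why a single ptl can cover the whole [addr, end) walk.
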
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5299b261c4b4..ff61d45eaea7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1260,31 +1260,34 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>  			struct zap_details *details)
>  {
>  	pmd_t *pmd;
> -	spinlock_t *ptl;
> +	spinlock_t *ptl = NULL;
>  	unsigned long next;
>  
>  	pmd = pmd_offset(pud, addr);
> -	ptl = pmd_lock(vma->vm_mm, pmd);
>  	do {
> +		ptl = pmd_lock(vma->vm_mm, pmd);
>  		next = pmd_addr_end(addr, end);
>  		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
>  			if (next - addr != HPAGE_PMD_SIZE) {
>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
>  				__split_huge_pmd_locked(vma, pmd, addr, false);
> -			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
> -				continue;
> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr)) {
> +				spin_unlock(ptl);
> +				goto next;
> +			}
>  			/* fall through */
>  		}
>  
> -		if (pmd_none_or_clear_bad(pmd))
> -			continue;
> +		if (pmd_none_or_clear_bad(pmd)) {
> +			spin_unlock(ptl);
> +			goto next;
> +		}
>  		spin_unlock(ptl);
>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> +next:
>  		cond_resched();
> -		spin_lock(ptl);
>  	} while (pmd++, addr = next, addr != end);
> -	spin_unlock(ptl);
>  
>  	return addr;
>  }
> 
> 
> > 
> > Thanks,
> > Naoya Horiguchi
> > 
> >>  	do {
> >>  		next = pmd_addr_end(addr, end);
> >>  		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> >>  			if (next - addr != HPAGE_PMD_SIZE) {
> >>  				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
> >>  				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
> >> -				__split_huge_pmd(vma, pmd, addr, false, NULL);
> >> -			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
> >> -				goto next;
> >> +				__split_huge_pmd_locked(vma, pmd, addr, false);
> >> +			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr))
> >> +				continue;
> >>  			/* fall through */
> >>  		}
> >> -		/*
> >> -		 * Here there can be other concurrent MADV_DONTNEED or
> >> -		 * trans huge page faults running, and if the pmd is
> >> -		 * none or trans huge it can change under us. This is
> >> -		 * because MADV_DONTNEED holds the mmap_sem in read
> >> -		 * mode.
> >> -		 */
> >> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> >> -			goto next;
> >> +
> >> +		if (pmd_none_or_clear_bad(pmd))
> >> +			continue;
> >> +		spin_unlock(ptl);
> >>  		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> >> -next:
> >>  		cond_resched();
> >> +		spin_lock(ptl);
> >>  	} while (pmd++, addr = next, addr != end);
> >> +	spin_unlock(ptl);
> >>  
> >>  	return addr;
> >>  }
> >> -- 
> >> 2.11.0
> >> 
> > 
> 
> -- 
> Best Regards
> Yan Zi
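
Assembling the hunks quoted above by hand, the zap_pmd_range() loop after the follow-up change would look roughly as follows (a reconstruction for readability only, not copied from an actual tree):

	pmd = pmd_offset(pud, addr);
	do {
		/* take the per-page-table-page lock for each pmd iteration */
		ptl = pmd_lock(vma->vm_mm, pmd);
		next = pmd_addr_end(addr, end);
		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
			if (next - addr != HPAGE_PMD_SIZE) {
				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
				    !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
				__split_huge_pmd_locked(vma, pmd, addr, false);
			} else if (__zap_huge_pmd_locked(tlb, vma, pmd, addr)) {
				spin_unlock(ptl);
				goto next;
			}
			/* fall through */
		}

		if (pmd_none_or_clear_bad(pmd)) {
			spin_unlock(ptl);
			goto next;
		}
		/* drop the pmd lock before descending to the pte level */
		spin_unlock(ptl);
		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
next:
		cond_resched();		/* no spinlock held across a reschedule */
	} while (pmd++, addr = next, addr != end);

	return addr;

Every exit path from an iteration releases the lock before cond_resched(), so the lock is never held across a potential reschedule and is re-acquired fresh for the next pmd.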