> On Jun 13, 2019, at 9:47 AM, Song Liu <songliubraving@xxxxxx> wrote:
>
> Hi Kirill,
>
>> On Jun 13, 2019, at 8:24 AM, Song Liu <songliubraving@xxxxxx> wrote:
>>
>>> On Jun 13, 2019, at 8:14 AM, Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
>>>
>>> On Thu, Jun 13, 2019 at 03:03:01PM +0000, Song Liu wrote:
>>>>
>>>>> On Jun 13, 2019, at 7:16 AM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Thu, Jun 13, 2019 at 01:57:30PM +0000, Song Liu wrote:
>>>>>>> And I'm not convinced that it belongs here at all. User requested PMD
>>>>>>> split and it is done after split_huge_pmd(). The rest can be handled by
>>>>>>> the caller as needed.
>>>>>>
>>>>>> I put this part here because split_huge_pmd() for file-backed THP is
>>>>>> not really done after split_huge_pmd(). And I would like it done before
>>>>>> calling follow_page_pte() below. Maybe we can still do them here, just
>>>>>> for file-backed THPs?
>>>>>>
>>>>>> If we would move it, shall we move to callers of follow_page_mask()?
>>>>>> In that case, we will probably end up with similar code in two places:
>>>>>> __get_user_pages() and follow_page().
>>>>>>
>>>>>> Did I get this right?
>>>>>
>>>>> Would it be enough to replace pte_offset_map_lock() in follow_page_pte()
>>>>> with pte_alloc_map_lock()?
>>>>
>>>> This is similar to my previous version:
>>>>
>>>> +       } else {        /* flags & FOLL_SPLIT_PMD */
>>>> +               pte_t *pte;
>>>> +               spin_unlock(ptl);
>>>> +               split_huge_pmd(vma, pmd, address);
>>>> +               pte = get_locked_pte(mm, address, &ptl);
>>>> +               if (!pte)
>>>> +                       return no_page_table(vma, flags);
>>>> +               spin_unlock(ptl);
>>>> +               ret = 0;
>>>> +       }
>>>>
>>>> I think this is cleaner than use pte_alloc_map_lock() in follow_page_pte().
>>>> What's your thought on these two versions (^^^ vs. pte_alloc_map_lock)?
>>>
>>> It's additional lock-unlock cycle and few more lines of code...
>>>
>>>>> This will leave bunch not populated PTE entries, but it is fine: they will
>>>>> be populated on the next access to them.
>>>>
>>>> We need to handle page fault during next access, right? Since we already
>>>> allocated everything, we can just populate the PTE entries and saves a
>>>> lot of page faults (assuming we will access them later).
>>>
>>> Not a lot due to faultaround and they may never happen, but you need to
>>> tear down the mapping any way.
>>
>> I see. Let me try this way.
>>
>> Thanks,
>> Song
>
> To make sure I understand your suggestions, here is what I got:
>
> diff --git c/mm/gup.c w/mm/gup.c
> index ddde097cf9e4..85e6f46fd925 100644
> --- c/mm/gup.c
> +++ w/mm/gup.c
> @@ -197,7 +197,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>         if (unlikely(pmd_bad(*pmd)))
>                 return no_page_table(vma, flags);
>
> -       ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> +       ptep = pte_alloc_map_lock(mm, pmd, address, &ptl);
> +       if (!ptep)
> +               return ERR_PTR(-ENOMEM);
> +
>         pte = *ptep;
>         if (!pte_present(pte)) {
>                 swp_entry_t entry;
> @@ -398,7 +401,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>                 spin_unlock(ptl);
>                 return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>         }
> -       if (flags & FOLL_SPLIT) {
> +       if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>                 int ret;
>                 page = pmd_page(*pmd);
>                 if (is_huge_zero_page(page)) {
> @@ -407,7 +410,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>                         split_huge_pmd(vma, pmd, address);
>                         if (pmd_trans_unstable(pmd))
>                                 ret = -EBUSY;
> -               } else {
> +               } else if (flags & FOLL_SPLIT) {
>                         if (unlikely(!try_get_page(page))) {
>                                 spin_unlock(ptl);
>                                 return ERR_PTR(-ENOMEM);
> @@ -419,6 +422,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>                         put_page(page);
>                         if (pmd_none(*pmd))
>                                 return no_page_table(vma, flags);
> +               } else {        /* flags & FOLL_SPLIT_PMD */
> +                       spin_unlock(ptl);
> +                       split_huge_pmd(vma, pmd, address);
> +                       ret = 0;
>                 }
>
>                 return ret ? ERR_PTR(ret) :
>                         follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>
> This version doesn't work as-is, because it returns at the first check:
>
>         if (unlikely(pmd_bad(*pmd)))
>                 return no_page_table(vma, flags);
>
> Did I misunderstand anything here?
>
> Thanks,
> Song

I guess this would be the best. It _is_ a lot simpler.

diff --git c/mm/gup.c w/mm/gup.c
index ddde097cf9e4..0cd3ce599f41 100644
--- c/mm/gup.c
+++ w/mm/gup.c
@@ -398,7 +398,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                spin_unlock(ptl);
                return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
        }
-       if (flags & FOLL_SPLIT) {
+       if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
                int ret;
                page = pmd_page(*pmd);
                if (is_huge_zero_page(page)) {
@@ -407,7 +407,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                        split_huge_pmd(vma, pmd, address);
                        if (pmd_trans_unstable(pmd))
                                ret = -EBUSY;
-               } else {
+               } else if (flags & FOLL_SPLIT) {
                        if (unlikely(!try_get_page(page))) {
                                spin_unlock(ptl);
                                return ERR_PTR(-ENOMEM);
@@ -419,6 +419,11 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
                        put_page(page);
                        if (pmd_none(*pmd))
                                return no_page_table(vma, flags);
+               } else {        /* flags & FOLL_SPLIT_PMD */
+                       spin_unlock(ptl);
+                       ret = 0;
+                       split_huge_pmd(vma, pmd, address);
+                       pte_alloc(mm, pmd);
                }

Thanks again for the suggestions. I will send v4 soon.

Song

>
>>
>>> --
>>> Kirill A. Shutemov
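
For readers following along, the pattern the final diff settles on can be
summarized in a small sketch: split only the PMD-level mapping, then make
sure a PTE page table exists before anything walks down to the PTE level.
The helper name below is invented for illustration and is not part of the
patch; the logic mirrors the FOLL_SPLIT_PMD branch above.

/*
 * Illustrative sketch only -- not from the patch set.  split_pmd_mapping()
 * is an invented name.
 */
#include <linux/mm.h>
#include <linux/huge_mm.h>

static int split_pmd_mapping(struct vm_area_struct *vma, pmd_t *pmd,
                             unsigned long address)
{
        /* Split only the PMD-level mapping; the compound page stays intact. */
        split_huge_pmd(vma, pmd, address);

        /*
         * For file-backed THP the split leaves the PMD empty, so allocate a
         * PTE page table if one is missing (pte_alloc() is a no-op when it
         * already exists).  The PTE entries themselves stay empty and are
         * populated by later faults or faultaround, as discussed above.
         */
        if (pte_alloc(vma->vm_mm, pmd))
                return -ENOMEM;

        return 0;
}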
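
A hypothetical caller-side sketch, purely for illustration: follow_page()
is one of the callers named earlier in the thread, FOLL_SPLIT_PMD is the
flag this series introduces, and the helper name and minimal error handling
here are invented, so treat this as an assumption about how the flag would
be used rather than code from the series.

/*
 * Hypothetical usage sketch -- not from the patch set.  The caller must
 * hold the mm's mmap_sem.
 */
#include <linux/mm.h>
#include <linux/err.h>

static struct page *get_base_page_split_pmd(struct vm_area_struct *vma,
                                            unsigned long addr)
{
        struct page *page;

        /*
         * FOLL_SPLIT_PMD splits only the PMD-level mapping covering addr;
         * FOLL_SPLIT would split the compound page itself.
         */
        page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT_PMD);
        if (IS_ERR_OR_NULL(page))
                return NULL;

        return page;    /* caller drops the reference with put_page() */
}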