> On Jul 31, 2019, at 8:18 AM, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>
> On 07/30, Song Liu wrote:
>>
>>
>>> On Jul 30, 2019, at 9:11 AM, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>>>
>>> So after the next patch we have a single user of FOLL_SPLIT_PMD (uprobes)
>>> and a single user of FOLL_SPLIT: arch/s390/mm/gmap.c:thp_split_mm().
>>>
>>> Hmm.
>>
>> I think this is what we want. :)
>
> We? I don't ;)
>
>> FOLL_SPLIT is the fallback solution for users who cannot handle THP.
>
> and again, we have a single user: thp_split_mm(). I do not know if it
> can use FOLL_SPLIT_PMD or not, may be you can take a look...

I haven't played with s390, so it is going to take me some time to ramp up.
I will add it to my to-do list.

>
>> With
>> more THP aware code, there will be fewer users of FOLL_SPLIT.
>
> Fewer than 1? Good ;)

Yes! It would be great if thp_split_mm() could use FOLL_SPLIT_PMD instead.

>
>>>> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>>  		spin_unlock(ptl);
>>>>  		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>>>>  	}
>>>> -	if (flags & FOLL_SPLIT) {
>>>> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>>>>  		int ret;
>>>>  		page = pmd_page(*pmd);
>>>>  		if (is_huge_zero_page(page)) {
>>>> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>>  			split_huge_pmd(vma, pmd, address);
>>>>  			if (pmd_trans_unstable(pmd))
>>>>  				ret = -EBUSY;
>>>> -		} else {
>>>> +		} else if (flags & FOLL_SPLIT) {
>>>>  			if (unlikely(!try_get_page(page))) {
>>>>  				spin_unlock(ptl);
>>>>  				return ERR_PTR(-ENOMEM);
>>>> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>>>>  			put_page(page);
>>>>  			if (pmd_none(*pmd))
>>>>  				return no_page_table(vma, flags);
>>>> +		} else {	/* flags & FOLL_SPLIT_PMD */
>>>> +			spin_unlock(ptl);
>>>> +			split_huge_pmd(vma, pmd, address);
>>>> +			ret = pte_alloc(mm, pmd);
>>>
>>> I fail to understand why this differs from the is_huge_zero_page() case above.
>>
>> split_huge_pmd() handles is_huge_zero_page() differently. In this case, we
>> cannot use the pmd_trans_unstable() check.
>
> Please correct me, but iiuc the problem is not that split_huge_pmd() handles
> is_huge_zero_page() differently, the problem is that __split_huge_pmd_locked()
> handles the !vma_is_anonymous(vma) differently and returns with pmd_none() = T
> after pmdp_huge_clear_flush_notify(). This means that pmd_trans_unstable() will
> fail.

Agreed.

>
> Now, I don't understand why do we need pmd_trans_unstable() after
> split_huge_pmd(huge-zero-pmd), but whatever reason we have, why can't we
> unify both cases?
>
> IOW, could you explain why the patch below is wrong?

I _think_ the patch below works (I haven't fully tested it yet), but I am not
sure it is the better option. Keeping the two cases separate does not
duplicate much code, and it makes it obvious that the two cases are handled
differently. Therefore, I would prefer to keep them separate for now.
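Coming back to the s390 question above: from memory (I have not checked a
tree while writing this mail), thp_split_mm() just walks every VMA and calls
follow_page() with FOLL_SPLIT for each PAGE_SIZE step, so the conversion on
my to-do list would be roughly the one-flag change below. This is a completely
untested sketch, not a patch; the real question is whether gmap can live with
the compound page staying intact, since FOLL_SPLIT_PMD only splits the PMD
mapping into PTEs while FOLL_SPLIT splits the huge page itself into base
pages:

/* Rough, untested sketch of thp_split_mm() switched to FOLL_SPLIT_PMD. */
static void thp_split_mm(struct mm_struct *mm)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	struct vm_area_struct *vma;
	unsigned long addr;

	for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
		for (addr = vma->vm_start; addr < vma->vm_end;
		     addr += PAGE_SIZE)
			/* was FOLL_SPLIT: split the whole THP into base pages */
			follow_page(vma, addr, FOLL_SPLIT_PMD);
		vma->vm_flags &= ~VM_HUGEPAGE;
		vma->vm_flags |= VM_NOHUGEPAGE;
	}
	mm->def_flags |= VM_NOHUGEPAGE;
#endif
}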
Thanks,
Song

>
>
> --- x/mm/gup.c
> +++ x/mm/gup.c
> @@ -399,14 +399,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>  	}
> -	if (flags & FOLL_SPLIT) {
> +	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>  		int ret;
>  		page = pmd_page(*pmd);
> -		if (is_huge_zero_page(page)) {
> +		if ((flags & FOLL_SPLIT_PMD) || is_huge_zero_page(page)) {
>  			spin_unlock(ptl);
> -			ret = 0;
>  			split_huge_pmd(vma, pmd, address);
> -			if (pmd_trans_unstable(pmd))
> +			ret = 0;
> +			if (pte_alloc(mm, pmd))
> +				ret = -ENOMEM;
> +			else if (pmd_trans_unstable(pmd))
>  				ret = -EBUSY;
>  		} else {
>  			if (unlikely(!try_get_page(page))) {
>
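Just for context on the other caller: after this series the only intended
user of FOLL_SPLIT_PMD is the uprobes write path, and on that side the change
is simply passing the new flag to get_user_pages_remote(). Roughly like this
(simplified, error handling omitted; mm, vaddr, old_page and vma come from
the surrounding uprobe_write_opcode()-style context):

	/*
	 * Ask GUP to split only the PMD mapping (not the compound page)
	 * before taking a reference on the page that maps vaddr.
	 */
	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
				    FOLL_FORCE | FOLL_SPLIT_PMD,
				    &old_page, &vma, NULL);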