On 07/30, Song Liu wrote: > > > > On Jul 30, 2019, at 9:11 AM, Oleg Nesterov <oleg@xxxxxxxxxx> wrote: > > > > So after the next patch we have a single user of FOLL_SPLIT_PMD (uprobes) > > and a single user of FOLL_SPLIT: arch/s390/mm/gmap.c:thp_split_mm(). > > > > Hmm. > > I think this is what we want. :) We? I don't ;) > FOLL_SPLIT is the fallback solution for users who cannot handle THP. and again, we have a single user: thp_split_mm(). I do not know if it can use FOLL_SPLIT_PMD or not, may be you can take a look... > With > more THP aware code, there will be fewer users of FOLL_SPLIT. Fewer than 1? Good ;) > >> @@ -399,7 +399,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, > >> spin_unlock(ptl); > >> return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); > >> } > >> - if (flags & FOLL_SPLIT) { > >> + if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) { > >> int ret; > >> page = pmd_page(*pmd); > >> if (is_huge_zero_page(page)) { > >> @@ -408,7 +408,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, > >> split_huge_pmd(vma, pmd, address); > >> if (pmd_trans_unstable(pmd)) > >> ret = -EBUSY; > >> - } else { > >> + } else if (flags & FOLL_SPLIT) { > >> if (unlikely(!try_get_page(page))) { > >> spin_unlock(ptl); > >> return ERR_PTR(-ENOMEM); > >> @@ -420,6 +420,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, > >> put_page(page); > >> if (pmd_none(*pmd)) > >> return no_page_table(vma, flags); > >> + } else { /* flags & FOLL_SPLIT_PMD */ > >> + spin_unlock(ptl); > >> + split_huge_pmd(vma, pmd, address); > >> + ret = pte_alloc(mm, pmd); > > > > I fail to understand why this differs from the is_huge_zero_page() case above. > > split_huge_pmd() handles is_huge_zero_page() differently. In this case, we > cannot use the pmd_trans_unstable() check. Please correct me, but iiuc the problem is not that split_huge_pmd() handles is_huge_zero_page() differently, the problem is that __split_huge_pmd_locked() handles the !vma_is_anonymous(vma) differently and returns with pmd_none() = T after pmdp_huge_clear_flush_notify(). This means that pmd_trans_unstable() will fail. Now, I don't understand why do we need pmd_trans_unstable() after split_huge_pmd(huge-zero-pmd), but whatever reason we have, why can't we unify both cases? IOW, could you explain why the path below is wrong? Oleg. --- x/mm/gup.c +++ x/mm/gup.c @@ -399,14 +399,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, spin_unlock(ptl); return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); } - if (flags & FOLL_SPLIT) { + if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) { int ret; page = pmd_page(*pmd); - if (is_huge_zero_page(page)) { + if ((flags & FOLL_SPLIT_PMD) || is_huge_zero_page(page)) { spin_unlock(ptl); - ret = 0; split_huge_pmd(vma, pmd, address); - if (pmd_trans_unstable(pmd)) + ret = 0; + if (pte_alloc(mm, pmd)) + ret = -ENOMEM; + else if (pmd_trans_unstable(pmd)) ret = -EBUSY; } else { if (unlikely(!try_get_page(page))) {