On Thu, Jul 27, 2023 at 03:28:49PM +0200, David Hildenbrand wrote:
> > > > Therefore, when obtaining pages through the follow_trans_huge_pmd
> > > > interface, add the FOLL_FORCE flag to count the pages corresponding to
> > > > PROTNONE to solve the above problem.
> > > >
> > >
> > > We really want to avoid the usage of FOLL_FORCE, and ideally limit it
> > > to ptrace only.
> >
> > Fundamentally, when removing FOLL_NUMA we already assumed that !FORCE
> > means FOLL_NUMA.  It means to me that after the removal it's not possible
> > for a gup walker to say "it's not FORCEd, but I don't want to trigger
> > NUMA, just get the page".
> >
> > Is that what we want?  Shall we document in FOLL_FORCE that we intend to
> > enforce NUMA balancing as long as !FORCE?
>
> That was the idea, yes. I could have sworn we had that at least in some
> patch description.
>
> Back then, I played with special-casing gup_can_follow_protnone() on
> FOLL_GET | FOLL_PIN. But it's all just best guesses.
>
> Can always be added if deemed necessary and worth it.
>
> Here, it's simply an abuse of that GUP function that I wasn't aware of --
> otherwise I'd have removed that beforehand.
>
> >
> > >
> > > >
> > > > Signed-off-by: liubo <liubo254@xxxxxxxxxx>
> > > > Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
> > > > ---
> > > >  fs/proc/task_mmu.c | 6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index c1e6531cb02a..ed08f9b869e2 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -571,8 +571,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > >          bool migration = false;
> > > >
> > > >          if (pmd_present(*pmd)) {
> > > > -                /* FOLL_DUMP will return -EFAULT on huge zero page */
> > > > -                page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > > > +                /* FOLL_DUMP will return -EFAULT on huge zero page
> > > > +                 * FOLL_FORCE follow a PROT_NONE mapped page
> > > > +                 */
> > > > +                page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP | FOLL_FORCE);
> > > >          } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > > >                  swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > >
> > > Might do as an easy fix. But we really should get rid of that
> > > absolutely disgusting usage of follow_trans_huge_pmd().
> > >
> > > We don't need 99% of what follow_trans_huge_pmd() does here.
> > >
> > > Would the following also fix your issue?
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 507cd4e59d07..fc744964816e 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -587,8 +587,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > >          bool migration = false;
> > >
> > >          if (pmd_present(*pmd)) {
> > > -                /* FOLL_DUMP will return -EFAULT on huge zero page */
> > > -                page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > > +                page = vm_normal_page_pmd(vma, addr, *pmd);
> > >          } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > >                  swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > >
> > > It also skips the shared zeropage and pmd_devmap().
> > >
> > > Otherwise, a simple pmd_page(*pmd) + is_huge_zero_pmd(*pmd) check will do, but I
> > > suspect vm_normal_page_pmd() might be what we actually want to have here.
> > >
> > > Because smaps_pte_entry() properly checks for vm_normal_page().
> > There are indeed some very trivial details in vm_normal_page_pmd() that
> > are different, but maybe not so relevant.  E.g.,
> >
> >         if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
> >                 return -ENOMEM;
>
> Note that we're not even passing FOLL_GET | FOLL_PIN. Because we're not
> actually doing GUP. So the refcount is not that relevant.
>
> >
> >         if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> >                 return -EREMOTEIO;
> >
> > I'm not sure whether the p2pdma page would matter in any form here.  E.g.,
> > whether it can be mapped privately.
>
> Good point, but I don't think that people messing with GUP even imagined
> that we would call that function from a !GUP place.
>
> This was wrong from the very start. If we're not in GUP, we shouldn't call
> GUP functions.

My understanding is that !GET && !PIN is still called gup.. otherwise we
wouldn't need GET at all and it could just always be implied.

The other proof is that try_grab_page() doesn't fail hard on !GET && !PIN, so
I don't know whether that usage is really "wrong"..

Back to the topic: I'd say either of the patches looks good to solve the
problem.  If p2pdma pages are mapped as PFNMAP/MIXEDMAP (?), I guess the
vm_normal_page_pmd() proposed here will also work on them, so I don't see
anything wrong with the 2nd one yet.

It indeed looks nicer not to have FOLL_FORCE here, but it also makes me wonder
whether we should document the NUMA behavior for FOLL_* somewhere, because
right now !FOLL_FORCE implies NUMA balancing enforcement, which is not obvious
to me..

Looking further at that aspect, see follow_page(): previously it could follow
a protnone page (as it never applies FOLL_NUMA), but now it won't (it never
applies FOLL_FORCE either, so it seems to "accidentally" imply FOLL_NUMA now).
Not sure whether that's intended, though..

-- 
Peter Xu
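
For reference, a rough sketch of the behavior being discussed, paraphrased and
simplified from the post-474098edac26 GUP code (not a verbatim copy of the
kernel sources):

        /* Only FOLL_FORCE lets GUP look through a protnone (NUMA-hint) entry. */
        static inline bool gup_can_follow_protnone(unsigned int flags)
        {
                return flags & FOLL_FORCE;
        }

        /* In follow_trans_huge_pmd(), a protnone PMD is then skipped: */
        if (pmd_protnone(*pmd) && !gup_can_follow_protnone(flags))
                return NULL;    /* smaps_pmd_entry() sees NULL and skips the THP */

This is why smaps can under-count THPs that currently carry NUMA-hinting
(PROT_NONE) protection, and why either passing FOLL_FORCE or switching to
vm_normal_page_pmd() (which takes no FOLL_* flags at all) restores the
accounting.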