On 18/07/2023 00:37, Hugh Dickins wrote: > On Mon, 17 Jul 2023, Ryan Roberts wrote: > >>>>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio) >>>>>> +{ >>>>>> + int i; >>>>>> + gfp_t gfp; >>>>>> + pte_t *pte; >>>>>> + unsigned long addr; >>>>>> + struct vm_area_struct *vma = vmf->vma; >>>>>> + int prefer = anon_folio_order(vma); >>>>>> + int orders[] = { >>>>>> + prefer, >>>>>> + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0, >>>>>> + 0, >>>>>> + }; >>>>>> + >>>>>> + *folio = NULL; >>>>>> + >>>>>> + if (vmf_orig_pte_uffd_wp(vmf)) >>>>>> + goto fallback; >>>>>> + >>>>>> + for (i = 0; orders[i]; i++) { >>>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); >>>>>> + if (addr >= vma->vm_start && >>>>>> + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end) >>>>>> + break; >>>>>> + } >>>>>> + >>>>>> + if (!orders[i]) >>>>>> + goto fallback; >>>>>> + >>>>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK); >>>>>> + if (!pte) >>>>>> + return -EAGAIN; >>>>> >>>>> It would be a bug if this happens. So probably -EINVAL? >>>> >>>> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it >>>> possible for pte_offset_map() to fail (if I understood correctly) and we have to >>>> handle this. The intent is that we will return from the fault without making any >>>> change, then we will refault and try again. >>> >>> Thanks for checking that -- it's very relevant. One detail is that >>> that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't >>> happen while we are holding mmap_lock for read here, and therefore, >>> the race that could cause pte_offset_map() on shmem/file PTEs to fail >>> doesn't apply here. >> >> But Hugh's patches have changed do_anonymous_page() to handle failure from >> pte_offset_map_lock(). So I was just following that pattern. If this really >> can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s >> prototype to just return a `struct folio *` (and if it's null that means ENOMEM). >> >> Hugh, perhaps you can comment? > > I agree with your use of -EAGAIN there: I find it better to allow for the > possibility, than to go to great effort persuading that it's impossible; > especially because what's possible tomorrow may differ from today. > > And notice that, before my changes, there used to be a pmd_trans_unstable() > check above, implying that it is possible for it to fail (for more reasons > than corruption causing pmd_bad()) - one scenario would be that the > pte_alloc() above succeeded *because* someone else had managed to insert > a huge pmd there already (maybe we have MMF_DISABLE_THP but they did not). > > But I see from later mail that Yu Zhao now agrees with your -EAGAIN too, > so we are all on the same folio. Thanks for the explanation. I think we are all now agreed that the error case needs handling and -EAGAIN is the correct code. > > Hugh > > p.s. while giving opinions, I'm one of those against using "THP" for > large but not pmd-mappable folios; and was glad to see Matthew arguing > the same way when considering THP_SWPOUT in another thread today. Honestly, I don't have an opinion either way on this (probably because I don't have the full context and history of THP like many of you do). So given there is a fair bit of opposition to FLEXIBLE_THP, I'll change it back to LARGE_ANON_FOLIO (and move it out of the THP sub-menu) in the next revision.