On 05/31/22 10:00, Mike Kravetz wrote:
> On 5/30/22 12:56, Peter Xu wrote:
> > Hi, Mike,
> >
> > On Fri, May 27, 2022 at 03:58:47PM -0700, Mike Kravetz wrote:
> >> +unsigned long hugetlb_mask_last_hp(struct hstate *h)
> >> +{
> >> +	unsigned long hp_size = huge_page_size(h);
> >> +
> >> +	if (hp_size == P4D_SIZE)
> >> +		return PGDIR_SIZE - P4D_SIZE;
> >> +	else if (hp_size == PUD_SIZE)
> >> +		return P4D_SIZE - PUD_SIZE;
> >> +	else if (hp_size == PMD_SIZE)
> >> +		return PUD_SIZE - PMD_SIZE;
> >> +
> >> +	return ~(0);
> >> +}
> >
> > How about:
> >
> > unsigned long hugetlb_mask_last_hp(struct hstate *h)
> > {
> > 	unsigned long hp_size = huge_page_size(h);
> >
> > 	return hp_size * (PTRS_PER_PTE - 1);
> > }
> >
> > ?

As mentioned in a followup e-mail, I am a little worried about this
calculation not being accurate for all configurations.  Today,
PTRS_PER_PTE == PTRS_PER_PMD == PTRS_PER_PUD == PTRS_PER_P4D in all
architectures that select CONFIG_ARCH_WANT_GENERAL_HUGETLB.  However,
if we code things as above and that ever changes, the resulting bug
might be hard to find.  In the next version, I will keep the behavior
as above but move to a switch statement for better readability.

> > This is definitely a good idea, though I'm wondering the possibility
> > to go one step further to make hugetlb pgtable walk just like the
> > normal pages.
> >
> > Say, would it be non-trivial to bring some of huge_pte_offset() into
> > the walker functions, so that we can jump over even larger than
> > PTRS_PER_PTE entries (e.g. when p4d==NULL for 2m huge pages)?  It's
> > very possible I overlooked something, though.

I briefly looked at this.  To make it work, the walker zapping
functions such as zap_*_range would need an is_vm_hugetlb_page(vma)
check and, if true, would have to use hugetlb specific page table
routines instead of the generic routines.  In many cases, the hugetlb
specific page table routines are the same as the generic routines.
But, there are a few exceptions.
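As a standalone sketch of the switch-statement rewrite mentioned above (this is not the kernel code: the level sizes are hard-coded to hypothetical x86_64 5-level paging values so it compiles in userspace, and the function takes the huge page size directly instead of a struct hstate *):

```c
#include <assert.h>

/* Hypothetical constants modeling x86_64 with 5-level paging and 4KB
 * base pages (512 entries per table); the kernel derives these from
 * the architecture's page table layout. */
#define PMD_SIZE	(1UL << 21)	/* 2 MB */
#define PUD_SIZE	(1UL << 30)	/* 1 GB */
#define P4D_SIZE	(1UL << 39)	/* 512 GB */
#define PGDIR_SIZE	(1UL << 48)	/* 256 TB */

/*
 * Switch-based variant of hugetlb_mask_last_hp(): for a huge page of
 * the given size, return the offset of the last huge page entry that
 * shares the same next-higher-level table entry, i.e. how far the
 * unmap walker may advance in one step when that upper-level entry
 * turns out to be empty.
 */
static unsigned long hugetlb_mask_last_hp(unsigned long hp_size)
{
	switch (hp_size) {
	case P4D_SIZE:
		return PGDIR_SIZE - P4D_SIZE;
	case PUD_SIZE:
		return P4D_SIZE - PUD_SIZE;
	case PMD_SIZE:
		return PUD_SIZE - PMD_SIZE;
	}
	return ~0UL;	/* unknown size: no skipping */
}
```

With these particular constants each difference equals hp_size * 511 (512 entries per table), matching the PTRS_PER_PTE formula; the point of the switch is that each level's size is spelled out explicitly, so the code stays correct even if the per-level table sizes ever diverge.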
IMO, it would be better to first try to clean up and unify those
routines.  That would make changes to the walker routines less
invasive and easier to maintain.  I believe there is other code that
would benefit from such a cleanup as well.  Unless there are strong
objections, I suggest we move forward with the optimization here and
move the cleanup and possible walker changes to a later series.
--
Mike Kravetz