On 10/04/2024 at 17:28, Peter Xu wrote:
> On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
>> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
>>> In short, hugetlb mappings shouldn't be special compared to other huge pXd
>>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
>>> not all. I need to look at all the walkers and there can be some tricky
>>> ones, but I believe that applies in general. It's actually similar to what
>>> I did with slow gup here.
>>
>> I think that is the big question, I also haven't done the research to
>> know the answer.
>>
>> At this point focusing on moving what is reasonable to the pXX_* API
>> makes sense to me. Then reviewing what remains and making some
>> decision.
>>
>>> Like this series, for cont-pXd we'll need multiple walks compared to
>>> before (when with hugetlb_entry()), but for that part I'll provide some
>>> performance tests too, and we also have a fallback plan, which is to detect
>>> cont-pXd existence, which will also work for large folios.
>>
>> I think we can optimize this pretty easy.
>>
>>>> I think if you do the easy places for pXX conversion you will have a
>>>> good idea about what is needed for the hard places.
>>>
>>> Here IMHO we don't need to understand "what is the size of this hugetlb
>>> vma"
>>
>> Yeh, I never really understood why hugetlb was linked to the VMA.. The
>> page table is self describing, obviously.
>
> Attaching to vma still makes sense to me, where we should definitely avoid
> a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> allocated, managed, ... totally differently.
>
> And since hugetlb is designed as file-based (which also makes sense to me,
> at least for now), it's also natural that it's vma-attached.
>
>>
>>> or "which level of pgtable does this hugetlb vma pages locate",
>>
>> Ditto
>>
>>> because we may not need that, e.g., when we only want to collect some smaps
>>> statistics. "whether it's hugetlb" may matter, though. E.g. in the mm
>>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
>>> hugetlb_entry removed), we may need extra check later to put things into
>>> the right bucket, but for the walker itself it doesn't necessarily need
>>> hugetlb_entry().
>>
>> Right, places may still need to know it is part of a huge VMA because we
>> have special stuff linked to that.
>>
>>>> But then again we come back to power and its big list of page sizes
>>>> and variety :( Looks like some there have huge sizes at the pgd level
>>>> at least.
>>>
>>> Yeah this is something I want to be super clear, because I may miss
>>> something: we don't have real pgd pages, right? Powerpc doesn't even
>>> define p4d_leaf(), AFAICT.
>>
>> AFAICT it is because it hides it all in hugepd.
>
> IMHO one thing we can benefit from such hugepd rework is, if we can squash
> all the hugepds like what Christophe does, then we push it one more layer
> down, and we have a good chance all things should just work.
>
> So again my Power brain is close to zero, but now I'm referring to what
> Christophe shared in the other thread:
>
>   https://github.com/linuxppc/wiki/wiki/Huge-pages
>
> Together with:
>
>   https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@xxxxxxxxxx
>
> Where it has:
>
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
>         select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>         select ARCH_ENABLE_SPLIT_PMD_PTLOCK
>         select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> +       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
>         select ARCH_SUPPORTS_HUGETLBFS
>         select ARCH_SUPPORTS_NUMA_BALANCING
>         select HAVE_MOVE_PMD
> @@ -290,6 +291,7 @@ config PPC_BOOK3S
>  config PPC_E500
>         select FSL_EMB_PERFMON
>         bool
> +       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
>         select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
>         select PPC_SMP_MUXED_IPI
>         select PPC_DOORBELL
>
> So I think it means we have three PowerPC systems that support hugepd
> right now (besides the 8xx which Christophe is trying to drop support
> there), besides 8xx we still have book3s_64 and E500.
>
> Let's check one by one:
>
>   - book3s_64
>
>     - hash
>
>       - 64K: p4d is not used, largest pgsize pgd 16G @pud level. It
>         means after squashing it'll be a bunch of cont-pmd, all good.
>
>       - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
>         it'll be cont-pud. all good.
>
>     - radix
>
>       - 64K: largest 1G @pud, then cont-pmd after squashed. all good.
>
>       - 4K: largest 1G @pud, then cont-pmd, all good.
>
>   - e500 & 8xx
>
>     - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
>       @pgd level they become cont-pte. all good.

e500 has two modes: 32 bits and 64 bits.

For 32 bits:

8xx is the only one handling it through HW-assisted pagetable walk, hence
requiring a 2-level page table whatever the page size is.

On e500 it is all software so pages 2M and larger should be cont-PGD (by
the way I'm a bit puzzled that on arches that have only 2 levels, i.e. PGD
and PTE, the PGD entries are populated by a function called
PMD_populate()).

Current situation for 8xx is illustrated here:
https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx

I also tried to better illustrate e500/32 here:
https://github.com/linuxppc/wiki/wiki/Huge-pages#e500

For 64 bits:
We have PTE/PMD/PUD/PGD, no P4D

See arch/powerpc/include/asm/nohash/64/pgtable-4k.h

>
> I think the trick here is there'll be no pgd leaves after hugepd squashing
> to lower levels, then since PowerPC seems to never have p4d, then all
> things fall into pud or lower. We seem to be all good there?
>
>>
>> If the goal is to purge hugepd then some of the options might turn out
>> to convert hugepd into huge p4d/pgd, as I understand it. It would be
>> nice to have certainty on this at least.
>
> Right. I hope the pmd/pud plan I proposed above can already work with
> such an ambitious goal too. But review is very welcome from either you or
> Christophe.
>
> PS: I think I'll also have a closer look at Christophe's series this week
> or next.
>
>>
>> We have effectively three APIs to parse a single page table and
>> currently none of the APIs can return 100% of the data for power.
>
> Thanks,
>