Le 17/05/2024 à 16:27, Oscar Salvador a écrit : > On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote: >> This series reimplements hugepages with hugepd on powerpc 8xx. >> >> Unlike most architectures, powerpc 8xx HW requires a two-level >> pagetable topology for all page sizes. So a leaf PMD-contig approach >> is not feasible as such. >> >> Possible sizes are 4k, 16k, 512k and 8M. >> >> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries >> must point to a single entry level-2 page table. Until now that was >> done using hugepd. This series changes it to use standard page tables >> where the entry is replicated 1024 times on each of the two pagetables >> refered by the two associated PMD entries for that 8M page. >> >> At the moment it has to look into each helper to know if the >> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or >> a lower size. I hope this can me handled by core-mm in the future. >> >> There are probably several ways to implement stuff, so feedback is >> very welcome. > > > Hi Christophe, > > I have been looking into this because I am interested in the ongoing work of > the hugetlb unification, but my knowledge of ppc pagetables tends to zero, > So be prepared for some stupid questions. > > First, let me have a clear picture of the current situation: > > power8xx has 4KB, 16KB, 512KB, and 8MB page sizes, and operate on a 2Level > pagetables. Wiki [1] mentions PGD + PTE, here you seem to be referring them > as PMD + PTE though. > > And we can have 1024 PGDs, each of one covers 4MB, so we can cover a total of > of 4GB. > > Looking at the page table diagram for power8xx, it seems power8xx has also some > sort of CONTIG_PTE? (same as arm64 does) So we can have contig_ptes representing > bigger page sizes? > I also guess that although power8xx supports all these different sizes, only one > of them can be active at any time, right? Don't know what you mean by "active at any time". In a running system with PAGE_SIZE defined as 4k, you can at any time have some hugepages of size 16K, some 512K and some 8M. > > It also seems that this whole hugepd thing is only used when we are using 8MB > PAGE_SIZE, right? Today yes. In the past it was also used for 512K pages, until commit b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.") > And when that is active, we do have 2 PGDs(4MB each) pointing to the same 8MB > hugepd. > E.g: > ____________ > [PGD#0] ------------> | | > | 8MB hugepd | > [PGD#1] ------------> |____________| > > What you want to do with this work is to get rid of that hugepd abstraction > because it is something power8xx/hugetlb specific and cannot be represented > with our "normal" page table layout (PGD,P4D,PUD,PMD,PTE). It is more than 8xx, also used on e500 and book3s/64 but for sure that's specific to powerpc and it would help to get rid of it completely. > I did not check, but I guess we cannot walk the hugepd thing with a normal > page table walker, or can we? (how special is a hugepd? can you describe its > internal layout?) depends on what you mean by "normal". For instance walk_page_range_novma() handles hugepd. > > So, what you proprose, is something like the following? > > [PGD#X] -----------> [PTE#0] > -----------> [PTE..#1023] > [PGD#Y] -----------> [PTE#0] > -----------> [PTE..#1023] > > so a 8MB hugepage will be covered by PGD#X and PGD#Y using contiguos PTEs. > > The diagram at [1] for 8xx 16K seems a bit misleading to me (or maybe it is > just me). They say that a Level2 table (aka PTE) covers 4KB chunks regardless > of the pagesize, but either I read that wrong or..else. What it means is that when PAGE_SIZE is 16k, the pte_t is a table of 4 long, see https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/include/asm/pgtable-types.h#L11 > Because on 16K page size, they show that each pte covers for 16KB memory chunk. > But that would mean 16KB << 10 = 16M each PGD, which is not really that, so what > is the deal there? Or it is just that we always use 4KB PTEs, and use contiguous > PTEs for bigger sizes? In a way yes, we cheat the HW by defining a PTE as a table of 4 u32 values to behave like a cont-PTE. > > Now, it seems that power8xx has no-p4d, no-pud and no-pmd, right? > > Peter mentioned that we should have something like: > > X X > [PGD] - [P4D] - [PUD] - [PMD] - [PTE] > > where the PMD and PTE would be the ones we use for representing the 2Level > ptage table, and PGD,P4D and PUD would just be dummies. > > But, is not the convention to at least have PGD-PTE always, and have anything > in between as optional? E.g: > > X ? ? ? X > [PGD] - [P4D] - [PUD] - [PMD] - [PTE] > That's also my understanding hence the discussion with Peter. The fog is that there is a mix between PGD and PMD, for instance to populate a PGD entry you use pmd_populate(). To clear a PGD entry you use pmd_clear, pgd_clear only exists when you have P4Ds. > I mean, are page table walkers ready to deal with non-PGD? I thought they were > not. Yes that's how it is today but Peter was thinking for the future. > > Also, in patch#1, you mentioned: > > "At the time being, for 512k pages the flag is kept in the PTE and inserted in > the PMD entry at TLB miss exception". > > Can you point out where to look for that in the code? https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/kernel/head_8xx.S#L219 and https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/kernel/head_8xx.S#L277 > > Also, what exactly is the "sz" parameter that gets passed down to pmd_populate_size()? > Is the size of the current mapping we are establishing? > I see that you only make a distinction when the mapping size is 8MB. > So the PMD will have _PMD_PAGE_8MB, so it works that all 1024 PTEs below are contiguous > representing a 4MB chunk? > > I will start looking deeper into this series on Monday, but just wanted to have a better > insight of what is going on. > > PD: I think we could make the changelog of the coverletter a bit fatter and cover some > details in there, e.g: layout of page-tables for different page sizes, layout of hugepd, > expected layout after the work, etc. > I think it would help in reviewing this series. I prefer keeping details in individual patches and keep the cover letter for a high level summary and only administrative information because information in cover letter are usually lost on the long term. > > Thanks! > > [1] https://github.com/linuxppc/wiki/wiki/Huge-pages > >