On Mon, Oct 05, 2020 at 03:12:55PM -0400, Zi Yan wrote:
> On 5 Oct 2020, at 11:55, Matthew Wilcox wrote:
> > One of the longer-term todo items is to support variable sized THPs for
> > anonymous memory, just like I've done for the pagecache.  With that in
> > place, I think scaling up from PMD sized pages to PUD sized pages starts
> > to look more natural.  Itanium and PA-RISC (two architectures that will
> > never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> > The RiscV spec you pointed me at the other day confines itself to adding
> > support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> > sizes would be possible additions in the future.
>
> Just to understand the todo items clearly: with your pagecache patchset,
> the kernel should be able to understand variable sized THPs whether they
> are anonymous or not, right?

... yes ... modulo bugs and places I didn't fix because only anonymous
pages can get there ;-)  There are still quite a few references to
HPAGE_PMD_MASK / SIZE / NR, and I couldn't swear that they're all related
to things which are actually PMD sized.  I did fix a couple of places
where the anonymous path assumed that pages were PMD sized, because I
thought we'd probably want to do that sooner rather than later.

> For anonymous memory, we need kernel policies to decide what THP sizes
> to use at allocation, what to do when under memory pressure, and so on.
> In terms of implementation, the THP split function needs to support
> splitting from any order to any lower order.  Anything I am missing here?

I think that's the bulk of the work.  The swap code also needs work so
we don't have to split pages to swap them out.

> > I think I'm leaning towards not merging this patchset yet.  I'm in
> > agreement with the goals (allowing systems to use PUD-sized pages
> > automatically), but I think we need to improve the infrastructure to
> > make it work well automatically.  Does that make sense?
> I agree that this patchset should not be merged in its current form.
> I think PUD THP support is a part of variable sized THP support, but
> the current form of the patchset does not have the "variable sized THP"
> spirit yet and is more like special-case support for PUD.  I guess some
> changes to existing THP code to make PUD THP less of a special case
> would make the whole patchset more acceptable?
>
> Can you elaborate more on the infrastructure part?  Thanks.

Oh, this paragraph was just summarising the above.  We need to be
consistently using thp_size() instead of HPAGE_PMD_SIZE, etc.  I haven't
put much effort yet into supporting pages which are larger than PMD size
-- that is, if a page is mapped with a PMD entry, we assume it's PMD
sized.  Once we can allocate a larger-than-PMD sized page, that
assumption no longer holds.  I assume a lot of that is dealt with in
your patchset, although I haven't audited it to check.

> > (*) It would be nice if hardware provided a way to track D/A on a
> > sub-PTE level when using PMD/PUD sized mappings.  I don't know of any
> > that does that today.
>
> I agree it would be a nice hardware feature, but it also has a high
> cost.  Each TLB entry would support this with 1024 bits, which is about
> 16 TLB entries' worth of space, assuming each entry takes 8B.  Now it
> becomes "why not have a bigger TLB". ;)

Oh, we don't have to track at the individual-page level for this to be
useful.  Let's take the RISC-V Sv39 page table entry format as an example:

63-54	attributes
53-28	PPN2
27-19	PPN1
18-10	PPN0
9-8	RSW
7-0	DAGUXWRV

For a 2MB page, we currently insist that bits 18-10 are zero.  If we
repurpose eight of those nine bits as A/D bits, we can track at 512kB
granularity.  For 1GB pages, we can use 16 of the 18 bits to track A/D
at 128MB granularity.  It's not great, but it is quite cheap!