I'd like to throw in my 2p here. Quick disclaimer first though: I'm new to mm so I'm sure I'll say a bunch of dumb stuff - please go easy ;-)

In the few workloads that I'm focused on, I can see a disparity in performance between a kernel configured for 4K vs 16K pages. My goal is to release the extra performance we see in the 16K variant into the 4K variant.

My results show that ~half of the uplift is down to SW efficiency savings in the kernel: 4x fewer data aborts (the vast majority for anon memory) and less effort spent in mm-heavy syscalls (as expected). The other ~half is down to HW: the TLB has less pressure, which causes everything to speed up a bit. But this "bit" is important, given that most of the time is spent in user space, which only benefits from the HW part. See [1] for more details.

Newer Arm CPUs have a uarch feature called Hardware Page Aggregation (HPA). This allows the TLB to aggregate 2-8 physically- and virtually-contiguous pages into a single TLB entry to reduce pressure. (Note this is separate from the contig bit and is invisible from a SW programming perspective.)

So my hope is that I can get the equivalent SW efficiencies with a 4K base page size and large anonymous folios, and also benefit from a lot of the HW performance due to it all naturally fitting HPA's requirements.

On 21/02/2023 21:49, Matthew Wilcox wrote:
> In a sense this question is premature, because we don't have any code
> in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> but let's pretend that code already exists and is just waiting for us
> to answer this policy question.
>
> I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> option and 3. a sysfs tunable. It is foolish to expect the distro
> packager or the sysadmin to be able to make such a decision. The
> correct decision will depend upon the instantaneous workload of the
> entire machine and we'll want different answers for different VMAs.
>
> I'm open to applications having some kind of madvise() call they can
> use to specify hints, but I would prefer to handle memory efficiently
> for applications which do not.

Firmly agree.

>
> For pagecache memory, we use the per-fd readahead code; if readahead has
> been successful in the past we bump up the folio size until it reaches
> its maximum. There is no equivalent for anonymous memory.
>
> I'm working my way towards a solution that looks a little like this:
>
> A. We modify khugepaged to quadruple the folio size each time it scans.
>    At the moment, it always attempts to promote straight from order 0
>    to PMD size. Instead, if it finds four adjacent order-0 folios,
>    it will allocate an order-2 folio to replace them. Next time it
>    scans, it finds four order-2 folios and replaces them with a single
>    order-4 folio. And so on, up to PMD order.

From the SW efficiencies perspective, what is the point of doing a replacement after you have allocated all the order-0 folios? Surely that just adds more overhead? I think the aim has to be to try to allocate the correct order up front to cut down the allocation cost; one order-2 allocation is ~4x less expensive than 4x order-0 allocations, right?

I wonder if it is preferable to optimistically allocate a mid-order folio to begin with, then later choose to split or merge from there? Perhaps these folios could initially go on a separate list to make them faster to split and reclaim the unused portions when under memory pressure?
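To make that a bit more concrete, here's the rough shape of what I'm imagining for the fault path. This is purely illustrative and untested: the helper name is made up, I'm just reusing vma_alloc_folio() as it exists today, and all the checks that the aligned range actually fits inside the VMA, plus zeroing and mapping, are omitted.

#include <linux/mm.h>
#include <linux/gfp.h>

/*
 * Illustrative only: on an anonymous fault, speculatively try a small
 * high-order allocation and fall back to order-0 if that fails. A real
 * version would probably want something like __GFP_NORETRY on the
 * high-order attempt so we don't stall in reclaim just to avoid the
 * fallback.
 */
static struct folio *alloc_anon_folio_speculative(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	int order;

	/* Prefer order-2 (16K with 4K base pages), else plain order-0. */
	for (order = 2; order >= 0; order -= 2) {
		unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
		struct folio *folio;

		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma,
					addr, false);
		if (folio)
			return folio;
	}

	return NULL;
}

The interesting policy question is then how to pick that starting order per VMA rather than hard-coding 2.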
(My data/workloads suggest 16K allocations are the knee, and making them bigger than that doesn't proportionally improve performance.)

>
> B. A further modification is that it will require three of the four
>    folios being combined to be on the active list. If two (or more)
>    of the four folios are inactive, we should leave them alone; either
>    they will remain inactive and eventually be evicted, or they will be
>    activated and eligible for merging in a future pass of khugepaged.
>
> C. We add a new wrinkle to the LRU handling code. When our scan of the
>    active list examines a folio, we look to see how many of the PTEs
>    mapping the folio have been accessed. If it is fewer than half, and
>    those half are all in either the first or last half of the folio, we
>    split it. The active half stays on the active list and the inactive
>    half is moved to the inactive list.
>
> I feel that these three changes should allow us to iterate towards a
> solution for any given VMA that is close to optimal, and adapts to a
> changing workload with no intervention from a sysadmin, or even hint
> from a program.
>
> There are three different circumstances where we currently allocate
> anonymous memory. The first is for mmap(MAP_ANONYMOUS), the second is
> COW on a file-backed MAP_PRIVATE and the third is COW of a post-fork
> anonymous mapping.
>
> For the first option, the only hint we have is the size of the VMA.
> I'm tempted to suggest our initial guess at the right size folio to
> allocate should be scaled to that, although I don't have a clear idea
> about what the scale factor should be.

Ahh - perhaps I misunderstood what you were saying above. My experience has been that order-2 seems to be the knee in terms of performance gain, so perhaps one approach would be to start with order-2 allocations, then adjust based on the observed page fault pattern within the VMA? i.e. if you're getting mostly in-order faults, increase the VMA's scaling factor, and if it's mostly random, decrease. (Just a suggestion based on intuition, so feel free to shoot it down; there's a rough sketch of what I mean at the bottom of this mail.)

>
> For the second case, I want to strongly suggest that the size of the
> folio allocated by the page cache should be of no concern. It is largely
> irrelevant to the application's usage pattern what size the page cache
> has chosen to cache the file. I might start out very conservatively
> here with an order-0 allocation.
>
> For the third case, in contrast, the parent had already established
> an appropriate size folio to use for this VMA before calling fork().
> Whether it is the parent or the child causing the COW, it should probably
> inherit that choice and we should default to the same size folio that
> was already found.
>
> I don't stay current with the research literature, so if someone wants
> to point me to a well-studied algorithm and let me know that I can stop
> thinking about this, that'd be great. And if anyone wants to start
> working on implementing this, that'd also be great.
>
> P.S. I didn't want to interrupt the flow of the above description to
> note that allocation of any high-order folio can and will fail, so
> there will definitely be fallback points to order-0 folios, which will
> be no different from today. Except that maybe we'll be able to iterate
> towards the correct folio size in the new khugepaged.
>
> P.P.S. I still consider myself a bit of a novice in the handling of
> anonymous memory, so don't be shy to let me know what I got wrong.
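Here's the rough sketch I promised for the "start at order-2 and adjust per-VMA" idea. None of this is existing kernel API - the struct, names and thresholds are all made up for illustration, and I'm ignoring where the state would live on the VMA, locking, and what fork() would do with it:

#include <linux/mm.h>

#define ANON_FOLIO_ORDER_START	2	/* 16K with 4K base pages */
#define ANON_FOLIO_ORDER_MAX	4

/* Hypothetical per-VMA state; would need to hang off the VMA somehow. */
struct anon_fault_hint {
	unsigned long last_addr;	/* address of the previous anon fault */
	int score;			/* >0 leans sequential, <0 leans random */
	int order;			/* current guess, starts at ORDER_START */
};

/* Called on each anonymous fault to pick the folio order to try first. */
static int anon_folio_order_hint(struct anon_fault_hint *hint,
				 unsigned long addr)
{
	/* Crude "mostly in-order?" detector: +1 per sequential fault, -1 otherwise. */
	if (addr == hint->last_addr + PAGE_SIZE)
		hint->score++;
	else
		hint->score--;
	hint->last_addr = addr;

	/* Once the pattern looks clear, nudge the order and start again. */
	if (hint->score >= 4) {
		if (hint->order < ANON_FOLIO_ORDER_MAX)
			hint->order++;
		hint->score = 0;
	} else if (hint->score <= -4) {
		if (hint->order > 0)
			hint->order--;
		hint->score = 0;
	}

	return hint->order;
}

The allocation itself would still fall back towards order-0 if the chosen order can't be satisfied, as in the earlier sketch.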
Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/4c991dcb-c5bb-86bb-5a29-05df24429607@xxxxxxx/