On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> Hi All,
>
> This is v2 of a series to implement variable order, large folios for anonymous
> memory. The objective of this is to improve performance by allocating larger
> chunks of memory during anonymous page faults. See [1] for background.

Thanks for the quick response!

> I've significantly reworked and simplified the patch set based on comments from
> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> VARIABLE_THP, on Yu's advice.
>
> The last patch is for arm64 to explicitly override the default
> arch_wants_pte_order() and is intended as an example. If this series is accepted
> I suggest taking the first 4 patches through the mm tree and the arm64 change
> could be handled through the arm64 tree separately. Neither has any build
> dependency on the other.
>
> The one area where I haven't followed Yu's advice is in the determination of the
> size of folio to use. It was suggested that I have a single preferred large
> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> being existing overlapping populated PTEs, etc) then fallback immediately to
> order-0. It turned out that this approach caused a performance regression in the
> Speedometer benchmark.

I suppose it's a regression against v1, not against the unpatched kernel.

> With my v1 patch, there were significant quantities of
> memory which could not be placed in the 64K bucket and were instead being
> allocated for the 32K and 16K buckets. With the proposed simplification, that
> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
> to the v1 patch (although due to the 64K bucket, this number is still a bit
> lower than the baseline). So instead, I continue to calculate a folio order that
> is somewhere between the preferred order and 0. (See below for more details).

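For anyone following the thread, the two policies being compared above can be
sketched roughly as below. This is a user-space illustration only, not code
from the series: struct range, block_fits(), order_stepwise() and
order_all_or_nothing() are hypothetical names, 4K base pages are assumed, and
only the "exceeds VMA bounds" condition is modelled (already-populated PTEs are
ignored).

```c
#include <stdbool.h>

#define PAGE_SHIFT 12UL                 /* assumption: 4K base pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical stand-in for the VMA bounds; not the kernel's struct. */
struct range {
	unsigned long start;            /* inclusive, page aligned */
	unsigned long end;              /* exclusive, page aligned */
};

/*
 * True if the naturally aligned block of 2^order pages that contains
 * addr lies entirely inside r.
 */
static bool block_fits(const struct range *r, unsigned long addr, int order)
{
	unsigned long size = PAGE_SIZE << order;
	unsigned long base = addr & ~(size - 1);

	return base >= r->start && base + size <= r->end;
}

/* v1/v2-style policy: step down from the preferred order until one fits. */
static int order_stepwise(const struct range *r, unsigned long addr,
			  int preferred)
{
	for (int order = preferred; order > 0; order--)
		if (block_fits(r, addr, order))
			return order;
	return 0;
}

/* Suggested simplification: the preferred order, or straight to order-0. */
static int order_all_or_nothing(const struct range *r, unsigned long addr,
				int preferred)
{
	return block_fits(r, addr, preferred) ? preferred : 0;
}
```

With a 48K range and a preferred order of 4 (64K), the stepwise policy can
still pick 32K or 16K blocks, while the all-or-nothing policy always ends up at
4K, which is the difference in bucket usage described in the quoted paragraph.
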
I suppose the benchmark wasn't running under memory pressure, which is
uncommon for client devices. It could easily be the other way around:
using 32/16KB shows a regression whereas order-0 shows better
performance under memory pressure.

I'm not sure we should use v1 as the baseline. The unpatched kernel
sounds more reasonable at this point. If 32/16KB is proven to be better
in most scenarios, including under memory pressure, we can reintroduce
that policy. I highly doubt this is the case: we tried 16KB base page
size on client devices, and overall, the regressions outweigh the
benefits.

> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I have a branch at [3].

It's not clear to me why [2] is a hard dependency.

It seems to me we are getting close and I was hoping we could get into
mm-unstable soon without depending on other series...