On 7/4/2023 10:18 AM, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> Hi All,
>>
>> This is v2 of a series to implement variable order, large folios for anonymous
>> memory. The objective of this is to improve performance by allocating larger
>> chunks of memory during anonymous page faults. See [1] for background.
>
> Thanks for the quick response!
>
>> I've significantly reworked and simplified the patch set based on comments from
>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
>> VARIABLE_THP, on Yu's advice.
>>
>> The last patch is for arm64 to explicitly override the default
>> arch_wants_pte_order() and is intended as an example. If this series is accepted,
>> I suggest taking the first 4 patches through the mm tree; the arm64 change
>> could be handled through the arm64 tree separately. Neither has any build
>> dependency on the other.
>>
>> The one area where I haven't followed Yu's advice is in the determination of the
>> size of folio to use. It was suggested that I have a single preferred large
>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
>> being existing overlapping populated PTEs, etc.) then fall back immediately to
>> order-0. It turned out that this approach caused a performance regression in the
>> Speedometer benchmark.
>
> I suppose it's a regression against v1, not the unpatched kernel.

From the performance data Ryan shared, it's against the unpatched kernel:

Speedometer 2.0:

| kernel                         |   runs_per_min |
|:-------------------------------|---------------:|
| baseline-4k                    |           0.0% |
| anonfolio-lkml-v1              |           0.7% |
| anonfolio-lkml-v2-simple-order |          -0.9% |
| anonfolio-lkml-v2              |           0.5% |

What if we use 32K or 16K instead of 64K as the default anonymous folio
size? I suspect this app's sweet spot may be 32K or 16K anon folios.

Regards
Yin, Fengwei

>
>> With my v1 patch, there were significant quantities of
>> memory which could not be placed in the 64K bucket and were instead being
>> allocated for the 32K and 16K buckets. With the proposed simplification, that
>> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
>> to the v1 patch (although, due to the 64K bucket, this number is still a bit
>> lower than the baseline). So instead, I continue to calculate a folio order that
>> is somewhere between the preferred order and 0. (See below for more details.)
>
> I suppose the benchmark wasn't running under memory pressure, which is
> uncommon for client devices. It could easily be the other way around:
> using 32/16KB shows a regression whereas order-0 shows better
> performance under memory pressure.
>
> I'm not sure we should use v1 as the baseline. The unpatched kernel sounds
> more reasonable at this point. If 32/16KB is proven to be better in
> most scenarios, including under memory pressure, we can reintroduce
> that policy. I highly doubt this is the case: we tried a 16KB base page
> size on client devices, and overall, the regressions outweighed the
> benefits.
>
>> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
>> [2], which is a hard dependency. I have a branch at [3].
>
> It's not clear to me why [2] is a hard dependency.
>
> It seems to me we are getting close, and I was hoping we could get into
> mm-unstable soon without depending on other series...
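
To make the policy under discussion concrete, the "folio order somewhere between
the preferred order and 0" approach can be pictured as a step-down loop over
orders. The sketch below is illustrative only and is not code from the series;
the function name, its signature, and the pte_range_none() helper are all
hypothetical.

/*
 * Illustrative sketch only -- not taken from the series. The function
 * name, signature, and the pte_range_none() helper are hypothetical.
 *
 * Start at the arch-preferred order and step down until the naturally
 * aligned range both fits inside the VMA and contains no populated
 * PTEs; otherwise fall back to order-0.
 */
static int calc_anon_folio_order(struct vm_area_struct *vma,
				 unsigned long addr, pte_t *pte,
				 int preferred_order)
{
	int order;

	for (order = preferred_order; order > 0; order--) {
		unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE << order);
		unsigned long end = start + (PAGE_SIZE << order);
		/* PTE covering 'start', assuming 'pte' maps 'addr'. */
		pte_t *first_pte = pte - ((addr - start) >> PAGE_SHIFT);

		/* The whole candidate range must lie inside the VMA. */
		if (start < vma->vm_start || end > vma->vm_end)
			continue;

		/* Hypothetical helper: true if no PTE in the range is present. */
		if (!pte_range_none(first_pte, 1 << order))
			continue;

		return order;
	}

	return 0;	/* a single order-0 (4K) page */
}

With this shape, memory that cannot use the largest (64K) bucket can still land
in a 32K or 16K bucket rather than dropping straight to 4K, which matches the
page-fault behaviour described above for v1 versus the simplified policy.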
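
On the 32K/16K question: with a 4K base page size, the sizes being discussed map
to PTE orders as 16K = order 2, 32K = order 3, 64K = order 4, so the default folio
size is ultimately just the order the architecture asks for. The fragment below is
a hypothetical illustration of an arch override of arch_wants_pte_order(); the
no-argument signature and the constant are assumptions, not the actual arm64 patch
from the series.

/*
 * Hypothetical illustration only; see the real arm64 patch in the
 * series for the actual definition. With 4K base pages:
 * order 2 = 16K, order 3 = 32K, order 4 = 64K.
 */
#define arch_wants_pte_order arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
	return 4;	/* prefer 64K; returning 3 or 2 would prefer 32K or 16K */
}

Assuming the hook keeps roughly this shape, experimenting with Fengwei's
suggestion would be a one-line change to the returned order (or to the generic
default).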