Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory

Yu Zhao <yuzhao@xxxxxxxxxx> · Mon, 3 Jul 2023 20:18:50 -0600

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> Hi All,
>
> This is v2 of a series to implement variable order, large folios for anonymous
> memory. The objective of this is to improve performance by allocating larger
> chunks of memory during anonymous page faults. See [1] for background.

Thanks for the quick response!

> I've significantly reworked and simplified the patch set based on comments from
> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> VARIABLE_THP, on Yu's advice.
>
> The last patch is for arm64 to explicitly override the default
> arch_wants_pte_order() and is intended as an example. If this series is accepted
> I suggest taking the first 4 patches through the mm tree and the arm64 change
> could be handled through the arm64 tree separately. Neither has any build
> dependency on the other.
>
> The one area where I haven't followed Yu's advice is in the determination of the
> size of folio to use. It was suggested that I have a single preferred large
> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> being existing overlapping populated PTEs, etc) then fallback immediately to
> order-0. It turned out that this approach caused a performance regression in the
> Speedometer benchmark.

I suppose it's regression against the v1, not the unpatched kernel.

> With my v1 patch, there were significant quantities of
> memory which could not be placed in the 64K bucket and were instead being
> allocated for the 32K and 16K buckets. With the proposed simplification, that
> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
> to the v1 patch (although due to the 64K bucket, this number is still a bit
> lower than the baseline). So instead, I continue to calculate a folio order that
> is somewhere between the preferred order and 0. (See below for more details).

I suppose the benchmark wasn't running under memory pressure, which is
uncommon for client devices. It could be easier the other way around:
using 32/16KB shows regression whereas order-0 shows better
performance under memory pressure.

I'm not sure we should use v1 as the baseline. Unpatched kernel sounds
more reasonable at this point. If 32/16KB is proven to be better in
most scenarios including under memory pressure, we can reintroduce
that policy. I highly doubt this is the case: we tried 16KB base page
size on client devices, and overall, the regressions outweighs the
benefits.

> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I have a branch at [3].

It's not clear to me why [2] is a hard dependency.

It seems to me we are getting close and I was hoping we could get into
mm-unstable soon without depending on other series...