On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
>
>
> On 7/4/23 23:36, Ryan Roberts wrote:
> > On 04/07/2023 08:11, Yu Zhao wrote:
> >> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>
> >>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> >>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This is v2 of a series to implement variable order, large folios for anonymous
> >>>>> memory. The objective of this is to improve performance by allocating larger
> >>>>> chunks of memory during anonymous page faults. See [1] for background.
> >>>>
> >>>> Thanks for the quick response!
> >>>>
> >>>>> I've significantly reworked and simplified the patch set based on comments from
> >>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> >>>>> VARIABLE_THP, on Yu's advice.
> >>>>>
> >>>>> The last patch is for arm64 to explicitly override the default
> >>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
> >>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
> >>>>> could be handled through the arm64 tree separately. Neither has any build
> >>>>> dependency on the other.
> >>>>>
> >>>>> The one area where I haven't followed Yu's advice is in the determination of the
> >>>>> size of folio to use. It was suggested that I have a single preferred large
> >>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> >>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
> >>>>> order-0. It turned out that this approach caused a performance regression in the
> >>>>> Speedometer benchmark.
> >>>>
> >>>> I suppose it's regression against the v1, not the unpatched kernel.
> >>> From the performance data Ryan shared, it's against unpatched kernel:
> >>>
> >>> Speedometer 2.0:
> >>>
> >>> | kernel                         |   runs_per_min |
> >>> |:-------------------------------|---------------:|
> >>> | baseline-4k                    |           0.0% |
> >>> | anonfolio-lkml-v1              |           0.7% |
> >>> | anonfolio-lkml-v2-simple-order |          -0.9% |
> >>> | anonfolio-lkml-v2              |           0.5% |
> >>
> >> I see. Thanks.
> >>
> >> A couple of questions:
> >> 1. Do we have a stddev?
> >
> > | kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
> > |:--------------------------|-----------:|----------:|-----------:|----------:|
> > | baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
> > | anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
> > | anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
> > | anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |
> >
> > This is with 3 runs per reboot across 5 reboots, with first run after reboot
> > trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
> > points per kernel in total.
> >
> > I've rerun the test multiple times and see similar results each time.
> >
> > I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
> > see the same performance as baseline-4k.
> >
> >
> >> 2. Do we have a theory why it regressed?
> >
> > I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
> > mean when we fault, order-4 is often too big to fit in the VMA. So we fallback
> > to order-0. I guess this is happening so often for this workload that the cost
> > of doing the checks and fallback is outweighing the benefit of the memory that
> > does end up with order-4 folios.
> >
> > I've sampled the memory in each bucket (once per second) while running and its
> > roughly:
> >
> > 64K: 25%
> > 32K: 15%
> > 16K: 15%
> > 4K: 45%
> >
> > 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> > But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and
> > the 64K contents is more static - that's just a guess though.
> So this is like out of vma range thing.
>
> >
> >> Assuming no bugs, I don't see how a real regression could happen --
> >> falling back to order-0 isn't different from the original behavior.
> >> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> >
> > I can, but it will have to be a bit later in the week. I'll do some more test
> > runs overnight so we have a larger number of runs - hopefully that might tell us
> > that this is noise to a certain extent.
> >
> > I'd still like to hear a clear technical argument for why the bin-packing
> > approach is not the correct one!
> My understanding to Yu's (Yu, correct me if I am wrong) comments is that we
> postpone this part of change and make basic anon large folio support in. Then
> discuss which approach we should take. Maybe people will agree retry is the
> choice, maybe other approach will be taken...
>
> For example, for this out of VMA range case, per VMA order should be considered.
> We don't need make decision that the retry should be taken now.

I've articulated the reasons in another email; to summarize the most important
point here: using more fallback orders makes a system reach equilibrium faster,
at which point it can no longer allocate folios of the order returned by
arch_wants_pte_order(). IOW, this best-fit policy can reduce the number of
folios of the h/w preferred order on a system that has been running long enough.
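
To make the two policies being compared concrete, below is a minimal userspace
sketch; it is a toy model under stated assumptions, not the code from either
version of the series. It assumes 4K base pages (so order 4 is the 64K case
discussed above), uses PREFERRED_ORDER as a stand-in for whatever
arch_wants_pte_order() returns, and the names fits(), simple_order() and
best_fit_order() are made up for illustration. It also only models the
VMA-bounds check and ignores the already-populated-PTE check that the real
fault path has to perform.

/*
 * Toy model of the two fault-time order-selection policies discussed in this
 * thread. Not the actual patch code.
 */
#include <stdio.h>

#define PAGE_SHIFT      12
#define PREFERRED_ORDER 4       /* stand-in for arch_wants_pte_order() */

/* Would a naturally aligned folio of this order stay inside the VMA? */
static int fits(unsigned long addr, unsigned long vma_start,
                unsigned long vma_end, int order)
{
        unsigned long size = 1UL << (order + PAGE_SHIFT);
        unsigned long start = addr & ~(size - 1);       /* align down */

        return start >= vma_start && start + size <= vma_end;
}

/* Policy A (v2-simple-order): preferred order, else fall straight to 0. */
static int simple_order(unsigned long addr, unsigned long vma_start,
                        unsigned long vma_end)
{
        return fits(addr, vma_start, vma_end, PREFERRED_ORDER) ?
                PREFERRED_ORDER : 0;
}

/* Policy B (v2, "bin-packing"): step down through smaller orders first. */
static int best_fit_order(unsigned long addr, unsigned long vma_start,
                          unsigned long vma_end)
{
        int order;

        for (order = PREFERRED_ORDER; order > 0; order--)
                if (fits(addr, vma_start, vma_end, order))
                        return order;
        return 0;
}

int main(void)
{
        /* Fault at the start of a 48K VMA: too small for a 64K folio. */
        unsigned long vma_start = 0x10000, vma_end = 0x1c000, addr = 0x10000;

        printf("simple-order policy: order %d\n",
               simple_order(addr, vma_start, vma_end));         /* 0 -> 4K  */
        printf("best-fit policy:     order %d\n",
               best_fit_order(addr, vma_start, vma_end));       /* 3 -> 32K */
        return 0;
}

For a fault at the start of a 48K VMA, the simple-order policy drops straight
to order-0 (4K) while the best-fit policy still gets an order-3 (32K) folio;
allocations like that are what show up in the 32K/16K buckets sampled above
and fold into the 4K bucket under anonfolio-v2-simple-order.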