Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory

On 26.04.23 12:41, Ryan Roberts wrote:
Hi David,

On 17/04/2023 16:44, David Hildenbrand wrote:


So what should be safe is replacing all sub-pages of a folio that are marked
"maybe shared" by a new folio under PT lock. However, I wonder if it's really
worth the complexity. For THP we were happy so far to *not* optimize this,
implying that maybe we shouldn't worry too heavily about optimizing the
fork() case for now.
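
(For illustration only: a minimal sketch of the per-page check this implies.
PageAnonExclusive() is the existing helper; the function name and loop shape
here are hypothetical, not code from the series.)

    #include <linux/mm.h>
    #include <linux/page-flags.h>

    /*
     * Sketch, not from the series: under the PT lock, report whether
     * any sub-page of the mapped range is "maybe shared", i.e. not
     * marked exclusive to this anon mapping. A caller could then
     * replace exactly those sub-pages with a new folio.
     */
    static bool anon_range_maybe_shared(struct page *page, int nr_pages)
    {
            int i;

            for (i = 0; i < nr_pages; i++) {
                    if (!PageAnonExclusive(page + i))
                            return true;
            }
            return false;
    }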

I don't have the exact numbers to hand, but I'm pretty sure I remember that
enabling large copies contributed a measurable amount to the performance
improvement. (The zero-page copy case is certainly a big contributor.) I
don't have access to the HW at the moment but can rerun later with and
without to double check.

In which test exactly? Some micro-benchmark?

The kernel compile benchmark that I quoted numbers for in the cover letter. I
have some trace points (not part of the submitted series) that tell me how many
mappings of each order we get for each code path. I'm pretty sure I remember
all four of these code paths contributing non-negligible amounts.

Interesting! It would be great to see whether there is an actual difference
with patch #10 applied but without the other COW replacement.


Sorry about the delay. I now have some numbers for this...


Ditto, I'm swamped :) Thanks for running these benchmarks!

As LSF/MM reminded me again of this topic ...

I rearranged the patch order so that all the "utility" stuff (new rmap
functions, etc.) comes first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a
couple of general improvements (7, 17), which should be dormant until we have
the final patches, and then finally (6, 10, 14, 15), which implement large anon
folios in the allocate, reuse, copy-non-zero and copy-zero paths respectively.
I've dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring
we never replace an exclusive page).

I've measured performance at the following locations in the patch set:

- baseline: none of my patches applied
- utility: has utility and general improvement patches applied
- alloc: utility + 6
- reuse: utility + 6 + 10
- copy: utility + 6 + 10 + 14
- zero-alloc: utility + 6 + 10 + 14 + 15

The test is `make defconfig && time make -jN Image` for a clean checkout of
v6.3-rc3. The first result is thrown away and the next 3 are kept. I saw some
per-boot variance (probably down to kaslr, etc.), so I booted each kernel 7
times, for a total of 3x7=21 samples per kernel. Then I took the mean:


jobs=8:

| label      |   real |   user |   kernel |
|:-----------|-------:|-------:|---------:|
| baseline   |   0.0% |   0.0% |     0.0% |
| utility    |  -2.7% |  -2.8% |    -3.1% |
| alloc      |  -6.0% |  -2.3% |   -24.1% |
| reuse      |  -9.5% |  -5.8% |   -28.5% |
| copy       | -10.6% |  -6.9% |   -29.4% |
| zero-alloc |  -9.2% |  -5.1% |   -29.8% |


jobs=160:

| label      |   real |   user |   kernel |
|:-----------|-------:|-------:|---------:|
| baseline   |   0.0% |   0.0% |     0.0% |
| utility    |  -1.8% |  -0.0% |    -7.7% |
| alloc      |  -6.0% |   1.8% |   -20.9% |
| reuse      |  -7.8% |  -1.6% |   -24.1% |
| copy       |  -7.8% |  -2.5% |   -26.8% |
| zero-alloc |  -7.7% |   1.5% |   -29.4% |


So it looks like patch 10 (reuse) is making a difference, but copy and
zero-alloc are not adding a huge amount, as you hypothesized. Personally I
would prefer not to drop those patches though, as they will all help towards
utilization of contiguous PTEs on arm64, which is the second part of the change
that I'm now working on.

Yes, pretty much what I expected :) I can only suggest:

(1) Make the initial support as simple and minimal as possible. That
    means stripping anything that is not absolutely required; that is,
    excluding *at least* copy and zero-alloc. We can always add selected
    optimizations on top later.

    You'll be doing yourself a favor: more review coverage, faster
    review for inclusion, and fewer chances of nasty BUGs.

(2) Keep the COW logic simple. We've had too many issues
    in that area for my taste already. As 09854ba94c6a ("mm:
    do_wp_page() simplification") from Linus puts it: "Simplify,
    simplify, simplify." If it doesn't add significant benefit,
    keep it simple.



For the final config ("zero-alloc") I also collected stats on how many
operations each of the 4 paths was performing, using ftrace and histograms
("pnr" is the number of pages allocated/reused/copied, and "fnr" is the number
of pages in the source folio; a sketch of the kind of trace event used is
shown after the data):


do_anonymous_page:

{ pnr:          1 } hitcount:    2749722
{ pnr:          4 } hitcount:     387832
{ pnr:          8 } hitcount:     409628
{ pnr:         16 } hitcount:    4296115

pages: 76315914
faults: 7843297
pages per fault: 9.7


wp_page_reuse (anon):

{ pnr:          1, fnr:          1 } hitcount:      47887
{ pnr:          3, fnr:          4 } hitcount:          2
{ pnr:          4, fnr:          4 } hitcount:       6131
{ pnr:          6, fnr:          8 } hitcount:          1
{ pnr:          7, fnr:          8 } hitcount:         10
{ pnr:          8, fnr:          8 } hitcount:       3794
{ pnr:          1, fnr:         16 } hitcount:         36
{ pnr:          2, fnr:         16 } hitcount:         23
{ pnr:          3, fnr:         16 } hitcount:          5
{ pnr:          4, fnr:         16 } hitcount:          9
{ pnr:          5, fnr:         16 } hitcount:          8
{ pnr:          6, fnr:         16 } hitcount:          9
{ pnr:          7, fnr:         16 } hitcount:          3
{ pnr:          8, fnr:         16 } hitcount:         24
{ pnr:          9, fnr:         16 } hitcount:          2
{ pnr:         10, fnr:         16 } hitcount:          1
{ pnr:         11, fnr:         16 } hitcount:          9
{ pnr:         12, fnr:         16 } hitcount:          2
{ pnr:         13, fnr:         16 } hitcount:         27
{ pnr:         14, fnr:         16 } hitcount:          2
{ pnr:         15, fnr:         16 } hitcount:         54
{ pnr:         16, fnr:         16 } hitcount:       6673

pages: 211393
faults: 64712
pages per fault: 3.3


wp_page_copy (anon):

{ pnr:          1, fnr:          1 } hitcount:      81242
{ pnr:          4, fnr:          4 } hitcount:       5974
{ pnr:          1, fnr:          8 } hitcount:          1
{ pnr:          4, fnr:          8 } hitcount:          1
{ pnr:          8, fnr:          8 } hitcount:      12933
{ pnr:          1, fnr:         16 } hitcount:         19
{ pnr:          4, fnr:         16 } hitcount:          3
{ pnr:          8, fnr:         16 } hitcount:          7
{ pnr:         16, fnr:         16 } hitcount:       4106

pages: 274390
faults: 104286
pages per fault: 2.6


wp_page_copy (zero):

{ pnr:          1 } hitcount:     178699
{ pnr:          4 } hitcount:      14498
{ pnr:          8 } hitcount:      23644
{ pnr:         16 } hitcount:     257940

pages: 4552883
faults: 474781
pages per fault: 9.6
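
(These trace points aren't part of the submitted series. For reference, a
hypothetical sketch of what such an event might look like, using the standard
TRACE_EVENT() macro; the event and field names are invented, and the usual
trace-header include-guard boilerplate is omitted.)

    #undef TRACE_SYSTEM
    #define TRACE_SYSTEM anonfolio

    #include <linux/tracepoint.h>

    /* Hypothetical event: one record per fault handled. */
    TRACE_EVENT(anonfolio_fault,

            TP_PROTO(int pnr, int fnr),

            TP_ARGS(pnr, fnr),

            TP_STRUCT__entry(
                    __field(int, pnr)  /* pages allocated/reused/copied */
                    __field(int, fnr)  /* pages in the source folio */
            ),

            TP_fast_assign(
                    __entry->pnr = pnr;
                    __entry->fnr = fnr;
            ),

            TP_printk("pnr=%d fnr=%d", __entry->pnr, __entry->fnr)
    );

The histograms above can then be produced with a tracefs hist trigger keyed
on those fields (e.g. writing 'hist:keys=pnr,fnr' to the event's trigger
file).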

I'll have to set aside more time to digest these values :)

--
Thanks,

David / dhildenb




