Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64

David Hildenbrand <david@xxxxxxxxxx> · Wed, 16 Oct 2024 17:16:42 +0200

Performance Testing
===================

I've run some limited performance benchmarks:

First, a real-world benchmark that causes a lot of page table manipulation (and
where we would therefore expect to see a regression, if we are going to see one
anywhere): kernel compilation. It barely registers a change. Values are times,
so smaller is better. All relative to base-4k:

|             |    kern |    kern |    user |    user |    real |    real |
| config      |    mean |   stdev |    mean |   stdev |    mean |   stdev |
|-------------|---------|---------|---------|---------|---------|---------|
| base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
| compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
| boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |

The Speedometer JavaScript benchmark also shows no change. Values are runs per
min, so bigger is better. All relative to base-4k:

| config      |    mean |   stdev |
|-------------|---------|---------|
| base-4k     |    0.0% |    0.8% |
| compile-4k  |    0.4% |    0.8% |
| boot-4k     |    0.0% |    0.9% |

Finally, I've run some microbenchmarks known to stress page table manipulations
(originally from David Hildenbrand). The fork test maps/allocs 1G of anon
memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
memory then measures the cost of munmap()ing it. The fork test is known to be
extremely sensitive to any changes that cause instructions to be aligned
differently in cachelines. When using this test for other changes, I've seen
double-digit regressions for the slightest thing, so a 12% regression on this
test is actually fairly good. This likely represents the extreme worst case for
regressions that will be observed across other microbenchmarks (famous last
words). Values are times, so smaller is better. All relative to base-4k:

... and here I am, worrying about much smaller degradations in these
micro-benchmarks ;) You're right, these are pure micro-benchmarks, and
while 12% does sound like a lot, even trivial compiler code movement can
result in such changes in the fork() micro-benchmark.
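
For anyone who has not seen the fork/munmap tests described above, here
is a minimal sketch of the idea. It is not the original benchmark code:
Python is used only for brevity, the function names are made up for
illustration, and interpreter overhead ends up in the timings, so the
absolute numbers are not comparable to the ones quoted. It also assumes
Linux and does nothing about THP, which would change how many page
table entries get populated.

```python
import mmap
import os
import time


def map_and_touch(size):
    """Map `size` bytes of private anonymous memory and fault in every page."""
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    # Write one byte per page so each page is actually allocated and mapped.
    for off in range(0, size, mmap.PAGESIZE):
        buf[off] = 1
    return buf


def time_fork(size):
    """Return the seconds fork() takes in the parent with `size` bytes mapped."""
    buf = map_and_touch(size)
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, without running cleanup handlers
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    buf.close()
    return elapsed


def time_munmap(size):
    """Return the seconds it takes to unmap `size` bytes of touched memory."""
    buf = map_and_touch(size)
    start = time.perf_counter()
    buf.close()  # mmap.close() unmaps the region
    return time.perf_counter() - start


if __name__ == "__main__":
    one_gib = 1 << 30
    print(f"fork:   {time_fork(one_gib):.6f} s")
    print(f"munmap: {time_munmap(one_gib):.6f} s")
```

The mapping is private and anonymous on purpose: that is the case where
fork() has to copy the page tables, which is what the test stresses.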

So I think this is just fine, and actually "surprisingly" small. And
there is still the option to select the page size at compile time and
not worry about that at all.

As discussed ahead of time, I consider this change very valuable. In
RHEL, the biggest issue is actually the test matrix, which cannot really
be reduced significantly ... but it will make shipping/packaging easier.

CCing Don, who did the separate 64k RHEL flavor kernel.

--
Cheers,

David / dhildenb