> If we roll a TLB invalidation routine without the trailing DSB, what
> sort of performance does that get you?

It is not as good. In some cases, it is really bad. Skipping the
invalidate was the most consistent and fast implementation.

Methodology:

We ran 6 tests on Jetson Xavier with three different implementations of
ptep_clear_flush_young: the existing version that does a TLB invalidate
and a DSB, our proposal to skip the TLB invalidate, and Will's
suggestion to just skip the DSB. The 6 tests are read and write versions
of sequential access, random access, and alternating between a fixed
page and a random page. We ran each of the (6 tests) * (3 configs) 31
times and measured the execution time.

The Jetson Xavier platform has 8 Carmel CPUs, 16 GB of DRAM, and an NVMe
drive. Carmel CPUs have a unique feature where they batch up TLB
invalidates until either the very large buffer overflows or a DSB is
executed.

Below we report statistically significant (p < .01) differences in the
mean execution time or in the run-to-run variation. There are 36
comparisons tested, so under the null hypothesis we would expect about
one of them to reach p <= 1/36 (~.028) by chance alone. Requiring
p < .01 makes false positives unlikely.

Sequential Reads:

Executing a TLB invalidate but skipping the DSB had 3.5x more
run-to-run variation than an invalidate and a DSB, and 12.3x more than
skipping both the TLB invalidate and the DSB. The run-to-run variation
when skipping the DSB was 38% of the execution time. This is likely
because of Carmel's batching of TLB invalidates until a DSB executes,
and the need to wait for the other 7 cores to complete the invalidate.
Skipping the TLB invalidate was 8% faster than executing an invalidate
and a DSB, and it had 3.5x less run-to-run variation.
Because the run-to-run variation of the implementation that executed a
TLB invalidate but not a DSB was so much higher, its execution time
could not be estimated with enough precision to say that it is
statistically different from the other two implementations.

Random Reads:

Executing a TLB invalidate but not a DSB was the fastest and had the
least run-to-run variation: it was 8% faster and had ~3x lower
run-to-run variation than either alternative. The run-to-run variation
when skipping the DSB was 1.5% of the overall execution time. Skipping
the TLB invalidate was not statistically different from the existing
implementation that does a TLB invalidate and a DSB.

Alternating Random and Hot Page Reads:

In this test, executing a TLB invalidate but not a DSB was the fastest.
It was 12% faster than an invalidate and a DSB, and 9% faster than
executing no invalidate. Similarly, skipping the invalidate was 4%
faster than executing an invalidate and a DSB (9% + 4% != 12% because of
rounding). The run-to-run variation was the lowest when executing an
invalidate and a DSB, at 1% of the execution time. That is 64% of the
variation when skipping the DSB and 22% of the variation when executing
no TLB invalidate (5% of the execution time).

This test was meant to measure the effects of a TLB not being updated
with the newly young PTE in memory. When the TLB invalidate is never
executed, the kernel almost never gets a chance to take a page fault due
to the cleared access flag. Executing an invalidate but not a DSB
probably results in the TLB usually being updated with the PTE value
before the page falls off the LRU list, so it makes sense that skipping
only the DSB is the fastest here. The cases where the hot page is
erroneously evicted are likely the reason why the variation increases
with the looser TLB invalidate implementations.

Sequential Writes:

There were no statistically significant results in this test. That is
likely because IO was limiting the write speed.
Also, the write tests had much more run-to-run variation (about 10% of
the execution time) than the read tests. In the interest of full
disclosure, the existing implementation that executes an invalidate and
a DSB was faster by 8% but didn't quite meet the requirements to be
statistically significant: its p-value was .014. Since that is less than
1/36 = .028, it is unlikely to be coincidental. But every other result
reported here has a p < .004.

Random Writes:

Skipping the invalidate was the fastest. It was 51% faster than
executing an invalidate and a DSB, and 38% faster than executing an
invalidate but not a DSB. The run-to-run variations were not
statistically different.

Alternating Random and Hot Page Writes:

Similar to random writes, skipping the invalidate was the fastest. It
was 46% faster than executing an invalidate and a DSB, and 45% faster
than executing an invalidate without a DSB. The run-to-run variations
were not statistically different.

Conclusion:

There were no statistically significant results where executing a TLB
invalidate and a DSB was the fastest. Except for the sequential write
case, where there were no significant results, it was slower by 8-50%
than the alternatives. Executing a TLB invalidate but not a DSB was
faster than not executing a TLB invalidate in the two random read cases
by about 8%. However, skipping the invalidate was faster in the random
write tests by about 40%.

The existing implementation that executes an invalidate and a DSB had
3-4x less run-to-run variation than the alternatives in the one hot page
read test. That is the strongest reason to continue fully invalidating
TLBs. However, even there the variation when skipping the invalidate was
at worst 5% of the execution time. I think that going from 1% to 5% on
that test is more than made up for by skipping the invalidates
altogether and thereby reducing the variation in the sequential read
test from 12% to 4%.
Because these are microbenchmarks that represent small parts of real
applications, I think that we should use the worst case run-to-run
variation to choose the implementation that has the least variation.
Using that metric, skipping the invalidate has a worst case of 5%
(alternating hot page read), skipping just the DSB has a worst case of
38% (sequential read), and executing an invalidate and a DSB has a worst
case of 12% (sequential read).