Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()

Ryan Roberts <ryan.roberts@xxxxxxx> · Thu, 16 Nov 2023 09:34:56 +0000

On 15/11/2023 21:37, Andrew Morton wrote:
> On Wed, 15 Nov 2023 16:30:05 +0000 Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> 
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork
> 
> Do you have a feeling for how much performance improved due to this? 

The commit log for patch 13 (the one which implements ptep_set_wrprotects() for
armt64) has performance numbers for a fork() microbenchmark with/without the
optimization:

---8<---

I see huge performance regression when PTE_CONT support was added, then
the regression is mostly fixed with the addition of this change. The
following shows regression relative to before PTE_CONT was enabled
(bigger negative value is bigger regression):

|   cpus |   before opt |   after opt |
|-------:|-------------:|------------:|
|      1 |       -10.4% |       -5.2% |
|      8 |       -15.4% |       -3.5% |
|     16 |       -38.7% |       -3.7% |
|     24 |       -57.0% |       -4.4% |
|     32 |       -65.8% |       -5.4% |

---8<---

Note that's running on Ampere Altra, where TLBI tends to have high cost.

> 
> Are there other architectures which might similarly benefit?  By
> implementing ptep_set_wrprotects(), it appears.  If so, what sort of
> gains might they see?

The rationale for this is to reduce expense for arm64 to manage
contpte-mappings. If other architectures support contpte-mappings then they
could benefit from this API for the same reasons that arm64 benefits. I have a
vague understanding that riscv has a similar concept to the arm64's contiguous
bit, so perhaps they are a future candidate. But I'm not familiar with the
details of the riscv feature so couldn't say whether they would be likely to see
the same level of perf improvement as arm64.

Thanks,
Ryan