Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64

Ryan Roberts <ryan.roberts@xxxxxxx> · Wed, 26 Jun 2024 11:47:10 +0100

On 25/06/2024 19:11, Christoph Lameter (Ampere) wrote:
> On Tue, 25 Jun 2024, Ryan Roberts wrote:
> 
>> But I also want to raise a more general point; We are not done with the
>> optimizations yet. contpte can also improve performance for iTLB, but this
>> requires a change to the page cache to store text in (at least) 64K folios.
>> Typically the iTLB is under a lot of pressure and this can help reduce it. This
>> change is not in mainline yet (and I still need to figure out how to make the
>> patch acceptable), but is worth another ~1.5% for the 4KPS case. I suspect this
>> will also move the needle on the other benchmarks you ran. See [3] - I'd
>> appreciate any thoughts you have on how to get something like this accepted.
>>
>> [3] https://lore.kernel.org/lkml/20240111154106.3692206-1-ryan.roberts@xxxxxxx/
> 
> The discussion here seems to indicate that readahead is already ok for order-2
> (16K mTHP size?). So this is only for 64K mTHP on 4K?

Kind of; for filesystems that report support for large folios, readahead starts
with an order-2 folio, then increases the folio order by 2 for every subsequent
readahead marker that is hit. But text is rarely accessed sequentially, so
readahead markers are rarely hit in practice and therefore all the text folios
tend to end up as order-2 (16K for 4K base pages).
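
To make the ramp concrete, here is a rough sketch of the order progression
described above. It is a simplified model, not the kernel's readahead code; the
function names are made up for illustration, and the max_order cap of 9 is an
assumption standing in for whatever limit the page cache applies.

```python
def readahead_folio_order(markers_hit, start_order=2, step=2, max_order=9):
    """Folio order after `markers_hit` readahead markers have been hit.

    Simplified model: start at order-2 and add 2 orders per marker.
    max_order is an assumed cap, not taken from the kernel source.
    """
    return min(start_order + step * markers_hit, max_order)


def folio_size(order, base_page_size=4096):
    """Folio size in bytes for a given order and base page size."""
    return base_page_size << order


if __name__ == "__main__":
    # Text usually stays at 0 markers hit, i.e. order-2 (16K with 4K pages).
    for hits in range(4):
        order = readahead_folio_order(hits)
        print(f"markers={hits} order={order} size={folio_size(order) // 1024}K")
```

With no markers hit, the model stays at order-2, which is why text ends up in
16K folios rather than reaching the 64K (order-4) needed for contpte.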

But the important bit is that the filesystem needs to support large folios in
the first place; without that, we are always stuck using small (order-0) folios.
XFS and a few other (network) filesystems support large folios today, but ext4
doesn't, although that's being worked on.

> 
> From what I read in the ARM64 manuals it seems that CONT_PTE can only be used
> for 64K mTHP on 4K kernels. The 16K case will not benefit from CONT_PTE nor any
> other intermediate size than 64K.

Yes and no. The contiguous hint, when applied, covers a single fixed size, and
that size depends on the base page size. It's 64K for 4KPS, 2M for 16KPS and
2M for 64KPS.
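
Those sizes fall out of the number of PTEs that the contiguous bit groups at
the last translation level for each granule. A small sketch of the arithmetic
(the table and function names are illustrative only; the entry counts of 16,
128 and 32 are the architectural values for the 4K, 16K and 64K granules):

```python
# Number of last-level entries covered by one contiguous-hint block,
# keyed by base page size in bytes.
CONT_PTE_ENTRIES = {
    4 * 1024: 16,    # 4K granule
    16 * 1024: 128,  # 16K granule
    64 * 1024: 32,   # 64K granule
}


def contpte_span(base_page_size):
    """Bytes mapped by one contiguous-hint block at the PTE level."""
    return base_page_size * CONT_PTE_ENTRIES[base_page_size]


if __name__ == "__main__":
    for size in CONT_PTE_ENTRIES:
        print(f"{size // 1024}K pages -> {contpte_span(size) // 1024}K contpte block")
```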

However, most modern Arm-designed CPUs support a micro-architectural feature
called Hardware Page Aggregation (HPA), which can aggregate up to 4 pages into a
single TLB entry in a way that is transparent to SW. So that feature can benefit
from 16K folios when using 4K base pages. Although HPA is implemented in the
Neoverse N1 CPU (which is what I believe is in the Ampere Altra), it is disabled
and, due to an erratum, can't be enabled. So HPA is not relevant for Altra.

> 
> Quoting:
> 
> https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Virtual-Memory-System-Architecture--VMSA-/Memory-region-attributes/Long-descriptor-format-memory-region-attributes?lang=en#BEIIBEIJ

Note this link is for Armv7-A, not Armv8-A. But hopefully my explanation above
answers everything.

Thanks,
Ryan

> 
> "Contiguous hint
> 
> The Long-descriptor translation table format descriptors contain a Contiguous
> hint bit. Setting this bit to 1 indicates that 16 adjacent translation table
> entries point to a contiguous output address range. These 16 entries must be
> aligned in the translation table so that the top 5 bits of their input
> addresses, that index their position in the translation table, are the same. For
> example, referring to Figure 12.21, to use this hint for a block of 16 entries
> in the third-level translation table, bits[20:16] of the input addresses for the
> 16 entries must be the same.
> 
> The contiguous output address range must be aligned to size of 16 translation
> table entries at the same translation table level.
> 
> Use of this hint means that the TLB can cache a single entry to cover the 16
> translation table entries.
> 
> This bit is only a hint bit. The architecture does not require a processor to
> cache TLB entries in this way. To avoid TLB coherency issues, any TLB
> maintenance by address must not assume any optimization of the TLB tables that
> might result from use of the hint bit.
>