Re: [LSF/MM/BPF TOPIC] Multi-sized THP performance benchmarks and analysis on ARM64

Thanks for the CC, Zi! I must admit I'm not great at following the list...


On 08/04/2024 19:56, Zi Yan wrote:
> On 8 Apr 2024, at 12:30, Matthew Wilcox wrote:
> 
>> On Thu, Apr 04, 2024 at 11:57:03AM -0700, Christoph Lameter (Ampere) wrote:
>>> On Mon, 1 Apr 2024, Jonathan Cameron wrote:
>>>
>>>> Sounds like useful data, but is it a suitable topic for LSF-MM?
>>>> What open questions etc is it raising?

I'm happy to see others looking at mTHP, and would be very keen to be involved
in any discussion. Unfortunately I won't be able to make it to LSFMM this year -
my wife is expecting a baby the same week. I'll register for online, but even
joining that is looking unlikely.

It would be great to be cc'ed on any future results you make public though. And
I'd be very happy to work more closely together to debug problems or extend
things further - feel free to reach out!

I have a roadmap of items that I believe are needed to get this to perform
optimally (see the first 2 columns of the attached slide). Only some of this is
in mainline, so it would be good to understand exactly which code you were
testing with.

>>>
>>>
>>> mTHP is new functionality that will require additional work to support more
>>> use cases. It is also unclear at this point in what usecases mTHP is useful
>>> and where no benefit can so far be seen. Also the effect of coalescing
>>> multiple PTE entries into one TLB entry is new to MM (CONT_PTE).
> 
> I think we need a clarification of CONT_PTE from Christoph.
> 
> From the context of ARM CPUs, CONT_PTE might be a group of PTEs with contiguous
> bit set. It was used by hugetlb and kernel linear mapping before Ryan added
> CONT_PTE support for mTHPs. 

Yes indeed. Note the macro "PTE_CONT" is private to the arm64 arch and is never
used directly by the core-mm. It's been around for a while and used for hugetlb
and kernel memory. So the only new use is for regular user memory (anon and page
cache). So I don't think there are any risks from HW conformance PoV, if that
was the concern.
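
To make the "group of PTEs with the contiguous bit set" concrete, here is a
rough sketch of the block geometry for each arm64 translation granule. This is
for illustration only and is not the arm64 implementation: the struct, the
constants and the helper names are made up here. The real code additionally
requires every PTE in the block to carry the same attributes and permissions.

```c
#include <stdint.h>

/* Hypothetical description of one translation granule (illustration only). */
struct contpte_geom {
	unsigned int page_shift; /* log2 of the base page size */
	unsigned int cont_shift; /* log2 of the number of PTEs per contiguous block */
};

/* PTE-level contiguous block geometry for the three arm64 granules. */
static const struct contpte_geom GRANULE_4K  = { 12, 4 }; /* 16 x 4K   = 64K */
static const struct contpte_geom GRANULE_16K = { 14, 7 }; /* 128 x 16K = 2M  */
static const struct contpte_geom GRANULE_64K = { 16, 5 }; /* 32 x 64K  = 2M  */

/* Number of PTEs that one contiguous block spans. */
static unsigned int contpte_nr_ptes(struct contpte_geom g)
{
	return 1u << g.cont_shift;
}

/* Size in bytes of the region one contiguous block covers. */
static uint64_t contpte_block_size(struct contpte_geom g)
{
	return 1ULL << (g.page_shift + g.cont_shift);
}

/*
 * A block can only use the contiguous hint if both the virtual and the
 * physical start address are naturally aligned to the block size.
 * Returns 1 if that alignment holds, 0 otherwise.
 */
static int contpte_aligned(struct contpte_geom g, uint64_t va, uint64_t pa)
{
	uint64_t mask = contpte_block_size(g) - 1;

	return ((va | pa) & mask) == 0;
}
```

So with 4K pages, a 64K mTHP that is naturally aligned in both VA and PA is
exactly one contiguous block of 16 PTEs.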

> This requires software support (setting contiguous bits)
> to be able to coalesce PTEs. But ARM also has this Hardware Page Aggregation (HPA)
> feature[1], which can coalesce PTEs without software intervention. I am not
> sure which ARM CPUs actually implement it.

All of the latest Arm-implemented cores support HPA. However, sometimes it needs
to be explicitly enabled by EL3 FW. The N1 used in Ampere Altra has it, but it
is not properly enabled and, due to errata, it is not possible to fully enable
it even with access to EL3.

Thanks,
Ryan

> 
> From the context of all CPUs, AMD has "PTE coalescing/clustering"[2] feature
> from Zen1. It is similar to ARM's HPA, not requiring software changes to
> coalesce PTEs. RISC-V also has Svnapot (Naturally-Aligned Power-of-Two
> Address-Translation Contiguity) [3], which requires software help.
> 
> So with Matthew's folio patches back in 2020, hardware-only CONT_PTE
> would work since then. But software-assist CONT_PTE just began to work
> on ARM CPUs with Ryan's cont-pte patchset for anonymous memory and page cache.
> 
>>>
>>> Ultimately it would be useful to have mTHP support also provide larger
>>> blocksize capabilities for filesystems etc. mTHP needs to mature, and an
>>> analysis of the arguably somewhat experimental state of affairs can help a
>>> lot in getting there.
>>
>> Have you been paying attention to anything that's been happening in Linux
>> development in the last three years?  7b230db3b8d3 introduced folios
>> in December 2020 (was merged in November 2021 for v5.16).  v5.17 (March
>> 2022) did everything short of enabling large folios for the page cache,
>> which landed in v5.18 (May 2022).  We started using cont-PTEs for large
>> folios in August 2023.  Again, the page cache led the way here and we're
>> just adding support for anonymous large folios (called mTHP) now.
> 
> Matthew, your cont-PTE here is "New page table range API" right? There is
> no ARM contiguous bit manipulation, right?
> 
>>
>> There's still a ton of work to do, but we've been busy doing it since
>> LSFMM in Puerto Rico (2019) with READ_ONLY_THP_FOR_FS being the very
>> first result from the group of interested developers.
>>
>> And if you haven't seen the results that Ryan Roberts has posted for
>> the tests he's run, I suggest you look them up.  He does a great job
>> of breaking down how much benefit he sees from the hardware side (use of
>> contPTE) vs the software side (shorter LRU lists, fewer atomic ops).
> 
> It is definitely helpful to distinguish hardware and software benefits,
> since not all CPUs can coalesce PTEs.
> 
> 
> [1] https://developer.arm.com/documentation/100616/0301/register-descriptions/aarch64-system-registers/cpuectlr-el1--cpu-extended-control-register--el1
> [2] https://www.eliot.so/memsys23.pdf
> [3] https://github.com/riscv/virtual-memory?tab=readme-ov-file#svnapot-naturally-aligned-power-of-two-address-translation-contiguity
> 
> --
> Best Regards,
> Yan, Zi

Attachment: folios roadmap.pdf
Description: Adobe PDF document

