On 05/07/2023 18:24, Yu Zhao wrote:
> On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> On 05/07/2023 03:07, Yu Zhao wrote:
>>> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> On 03/07/2023 20:50, Yu Zhao wrote:
>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>>> memory is suitably contiguous.
>>>>>>
>>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>>> other per-page operation costs.
>>>>>>
>>>>>> Here we add the default implementation of the function, used when the
>>>>>> architecture does not define it, which returns the order corresponding
>>>>>> to 64K.
>>>>>
>>>>> I don't really mind a non-zero default value. But people would ask why
>>>>> non-zero and why 64KB. Probably you could argue this is the large size
>>>>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
>>>>> would want to override this. I'll leave it to Fengwei to decide
>>>>> whether Intel wants a different default value.
>>>>>
>>>>> Also I don't like the vma parameter because it makes
>>>>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
>>>>> POV, the function should be only about the former; the latter should
>>>>> be decided by arch-independent MM code. However, I can live with it if
>>>>> ARM MM people think this is really what you want. ATM, I'm skeptical
>>>>> they do.
>>>>
>>>> Here's the big picture for what I'm trying to achieve:
>>>>
>>>> - In the common case, I'd like all programs to get a performance bump by
>>>> automatically and transparently using large anon folios - so no explicit
>>>> requirement on the process to opt-in.
>>>
>>> We all agree on this :)
>>>
>>>> - On arm64, in the above case, I'd like the preferred folio size to be 64K;
>>>> from the (admittedly limited) testing I've done that's about where the
>>>> performance knee is and it doesn't appear to increase the memory wastage very
>>>> much. It also has the benefits that for 4K base pages this is the contpte size
>>>> (order-4) so I can take full benefit of contpte mappings transparently to the
>>>> process. And for 16K this is the HPA size (order-2).
>>>
>>> My highest priority is to get 16KB proven first because it would
>>> benefit both client and server devices. So it may be different from
>>> yours but I don't see any conflict.
>>
>> Do you mean 16K folios on a 4K base page system
>
> Yes.
>
>> or large folios on a 16K base
>> page system? I thought your focus was on speeding up 4K base page client systems
>> but this statement has got me wondering?
>
> Sorry, I should have said 4x4KB.

OK. Be aware that a number of Arm CPUs that support HPA don't have it enabled by
default (or at least don't have it enabled in the mode that you would want it to
see best performance with large anon folios). You would need EL3 access to
reconfigure it.

>
>>>> - On arm64 when the process has marked the VMA for THP (or when
>>>> transparent_hugepage=always) but the VMA does not meet the requirements for a
>>>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
>>>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
>>>> and for 64K this is 2M (order-5). The 64K base page case is very important since
>>>> the PMD size for that base page is 512MB which is almost impossible to allocate
>>>> in practice.
>>>
>>> Which case (server or client) are you focusing on here? For our client
>>> devices, I can confidently say that 64KB has to be after 16KB, if it
>>> happens at all. For servers in general, I don't know of any major
>>> memory-intensive workloads that are not THP-aware, i.e., I don't think
>>> "VMA does not meet the requirements" is a concern.
>>
>> For the 64K base page case, the focus is server. The problem reported by our
>> partner is that the 512M huge page size is too big to reliably allocate and so
>> the faults always fall back to 64K base pages in practice. I would also speculate
>> (happy to be proved wrong) that there are many THP-aware workloads that assume
>> the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
>> huge page when running on a 64K base page system.
>
> Interesting. When you have something ready to share, I might be able
> to try it on our ARM servers as well.

That would be really helpful. I'm currently updating my branch that collates
everything to reflect the review comments in this patch set and the contpte
patch set. I'll share it in a couple of weeks.

>
>> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
>> page system is a very real requirement. Our intent is that this will be the
>> mechanism we use to enable it.
>
> Yes, contpte makes more sense for what you described. It'd fit in a
> lot better in the hugetlb case, but I guess your partner uses anon.

arm64 already supports contpte for hugetlb, but they need it to work with anon
memory using THP.
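
For reference, the folio orders and sizes quoted in this thread can be
cross-checked with a few lines of arithmetic. The following is only a
user-space sketch, not kernel code: order_for_size(), contpte_size() and
pmd_order() are hypothetical helper names, the contpte sizes are simply the
ones stated above, and pmd_order() assumes 8-byte page-table entries so that
one table page holds page_size / 8 entries.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): smallest order such that
 * (base page size << order) covers target_size.
 */
static inline unsigned int order_for_size(size_t target_size,
					  unsigned int page_shift)
{
	size_t page_size = (size_t)1 << page_shift;
	unsigned int order = 0;

	/* A folio cannot be smaller than one base page. */
	assert(target_size >= page_size);

	while ((page_size << order) < target_size)
		order++;
	return order;
}

/*
 * contpte block size per base page size, as quoted in the thread:
 * 64K for 4K pages, 2M for 16K pages, 2M for 64K pages.
 * Returns 0 for a page shift the thread does not discuss.
 */
static inline size_t contpte_size(unsigned int page_shift)
{
	switch (page_shift) {
	case 12: return (size_t)64 << 10;
	case 14: return (size_t)2 << 20;
	case 16: return (size_t)2 << 20;
	default: return 0;
	}
}

/*
 * Order of a PMD-sized mapping, assuming 8-byte table entries:
 * one table page maps (page_size / 8) base pages, i.e. order
 * page_shift - 3.
 */
static inline unsigned int pmd_order(unsigned int page_shift)
{
	return page_shift - 3;
}
```

With these helpers, order_for_size(contpte_size(12), 12) gives 4,
order_for_size(contpte_size(14), 14) gives 7 and
order_for_size(contpte_size(16), 16) gives 5, matching the orders quoted
above. pmd_order(16) is 13, i.e. 64K << 13 = 512M, which is the PMD size
described as almost impossible to allocate in practice.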