Ryan Roberts <ryan.roberts@xxxxxxx> writes:

> On 15/08/2023 22:32, Huang, Ying wrote:
>> Hi, Ryan,
>>
>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>
>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>> allocated in large folios of a determined order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management, lru list management) is also significantly
>>> reduced since those ops now become per-folio.
>>>
>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>> which defaults to disabled for now; the long term aim is for this to
>>> default to enabled, but there are some risks around internal
>>> fragmentation that need to be better understood first.
>>>
>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>> where fallback (>) is performed for various reasons, such as the
>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>
>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>> ----------------|-----------|-------------|---------------|-------------
>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>
>> IMHO, we should use the following semantics as you have suggested
>> before.
>>
>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>
>> Or even,
>>
>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint         | S         | S           | S             | THP>LAF>S
>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>
>> From the implementation point of view, PTE mapped PMD-sized THP has
>> almost no difference from LAF (just some small sized THP).  It will be
>> confusing to distinguish them from the interface point of view.
>>
>> So, IMHO, the real difference is the policy.  For example, prefer
>> PMD-sized THP, prefer small sized THP, or fully auto.  The sysfs
>> interface is used to specify system global policy.  In the long term, it
>> can be something like below,
>>
>> never:      S               # disable all THP
>> madvise:                    # never by default, control via madvise()
>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
>> small:      LAF>S           # prefer small sized THP
>> auto:                       # use in-kernel heuristics for THP size
>>
>> But we may not be ready to add new policies now.  So, before the new
>> policies are ready, we can add a debugfs interface to override the
>> original policy in /sys/kernel/mm/transparent_hugepage/enabled.  After
>> we have tuned enough workloads and collected enough data, we can add new
>> policies to the sysfs interface.
>
> I think we can all imagine many policy options. But we don't really have much
> evidence yet for what is best. The policy I'm currently using is intended to
> give some flexibility for testing (use LAF without THP by setting sysfs=never,
> use THP without LAF by compiling without LAF) without adding any new knobs at
> all.
> Given that, surely we can defer these decisions until we have more data?
>
> In the absence of data, your proposed solution sounds very sensible to me. But
> for the purposes of scaling up perf testing, I don't think it's essential given
> the current policy will also produce the same options.
>
> If we were going to add a debugfs knob, I think the higher priority would be a
> knob to specify the folio order. (but again, I would rather avoid if possible).

I totally understand we need some way to control PMD-sized THP and LAF
to tune the workload, and nobody likes debugfs knobs.

My concern about the interface is that we have no way to disable LAF
system-wide without rebuilding the kernel.  In the future, should we add
a new policy to /sys/kernel/mm/transparent_hugepage/enabled that is
stricter than "never"?  "really_never"?

--
Best Regards,
Huang, Ying

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
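
To make the difference between the two policies concrete, here is a minimal userspace sketch that encodes the table from the patch description and the first alternative table as plain decision functions. It is only an illustration of the tables quoted in this thread, not kernel code: the enum names and the functions current_policy() and proposed_policy() are hypothetical and do not exist in the patch set.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum sysfs_mode { SYSFS_NEVER, SYSFS_MADVISE, SYSFS_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/* Fallback chain per the table in the patch description. */
static const char *current_policy(bool prctl_enabled, enum sysfs_mode mode,
                                  enum vma_hint hint)
{
    /* prctl=dis or MADV_NOHUGEPAGE always yields single pages. */
    if (!prctl_enabled || hint == HINT_NOHUGEPAGE)
        return "S";
    if (mode == SYSFS_ALWAYS ||
        (mode == SYSFS_MADVISE && hint == HINT_HUGEPAGE))
        return "THP>LAF>S";
    /* Note: sysfs=never still allows LAF here. */
    return "LAF>S";
}

/* Fallback chain per the first alternative table in this thread. */
static const char *proposed_policy(bool prctl_enabled, enum sysfs_mode mode,
                                   enum vma_hint hint)
{
    /* sysfs=never now disables LAF as well as PMD-sized THP. */
    if (!prctl_enabled || hint == HINT_NOHUGEPAGE || mode == SYSFS_NEVER)
        return "S";
    if (mode == SYSFS_ALWAYS || hint == HINT_HUGEPAGE)
        return "THP>LAF>S";
    /* sysfs=madvise with no hint. */
    return "LAF>S";
}
```

The two functions differ only in the sysfs=never column: current_policy() returns "LAF>S" there, which is the case that cannot be turned off without rebuilding the kernel, while proposed_policy() returns "S".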