On 8/31/2023 3:57 PM, David Hildenbrand wrote:
> On 31.08.23 03:40, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>
>>> On 15/08/2023 22:32, Huang, Ying wrote:
>>>> Hi, Ryan,
>>>>
>>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>>
>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>> allocated in large folios of a determined order. All pages of the large
>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>> counting, rmap management, lru list management) are also significantly
>>>>> reduced since those ops now become per-folio.
>>>>>
>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>> which defaults to disabled for now; The long term aim is for this to
>>>>> default to enabled, but there are some risks around internal
>>>>> fragmentation that need to be better understood first.
>>>>>
>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>> where fallback (>) is performed for various reasons, such as the
>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>
>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>> IMHO, we should use the following semantics as you have suggested
>>>> before.
>>>>
>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>> ----------------|-----------|-------------|---------------|-------------
>>>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>> Or even,
>>>>
>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>> ----------------|-----------|-------------|---------------|-------------
>>>> no hint         | S         | S           | S             | THP>LAF>S
>>>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>
>>>> From the implementation point of view, PTE mapped PMD-sized THP has
>>>> almost no difference with LAF (just some small sized THP). It will be
>>>> confusing to distinguish them from the interface point of view.
>>>>
>>>> So, IMHO, the real difference is the policy. For example, prefer
>>>> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
>>>> interface is used to specify system global policy. In the long term, it
>>>> can be something like below,
>>>>
>>>> never:      S               # disable all THP
>>>> madvise:                    # never by default, control via madvise()
>>>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
>>>> small:      LAF>S           # prefer small sized THP
>>>> auto:                       # use in-kernel heuristics for THP size
>>>>
>>>> But it may be not ready to add new policies now. So, before the new
>>>> policies are ready, we can add a debugfs interface to override the
>>>> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
>>>> we have tuned enough workloads, collected enough data, we can add new
>>>> policies to the sysfs interface.
>>>
>>> I think we can all imagine many policy options. But we don't really have much
>>> evidence yet for what is best.
>>> The policy I'm currently using is intended to
>>> give some flexibility for testing (use LAF without THP by setting sysfs=never,
>>> use THP without LAF by compiling without LAF) without adding any new knobs at
>>> all. Given that, surely we can defer these decisions until we have more data?
>>>
>>> In the absence of data, your proposed solution sounds very sensible to me. But
>>> for the purposes of scaling up perf testing, I don't think it's essential given
>>> the current policy will also produce the same options.
>>>
>>> If we were going to add a debugfs knob, I think the higher priority would be a
>>> knob to specify the folio order. (but again, I would rather avoid if possible).
>>
>> I totally understand we need some way to control PMD-sized THP and LAF
>> to tune the workload, and nobody likes debugfs knob.
>>
>> My concern about interface is that we have no way to disable LAF
>> system-wide without rebuilding the kernel. In the future, should we add
>> a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
>> stricter than "never"? "really_never"?
>
> Let's talk about that in a bi-weekly MM session. (I proposed it as a topic for next week).

The time slot of the meeting is not friendly to our timezone; it's 1 or 2 AM here.
I know it's very hard to find a good time slot for the US, EU and Asia. :(

So maybe we still need to discuss it by mail?


Regards
Yin, Fengwei

>
> As raised in another mail, we can then discuss
> * how we want to call this feature (transparent large pages? there is
>   the concern that "THP" might confuse users. Maybe we can consider
>   "large" the more generic version and "huge" only PMD-size, TBD)
> * how to expose it in stats towards the user (e.g., /proc/meminfo)
> * which minimal toggles we want
>
> I think there *really* has to be a way to disable it for a running system, otherwise no distro will dare pulling it in, even after we figured out the other stuff.
>
> Note that for the pagecache, large folios can be disabled and distributions are actively making use of that.
>