On Thu, Aug 31, 2023 at 12:57 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 31.08.23 03:40, Huang, Ying wrote:
> > Ryan Roberts <ryan.roberts@xxxxxxx> writes:
> >
> >> On 15/08/2023 22:32, Huang, Ying wrote:
> >>> Hi, Ryan,
> >>>
> >>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
> >>>
> >>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> >>>> allocated in large folios of a determined order. All pages of the large
> >>>> folio are pte-mapped during the same page fault, significantly reducing
> >>>> the number of page faults. The number of per-page operations (e.g. ref
> >>>> counting, rmap management, lru list management) are also significantly
> >>>> reduced since those ops now become per-folio.
> >>>>
> >>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> >>>> which defaults to disabled for now; The long term aim is for this to
> >>>> default to enabled, but there are some risks around internal
> >>>> fragmentation that need to be better understood first.
> >>>>
> >>>> Large anonymous folio (LAF) allocation is integrated with the existing
> >>>> (PMD-order) THP and single (S) page allocation according to this policy,
> >>>> where fallback (>) is performed for various reasons, such as the
> >>>> proposed folio order not fitting within the bounds of the VMA, etc:
> >>>>
> >>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>>> ----------------|-----------|-------------|---------------|-------------
> >>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
> >>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
> >>>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> IMHO, we should use the following semantics as you have suggested
> >>> before.
> >>>
> >>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>> ----------------|-----------|-------------|---------------|-------------
> >>> no hint         | S         | S           | LAF>S         | THP>LAF>S
> >>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> >>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> Or even,
> >>>
> >>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>> ----------------|-----------|-------------|---------------|-------------
> >>> no hint         | S         | S           | S             | THP>LAF>S
> >>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> >>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> From the implementation point of view, PTE mapped PMD-sized THP has
> >>> almost no difference with LAF (just some small sized THP). It will be
> >>> confusing to distinguish them from the interface point of view.
> >>>
> >>> So, IMHO, the real difference is the policy. For example, prefer
> >>> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
> >>> interface is used to specify system global policy. In the long term, it
> >>> can be something like below,
> >>>
> >>> never:      S               # disable all THP
> >>> madvise:                    # never by default, control via madvise()
> >>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
> >>> small:      LAF>S           # prefer small sized THP
> >>> auto:                       # use in-kernel heuristics for THP size
> >>>
> >>> But it may be not ready to add new policies now. So, before the new
> >>> policies are ready, we can add a debugfs interface to override the
> >>> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
> >>> we have tuned enough workloads, collected enough data, we can add new
> >>> policies to the sysfs interface.
> >>
> >> I think we can all imagine many policy options. But we don't really have much
> >> evidence yet for what is best.
> >> The policy I'm currently using is intended to give some flexibility for
> >> testing (use LAF without THP by setting sysfs=never, use THP without LAF by
> >> compiling without LAF) without adding any new knobs at all. Given that,
> >> surely we can defer these decisions until we have more data?
> >>
> >> In the absence of data, your proposed solution sounds very sensible to me. But
> >> for the purposes of scaling up perf testing, I don't think it's essential given
> >> the current policy will also produce the same options.
> >>
> >> If we were going to add a debugfs knob, I think the higher priority would be a
> >> knob to specify the folio order. (but again, I would rather avoid if possible).
> >
> > I totally understand we need some way to control PMD-sized THP and LAF
> > to tune the workload, and nobody likes debugfs knob.
> >
> > My concern about interface is that we have no way to disable LAF
> > system-wise without rebuilding the kernel. In the future, should we add
> > a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
> > stricter than "never"? "really_never"?
>
> Let's talk about that in a bi-weekly MM session. (I proposed it as a
> topic for next week).
>
> As raised in another mail, we can then discuss
> * how we want to call this feature (transparent large pages? there is
>   the concern that "THP" might confuse users. Maybe we can consider
>   "large" the more generic version and "huge" only PMD-size, TBD)

I tend to agree. "Huge" means PMD-mappable (transparent or HugeTLB),
"Large" means any order but less than PMD-mappable order, "Gigantic"
means PUD-mappable. This should incur the least confusion IMHO.

> * how to expose it in stats towards the user (e.g., /proc/meminfo)

I recall suggesting new statistics for each order, but that was NAK'ed.
> * which minimal toggles we want
>
> I think there *really* has to be a way to disable it for a running
> system, otherwise no distro will dare pulling it in, even after we
> figured out the other stuff.

TBH I really don't like tying large folios to the THP toggles. THP
(PMD-mappable) is just a special case of LAF. Ideally, large folios
should be tried whenever possible. But I do agree we may not be able to
achieve the ideal case for the time being, and I also understand the
concern about regressions in early adoption, so a knob that can disable
large folios may be needed for now. But it should be just a simple binary
knob (on/off), and should not be a part of the kernel ABI (temporary and
debugging only) IMHO.

One more thing we may discuss is whether the huge page madvise APIs
should take effect for large folios or not.

>
> Note that for the pagecache, large folios can be disabled and
> distributions are actively making use of that.
>
> --
> Cheers,
>
> David / dhildenb
>
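
For reference, the first policy table in the quoted patch description can
be read as a small decision function. The sketch below is only an
illustrative model in Python, not kernel code: the function name
fallback_chain and the string values used for the sysfs setting and the
madvise hint are made up for this example.

```python
# Illustrative model of the first policy table in the quoted patch
# description. Nothing here is kernel API: fallback_chain and the
# string values below are hypothetical names chosen for the example.

SYSFS_VALUES = ("never", "madvise", "always")
HINTS = ("none", "hugepage", "nohugepage")


def fallback_chain(prctl_enabled, sysfs, hint):
    """Return the allocation attempts in order, e.g. ["THP", "LAF", "S"]."""
    if sysfs not in SYSFS_VALUES:
        raise ValueError("unknown sysfs setting: %r" % (sysfs,))
    if hint not in HINTS:
        raise ValueError("unknown madvise hint: %r" % (hint,))

    # prctl=dis column and MADV_NOHUGEPAGE row: single pages only.
    if not prctl_enabled or hint == "nohugepage":
        return ["S"]

    # PMD-order THP is tried first with sysfs=always, or with
    # sysfs=madvise when the VMA carries MADV_HUGEPAGE.
    try_thp = sysfs == "always" or (sysfs == "madvise" and hint == "hugepage")

    # In this table LAF is attempted in every remaining cell,
    # including sysfs=never.
    return (["THP"] if try_thp else []) + ["LAF", "S"]
```

The two alternative tables proposed in the quoted reply differ only in
which cells still attempt LAF (for example, sysfs=never falling back to S
only), so they would change the final return statement rather than the
THP condition.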