On 23/09/2023 01:33, John Hubbard wrote:
> On 9/22/23 08:48, Ryan Roberts wrote:
> ...
>> I never had any feedback on the below; I'm not sure if that means everyone is
>> happy or that nobody read it??
>
> One can never really know: zero or more people read it, and of those, no
> one hated it enough to send out a quick NAK. So that's a *possible*,
> lukewarm endorsement of sorts. Success! :)

You really know how to fill a guy with confidence! ;-)

>
> ...
>
>> BUT I've had yet another idea on the controls front, which would enable exposing
>> this to user space as an extension to transparent_hugepage, while continuing to
>> support THP as is and also be able to control THP and ALF (anon large folio)
>
> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
> is a nice touch.

Well if we go the route of the newest proposal, then I guess the naming is less
important, because it all attaches to transparent_hugepage.

>
> ...
>
>> Add 2 controls to sysfs:
>>
>> /sys/kernel/mm/transparent_hugepage/anon_orders
>>   - bitfield where set bits are orders that will be tried during allocation
>>   - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
>>   - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
>>   - To enable ALF, set the appropriate lower bits
>>   - To disable THP, clear 1<<PMD_ORDER
>>   - (In future we could add an "auto" option too)
>>
>> /sys/kernel/mm/transparent_hugepage/anon_always_mask
>>   - orders in (anon_orders & anon_always_mask) are not subject to madvise
>>   - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
>>     as if enabled=always
>>   - defaults to 0 (all subject to madvise)
>>
>
> I *think* I like this a lot,

On the weight of this lukewarm endorsement, I'm going to code it up and aim to
post something for discussion at the end of this week. ;-)

> although I have some clarifying question
> below.
> It seems to address the key things that have been complicating
> the discussions: the API is now looking more flexible, and yet still
> easy to understand and reason about. Nice.
>
> A couple of questions about how this works:
>
>>
>> The defaults for those controls give you "legacy THP". But you can modify the
>> controls to generate policies like this:
>>
>>
>
> For these tables, a small key or legend would help. I've forgotten already
> what "S" means, and am also vague about exactly what "THP>ALF>S" behavior
> means, too.

THP: transparent hugepage allocation; specifically PMD sized/aligned/mapped.
ALF: anonymous large folio allocation; specifically some order between
     [PMD_ORDER-1, 1]. Always PTE-mapped.
S:   single page allocation; order-0, always PTE-mapped.

I've found these discrete logical buckets useful for thinking about the problem,
although the implementation doesn't always treat them completely separately (S
is just a final fallback order in ALF's list of orders to try) and the new
proposal exposes both THP and ALF through a unified THP interface.

The '>' indicates 'fallback'. Fallback happens for a few different reasons; the
VMA is too small to contain the proposed folio order, or some PTEs that would be
covered by the new folio are already populated, etc. ALF usually isn't just a
single order either - it has a list of orders that it will try.

Possibly all a bit confusing, but this is the nomenclature I've been using in
the context of all the discussions so far and I wanted to try to keep things
comparable.

>
>> THP only - existing behaviour (default):
>> ----------------------------------------
>>
>> anon_orders = 1<<PMD_ORDER
>> anon_always_mask = 0
>>
>> thp prctl:            |    dis    |    ena    |    ena    |    ena
>
> All I see in the prctl(2) man page is PR_SET_THP_DISABLE, I don't
> see any _ENABLE. What does the above refer to?

dis: PR_SET_THP_DISABLE with arg2=1 (thp disabled via prctl)
ena: PR_SET_THP_DISABLE with arg2=0 (thp not disabled via prctl)

I was trying to illustrate that ALF is now also affected by this prctl. With the
previous proposal it was independent of THP and therefore independent of this
prctl. Of course it would still be _possible_ to ignore this control for the ALF
orders, but I think that risks being very confusing for users.

>
>
>> thp sysfs:            |     X     |   never   |  madvise  |   always
>> ----------------------|-----------|-----------|-----------|-------------
>> no hint               |     S     |     S     |     S     |    THP>S
>> MADV_HUGEPAGE         |     S     |     S     |   THP>S   |    THP>S
>> MADV_NOHUGEPAGE       |     S     |     S     |     S     |      S
>>
>>
> ...
>>
>> It does have the disadvantage that ALF is tied to MADV_HUGEPAGE, whereas the
>
> Right, that is a little awkward. But maybe less so now, with this new proposal,
> which leaves THP a little closer to ALF.

Indeed, this approach makes it clearer/easier for users to understand, because
conceptually we are just introducing a wider set of folio sizes that THP can use
and all the existing THP controls continue to mean what they always meant.

The only risk I see is if there are workloads that want to use both (PMD) THP
and ALF, but in different VMAs, and they absolutely do not want the possibility
of ALF in the (PMD) THP area if THP fails, and instead always want to fall back
to single page allocations for that VMA. But that sounds very niche to me. And
it would be better solved by the additional (future) introduction of a set of
allowed orders that can be attached to a specific VMA.

There are a couple of other wrinkles that I didn't highlight in my first mail:

  - khugepaged will continue to work only on PMD-sized THP. It will ignore the
    new ALF orders. This was always the plan, but if we expose the ALF
    functionality through the THP interface to user space, does that make it
    confusing? I don't think it's a big issue personally. And we can always
    enhance khugepaged to work on <PMD_ORDER folios later if we find a
    compelling reason.

  - We will want to name new counters following THP naming, not large folio. I
    propose that the existing AnonHugePages type counters will count ALL THP
    (i.e. PMD order and ALF orders), and additionally add 2 new counters for
    PMD-mapped and PTE-mapped, which should sum to the value in the original
    counter. Hopefully that makes things clear while retaining back compat.

>
>
> thanks,
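
FWIW, here is roughly how I picture anon_orders and anon_always_mask combining
with the existing prctl, sysfs "enabled" and madvise controls. This is only an
illustrative user-space sketch, not the code I intend to post: the helper name,
the two enums and PMD_ORDER=9 (4K pages, 2M PMD) are all made up for the
example. I've also assumed that the prctl and MADV_NOHUGEPAGE always win, which
matches the default table above but is not spelled out in the description of
anon_always_mask.

```c
#include <stdbool.h>

/* Assumption for illustration only: 4K base pages, so a PMD maps 2M. */
#define PMD_ORDER 9

/* Hypothetical stand-ins for the sysfs "enabled" setting and the VMA hint. */
enum thp_enabled { THP_NEVER, THP_MADVISE, THP_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/*
 * Hypothetical helper: return the bitfield of non-zero orders that may be
 * tried for an anonymous fault. A result of 0 means "S" in the tables, i.e.
 * go straight to an order-0 allocation.
 */
static unsigned long anon_orders_to_try(unsigned long anon_orders,
					unsigned long anon_always_mask,
					bool prctl_disabled,
					enum thp_enabled enabled,
					enum vma_hint hint)
{
	/* Assumed: these three conditions override everything else. */
	if (prctl_disabled || enabled == THP_NEVER || hint == HINT_NOHUGEPAGE)
		return 0;

	/* enabled=always, or the VMA opted in: every enabled order is allowed. */
	if (enabled == THP_ALWAYS || hint == HINT_HUGEPAGE)
		return anon_orders;

	/* enabled=madvise without a hint: only the "always" subset applies. */
	return anon_orders & anon_always_mask;
}
```

With the defaults (anon_orders = 1<<PMD_ORDER, anon_always_mask = 0) this
reproduces the "THP only" table. If instead anon_orders also has bit 4 set and
anon_always_mask = 1<<4, then a VMA with no hint under enabled=madvise would
still try order-4 before falling back to S, while PMD-sized THP would remain
subject to MADV_HUGEPAGE.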