Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal

On 23/09/2023 01:33, John Hubbard wrote:
> On 9/22/23 08:48, Ryan Roberts wrote:
> ...
>> I never had any feedback on the below; I'm not sure if that means everyone is
>> happy or that nobody read it??
> 
> One can never really know: zero or more people read it, and of those, no
> one hated it enough to send out a quick NAK. So that's a *possible*,
> lukewarm endorsement of sorts. Success! :)

You really know how to fill a guy with confidence! ;-)

> 
> ...
> 
>> BUT I've had yet another idea on the controls front, which would enable exposing
>> this to user space as an extension to transparent_hugepage, while continuing to
>> support THP as is and also be able to control THP and ALF (anon large folio)
> 
> The new ALF / ANON_LARGE_FOLIO naming looks good to me. The grep aspect
> is a nice touch.

Well if we go the route of the newest proposal, then I guess the naming is less
important, because it all attaches to transparent_hugepage.

> 
> ...
> 
>> Add 2 controls to sysfs:
>>
>> /sys/kernel/mm/transparent_hugepage/anon_orders
>>    - bitfield where set bits are orders that will be tried during allocation
>>    - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
>>    - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
>>    - To enable ALF, set the appropriate lower bits
>>    - To disable THP, clear 1<<PMD_ORDER
>>    - (In future we could add an "auto" option too)
>>
>> /sys/kernel/mm/transparent_hugepage/anon_always_mask
>>    - orders in (anon_orders & anon_always_mask) are not subject to madvise
>>    - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
>>      as if enabled=always
>>    - defaults to 0 (all subject to madvise)
>>
> 
> I *think* I like this a lot, 

On the weight of this lukewarm endorsement, I'm going to code it up and aim to
post something for discussion at the end of this week. ;-)
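
To make the interaction of the two proposed controls concrete, here is a
minimal sketch of how they might combine with the existing enabled setting,
the madvise hints and the prctl. It is purely illustrative and is not kernel
code: the function name and argument names are hypothetical, and PMD_ORDER = 9
assumes 4K base pages with a 2M PMD.

```python
PMD_ORDER = 9  # assumption: 4K base pages, 2M PMD


def eligible_orders(anon_orders, anon_always_mask, enabled,
                    madv_hugepage=False, madv_nohugepage=False,
                    prctl_disabled=False):
    """Return the bitfield of orders that would be tried (hypothetical helper).

    enabled is the transparent_hugepage/enabled value:
    "never", "madvise" or "always". A result of 0 means order-0 only.
    """
    # prctl(PR_SET_THP_DISABLE), enabled=never and MADV_NOHUGEPAGE
    # all switch everything off.
    if prctl_disabled or enabled == "never" or madv_nohugepage:
        return 0
    # enabled=always, or an explicit MADV_HUGEPAGE hint: all configured orders.
    if enabled == "always" or madv_hugepage:
        return anon_orders
    # enabled=madvise without a hint: only the orders exempted from madvise.
    return anon_orders & anon_always_mask


# Legacy THP defaults: only the PMD order, nothing exempt from madvise.
legacy = eligible_orders(1 << PMD_ORDER, 0, "madvise")

# ALF at order 4 always on, PMD-sized THP only when madvised.
mixed = eligible_orders((1 << PMD_ORDER) | (1 << 4), 1 << 4, "madvise")
```

With the defaults, an unhinted VMA under enabled=madvise gets no large orders,
which matches the "S" cell of the legacy table. In the second case only
order 4 stays eligible until the VMA is marked with MADV_HUGEPAGE.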

> although I have some clarifying question
> below. It seems to address the key things that have been complicating
> the discussions: the API is now looking more flexible, and yet still
> easy to understand and reason about. Nice.
> 
> A couple of questions about how this works:
> 
>>
>> The defaults for those controls give you "legacy THP". But you can modify the
>> controls to generate policies like this:
>>
>>
> 
> For these tables, a small key or legend would help. I've forgotten already
> what "S" means, and am also vague about exactly what "THP>ALF>S" behavior
> means, too.

THP:
    transparent hugepage allocation; specifically PMD sized/aligned/mapped.

ALF:
    anonymous large folio allocation; specifically some order in the range
    [1, PMD_ORDER-1]. Always PTE-mapped.

S:
    single page allocation; order-0, always PTE-mapped.

I've found these discrete logical buckets useful for thinking about the problem,
although the implementation doesn't always treat them completely separately (S
is just a final fallback order in ALF's list of orders to try) and the new
proposal exposes both THP and ALF through a unified THP interface.

The '>' indicates 'fallback'. Fallback happens for a few different reasons: the
VMA is too small to contain the proposed folio order, or some PTEs that would
be covered by the new folio are already populated, etc. ALF usually isn't just
a single order either - it has a list of orders that it will try.

Possibly all a bit confusing, but this is the nomenclature I've been using in
the context of all the discussions so far and wanted to try to keep things
comparable.
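
As an illustration of the nomenclature, the sketch below turns a bitfield of
eligible orders into the list of orders to try and the matching "THP>ALF>S"
style label. It is a simplification under stated assumptions: only the "VMA is
too small" fallback reason is modelled (alignment and already-populated PTEs
are ignored), PMD_ORDER = 9 assumes 4K pages, and both helper names are
hypothetical.

```python
PMD_ORDER = 9  # assumption: 4K base pages, 2M PMD


def orders_to_try(eligible, vma_pages):
    """List orders from highest to lowest, always ending with order-0.

    eligible is a bitfield of orders; vma_pages is the VMA size in base pages.
    """
    orders = [order for order in range(PMD_ORDER, 0, -1)
              if eligible & (1 << order) and (1 << order) <= vma_pages]
    orders.append(0)  # S is the final fallback
    return orders


def bucket_label(orders):
    """Collapse a list of orders into THP / ALF / S buckets joined by '>'."""
    labels = []
    for order in orders:
        if order == PMD_ORDER:
            name = "THP"
        elif order > 0:
            name = "ALF"
        else:
            name = "S"
        if not labels or labels[-1] != name:
            labels.append(name)
    return ">".join(labels)


mask = (1 << PMD_ORDER) | (1 << 4) | (1 << 2)
big_vma = orders_to_try(mask, 1024)    # VMA large enough for a PMD-sized folio
small_vma = orders_to_try(mask, 16)    # VMA too small for the PMD order
```

Here bucket_label(big_vma) gives "THP>ALF>S", while the 16-page VMA skips the
PMD order and yields "ALF>S".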


> 
>> THP only - existing behaviour (default):
>> ----------------------------------------
>>
>> anon_orders = 1<<PMD_ORDER
>> anon_always_mask = 0
>>
>> thp prctl:            | dis       | ena       | ena       | ena
> 
> All I see in the prctl(2) man page is PR_SET_THP_DISABLE, I don't
> see any _ENABLE. What does the above refer to?

dis: PR_SET_THP_DISABLE with arg2=1 (thp disabled via prctl)
ena: PR_SET_THP_DISABLE with arg2=0 (thp not disabled via prctl)

I was trying to illustrate that ALF is now also affected by this prctl. With the
previous proposal it was independent of THP and therefore independent of this
prctl. Of course it would still be _possible_ to ignore this control for the ALF
orders, but I think that risks being very confusing for users.

> 
> 
>> thp sysfs:            | X         | never     | madvise   | always
>> ----------------------|-----------|-----------|-----------|-------------
>> no hint               | S         | S         | S         | THP>S
>> MADV_HUGEPAGE         | S         | S         | THP>S     | THP>S
>> MADV_NOHUGEPAGE       | S         | S         | S         | S
>>
>>
> ...
>>
>> It does have the disadvantage that ALF is tied to MADV_HUGEPAGE, whereas the
> 
> Right, that is a little awkward. But maybe less so now, with this new proposal,
> which leaves THP a little closer to ALF.

Indeed, this approach makes it clearer/easier for users to understand, because
conceptually we are just introducing a wider set of folio sizes that THP can use
and all the existing THP controls continue to mean what they always meant.

The only risk I see is if there are workloads that want to use both (PMD) THP
and ALF, but in different VMAs, and they absolutely do not want the possibility
of ALF in the (PMD) THP area if THP fails, and instead always want to fall back
to Single allocations for that VMA. But that sounds very niche to me, and would
be better solved by the additional (future) introduction of a set of allowed
orders that can be attached to a specific VMA.


There are a couple of other wrinkles that I didn't highlight in my first mail:

- khugepaged will continue to work only on PMD-sized THP. It will ignore the new
  ALF orders. This was always the plan, but if we expose the ALF functionality
  to user space through the THP interface, does that make it confusing? I don't
  think it's a big issue personally. And we can always enhance khugepaged to
  work on <PMD_ORDER folios later if we find a compelling reason.

- We will want to name new counters following THP naming, not large folio. I
  propose that the existing AnonHugePages type counters will count ALL THP (i.e.
  PMD order and ALF orders), and additionally add 2 new counters for PMD-mapped
  and PTE-mapped, which should sum to the value in the original counter.
  Hopefully that makes things clear while retaining back compat.
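
The intended relationship between the counters can be sketched as follows.
This is only an illustration of the invariant (the total equals the sum of the
two new counters); the "pmd_mapped" and "pte_mapped" names are placeholders,
not proposed counter names, and PAGE_KB = 4 assumes 4K base pages.

```python
PAGE_KB = 4  # assumption: 4K base pages


def thp_counters(folios):
    """Compute counter values in kB from (order, pmd_mapped) pairs.

    Only large folios (order >= 1) are counted; order-0 pages are skipped.
    """
    pmd_kb = sum(PAGE_KB << order
                 for order, pmd_mapped in folios
                 if order > 0 and pmd_mapped)
    pte_kb = sum(PAGE_KB << order
                 for order, pmd_mapped in folios
                 if order > 0 and not pmd_mapped)
    return {
        "AnonHugePages": pmd_kb + pte_kb,  # all THP, PMD order and ALF orders
        "pmd_mapped": pmd_kb,              # placeholder name
        "pte_mapped": pte_kb,              # placeholder name
    }


# One PMD-mapped 2M folio, two PTE-mapped order-4 folios, one single page.
counters = thp_counters([(9, True), (4, False), (4, False), (0, False)])
```

In this example the PMD-mapped folio accounts for 2048 kB and the two order-4
folios for 128 kB, so the combined counter reads 2176 kB.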


> 
> 
> thanks,




