Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

Ryan Roberts <ryan.roberts@xxxxxxx> · Fri, 27 Oct 2023 13:27:27 +0100

On 26/10/2023 16:19, David Hildenbrand wrote:
> [...]
> 
>>>> Hi,
>>>>
>>>> I wanted to remind people in the THP cabal meeting, but that either
>>>> didn't happen or Zoom decided to not let me join :)
>>
>> I didn't make it yesterday either - was having to juggle child care.
> 
> I think it didn't happen, or started quite late (>20 min).
> 
>>
>>>>
>>>>>
>>>>> It's been a week since the mm alignment meeting discussion we had around
>>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>>> fatal flaws in it :).
>>>>
>>>> After saying in the call probably 10 times that people should comment
>>>> here if there are reasonable alternatives worth discussing, call me
>>>> "optimistic" as well; but, it's only been a week and people might still
>>>> be thinking about this.
>>>>
>>>> There were two things discussed in the call:
>>>>
>>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>>     in the call, this (a) might not be needed right now in an initial
>>>>     version; (b) the kernel might be able to handle that (or many cases)
>>>>     automatically, TBD. Adding lists now would kind-of set the semantics
>>>>     of that interface in stone. As you describe below, the approach
>>>>     discussed here could easily be extended to cover priorities, if need
>>>>     be.
>>>
>>> I want to expand on this: the argument that "if you could allocate a
>>> higher order you should use it" is too simplistic. There are many
>>> reasons in addition to the one above that we want to "fall back" to
>>> higher orders, e.g., those higher orders are not on PCP or from the
>>> local node. When we consider the sequence of orders to try, user
>>> preference is just one of the parameters to the cost function. The
>>> bottom line is that I think we should all agree that there needs to be
>>> a cost function down the road, whatever it looks like. Otherwise I
>>> don't know how we can make "auto" happen.
> 
> I agree that there needs to be a cost function, and as pagecache showed that's
> independent of initial enablement.
> 
>>
>> I don't dispute that this sounds like it could be beneficial, but I see it as
>> research to happen further down the road (as you say), and we don't know what
>> that research might conclude. Also, I think the scope of this is bigger than
>> anonymous memory - you would also likely want to look at the policy for page
>> cache folio order too, since today that's based solely on readahead. So I see it
>> as an optimization that is somewhat orthogonal to small-sized THP.
> 
> Exactly my thoughts.
> 
> The important thing is that we should plan ahead that we still have the option
> to let the admin configure if we cannot make this work automatically in the kernel.
> 
> What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a
> single global toggle.
> 
>>
>> The proposed interface does not imply any preference order - it only states
>> which sizes the user wants the kernel to select from, so I think there is lots
>> of freedom to change this down the track if the kernel wants to start using the
>> buddy allocator's state as a signal to make its decisions.
> 
> Yes.
> 
> [..]
> 
>>>> Jup, same opinion here. But again, I'm very happy to hear other
>>>> alternatives and why they are better.
>>>
>>> I'm not against David's proposal but I want to hear a lot more about
>>> "lots of flexibility for growth" before I'm fully convinced.
>>
>> My point was that in an abstract sense, there are properties a user may wish to
>> apply individually to a size, which is catered for by having a per-size
>> directory into which we can add more files if/when requirements for new per-size
>> properties arise. There are also properties that may be applied globally, for
>> which we have the top-level transparent_hugepage directory where properties can
>> be extended or added.
> 
> Exactly, well said.
> 
>>
>> For your case around tighter integration with the buddy allocator, I could
>> imagine a per-size file allowing the user to specify if the kernel should allow
>> splitting a higher order to make a THP of that size (I'm not suggesting that's a
>> good idea, I'm just pointing out that this sort of thing is possible with the
>> interface). And we have discussed how the global enabled property could be
>> extended to support "auto" [1].
>>
>> But perhaps what we really need are lots more ideas for future directions for
>> small-sized THP to allow us to evaluate this interface more widely.
> 
> David R. motivated a future size-aware setting of the defrag option. As
> discussed we might want something similar to shmem_enabled. What will happen with
> khugepaged, nobody knows yet :)
> 
> I could imagine exposing per-size boolean read-only properties like
> "native-hw-size" (PMD, cont-pte). But these things require much more thought.

FWIW, the reason I opted for the "recommend" special case in the v5 posting was
because that felt like an easy thing to also add to the command line in future.
Having a separate file, native-hw-size, that the user has to read then enable
through another file is not very command-line friendly, if you want the
hw-preferred size(s) enabled from boot.

Maybe the wider observation is "how does the proposed interface translate to the
kernel command line if needed in future?".
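
To make that question a bit more concrete, here is a minimal userspace sketch
(Python, not kernel code) of how a set of enabled sizes plus a "recommend"
keyword might be expressed as a single comma-separated string, which is the
shape a boot parameter would need. Everything in it is an assumption for
illustration: the value syntax, the function names, the 4K base page and the
hw_preferred placeholder sizes are all hypothetical and are not part of this
series.

```python
# Userspace model only; the value syntax parsed here is hypothetical.
PAGE_SIZE = 4096  # assumed base page size

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def size_to_order(text, page_size=PAGE_SIZE):
    """Convert a size such as '64K' to a folio order relative to page_size."""
    text = text.strip().upper()
    if not text or text[-1] not in UNITS or not text[:-1].isdigit():
        raise ValueError(f"bad size: {text!r}")
    nbytes = int(text[:-1]) * UNITS[text[-1]]
    pages, rem = divmod(nbytes, page_size)
    # Accept only power-of-two multiples of the base page, larger than one page.
    if rem or pages < 2 or pages & (pages - 1):
        raise ValueError(f"not a power-of-two multiple of the page size: {text!r}")
    return pages.bit_length() - 1


def parse_enabled_sizes(value, hw_preferred=("64K", "2M")):
    """Return the set of orders selected by a comma-separated list.

    'recommend' expands to hw_preferred, a made-up placeholder for whatever
    sizes the architecture would report as preferred.
    """
    orders = set()
    for token in value.split(","):
        token = token.strip()
        if token == "recommend":
            orders.update(size_to_order(s) for s in hw_preferred)
        elif token:
            orders.add(size_to_order(token))
    return orders
```

The result is deliberately an unordered set: like the proposed sysfs
interface, it only says which sizes the kernel may choose from and implies no
preference order. A keyword such as "recommend" fits in one string, whereas
the read-one-file-then-write-another flow has no obvious single-string
equivalent.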