Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

Ryan Roberts <ryan.roberts@xxxxxxx> · Fri, 27 Oct 2023 13:47:42 +0100

On 27/10/2023 13:29, David Hildenbrand wrote:
> On 27.10.23 14:27, Ryan Roberts wrote:
>> On 26/10/2023 16:19, David Hildenbrand wrote:
>>> [...]
>>>
>>>>>> Hi,
>>>>>>
>>>>>> I wanted to remind people in the THP cabal meeting, but that either
>>>>>> didn't happen or zoom decided to not let me join :)
>>>>
>>>> I didn't make it yesterday either - was having to juggle child care.
>>>
>>> I think it didn't happen, or started quite late (>20 min).
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> It's been a week since the mm alignment meeting discussion we had around
>>>>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>>>>> fatal flaws in it :).
>>>>>>
>>>>>> After saying in the call probably 10 times that people should comment
>>>>>> here if there are reasonable alternatives worth discussing, call me
>>>>>> "optimistic" as well; but, it's only been a week and people might still
>>>>>> be thinking about this.
>>>>>>
>>>>>> There were two things discussed in the call:
>>>>>>
>>>>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>>>>      in the call, this (a) might not be needed right now in an initial
>>>>>>      version; (b) the kernel might be able to handle that (or many cases)
>>>>>>      automatically, TBD. Adding lists now would kind-of set the semantics
>>>>>>      of that interface in stone. As you describe below, the approach
>>>>>>      discussed here could easily be extended to cover priorities, if need
>>>>>>      be.
>>>>>
>>>>> I want to expand on this: the argument that "if you could allocate a
>>>>> higher order you should use it" is too simplistic. There are many
>>>>> reasons in addition to the one above that we want to "fall back" to
>>>>> lower orders, e.g., those higher orders are not on PCP or from the
>>>>> local node. When we consider the sequence of orders to try, user
>>>>> preference is just one of the parameters to the cost function. The
>>>>> bottom line is that I think we should all agree that there needs to be
>>>>> a cost function down the road, whatever it looks like. Otherwise I
>>>>> don't know how we can make "auto" happen.
>>>
>>> I agree that there needs to be a cost function, and as pagecache showed that's
>>> independent of initial enablement.
>>>
>>>>
>>>> I don't dispute that this sounds like it could be beneficial, but I see it as
>>>> research to happen further down the road (as you say), and we don't know what
>>>> that research might conclude. Also, I think the scope of this is bigger than
>>>> anonymous memory - you would also likely want to look at the policy for page
>>>> cache folio order too, since today that's based solely on readahead. So I see
>>>> it as an optimization that is somewhat orthogonal to small-sized THP.
>>>
>>> Exactly my thoughts.
>>>
>>> The important thing is that we should plan ahead that we still have the option
>>> to let the admin configure if we cannot make this work automatically in the
>>> kernel.
>>>
>>> What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a
>>> single global toggle.
>>>
>>>>
>>>> The proposed interface does not imply any preference order - it only states
>>>> which sizes the user wants the kernel to select from, so I think there is lots
>>>> of freedom to change this down the track if the kernel wants to start using the
>>>> buddy allocator's state as a signal to make its decisions.
>>>
>>> Yes.
>>>
>>> [..]
>>>
>>>>>> Jup, same opinion here. But again, I'm very happy to hear other
>>>>>> alternatives and why they are better.
>>>>>
>>>>> I'm not against David's proposal but I want to hear a lot more about
>>>>> "lots of flexibility for growth" before I'm fully convinced.
>>>>
>>>> My point was that in an abstract sense, there are properties a user may wish to
>>>> apply individually to a size, which is catered for by having a per-size
>>>> directory into which we can add more files if/when requirements for new
>>>> per-size properties arise. There are also properties that may be applied
>>>> globally, for which we have the top-level transparent_hugepage directory where
>>>> properties can be extended or added.
>>>
>>> Exactly, well said.
>>>
>>>>
>>>> For your case around tighter integration with the buddy allocator, I could
>>>> imagine a per-size file allowing the user to specify if the kernel should allow
>>>> splitting a higher order to make a THP of that size (I'm not suggesting that's
>>>> a good idea, I'm just pointing out that this sort of thing is possible with the
>>>> interface). And we have discussed how the global enabled property could be
>>>> extended to support "auto" [1].
>>>>
>>>> But perhaps what we really need are lots more ideas for future directions for
>>>> small-sized THP to allow us to evaluate this interface more widely.
>>>
>>> David R. motivated a future size-aware setting of the defrag option. As
>>> discussed we might want something similar to shmem_enabled. What will happen with
>>> khugepaged, nobody knows yet :)
>>>
>>> I could imagine exposing per-size boolean read-only properties like
>>> "native-hw-size" (PMD, cont-pte). But these things require much more thought.
>>
>> FWIW, the reason I opted for the "recommend" special case in the v5 posting was
>> because that felt like an easy thing to also add to the command line in future.
>> Having a separate file, native-hw-size, that the user has to read then enable
>> through another file is not very command-line friendly, if you want the
>> hw-preferred size(s) enabled from boot.
> 
> Jup. I strongly suspect distributions will just have their setup script to
> handle such things, though.

OK fair enough.

> 
>>
>> Maybe the wider observation is "how does the proposed interface translate to the
>> kernel command line if needed in future?".
> 
> I guess in the distant future, "auto" is what we want.

Looks like hugetlb solves this with a magic tuple, where hugepagesz sets the
"directory" for the following properties. So if we did need to support per-size
properties on the command line, we have precedent to follow:

  hugepagesz=X hugepages=Y
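
To make the "directory" semantics concrete, here is a rough sketch of how such a
tuple could be grouped: each hugepagesz= token selects a size, and the key=value
tokens that follow are attached to that size until the next selector. This is
purely illustrative Python, not the kernel's actual parser; the function name
parse_per_size_params and the example sizes and counts are made up, and (as a
simplification of this sketch) anything that appears before the first selector
is dropped.

```python
def parse_per_size_params(cmdline, selector="hugepagesz"):
    """Group key=value tokens under the most recent selector value.

    Hypothetical helper for illustration only; not a kernel interface.
    """
    groups = {}
    current = None  # no size selected yet
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # bare flags are ignored in this sketch
        if key == selector:
            # The selector opens (or re-opens) a per-size "directory".
            current = value
            groups.setdefault(current, {})
        elif current is not None:
            # Any other property applies to the currently selected size.
            groups[current][key] = value
    return groups


print(parse_per_size_params(
    "hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=4"))
# prints {'2M': {'hugepages': '512'}, '1G': {'hugepages': '4'}}
```

The point is only that the ordering carries the per-size scoping, so new
per-size properties could be added without inventing new selector syntax.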

>