On 27/10/2023 13:29, David Hildenbrand wrote:
> On 27.10.23 14:27, Ryan Roberts wrote:
>> On 26/10/2023 16:19, David Hildenbrand wrote:
>>> [...]
>>>
>>>>>> Hi,
>>>>>>
>>>>>> I wanted to remind people in the THP cabal meeting, but that either
>>>>>> didn't happen or zoomed decided to not let me join :)
>>>>
>>>> I didn't make it yesterday either - was having to juggle child care.
>>>
>>> I think it didn't happen, or started quite late (>20 min).
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> It's been a week since the mm alignment meeting discussion we had around
>>>>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>>>>> fatal flaws in it :).
>>>>>>
>>>>>> After saying in the call probably 10 times that people should comment
>>>>>> here if there are reasonable alternatives worth discussing, call me
>>>>>> "optimistic" as well; but, it's only been a week and people might still
>>>>>> be thinking about this.
>>>>>>
>>>>>> There were two things discussed in the call:
>>>>>>
>>>>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>>>>   in the call, this (a) might not be needed right now in an initial
>>>>>>   version; (b) the kernel might be able to handle that (or many cases)
>>>>>>   automatically, TBD. Adding lists now would kind-of set the semantics
>>>>>>   of that interface in stone. As you describe below, the approach
>>>>>>   discussed here could easily be extended to cover priorities, if need
>>>>>>   be.
>>>>>
>>>>> I want to expand on this: the argument that "if you could allocate a
>>>>> higher order you should use it" is too simplistic. There are many
>>>>> reasons in addition to the one above that we want to "fall back" to
>>>>> higher orders, e.g., those higher orders are not on PCP or from the
>>>>> local node. When we consider the sequence of orders to try, user
>>>>> preference is just one of the parameters to the cost function.
>>>>> The bottom line is that I think we should all agree that there needs
>>>>> to be a cost function down the road, whatever it looks like. Otherwise I
>>>>> don't know how we can make "auto" happen.
>>>
>>> I agree that there needs to be a cost function, and as pagecache showed that's
>>> independent of initial enablement.
>>>
>>>>
>>>> I don't dispute that this sounds like it could be beneficial, but I see it as
>>>> research to happen further down the road (as you say), and we don't know what
>>>> that research might conclude. Also, I think the scope of this is bigger than
>>>> anonymous memory - you would also likely want to look at the policy for page
>>>> cache folio order too, since today that's based solely on readahead. So I see it
>>>> as an optimization that is somewhat orthogonal to small-sized THP.
>>>
>>> Exactly my thoughts.
>>>
>>> The important thing is that we should plan ahead that we still have the option
>>> to let the admin configure if we cannot make this work automatically in the
>>> kernel.
>>>
>>> What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a
>>> single global toggle.
>>>
>>>>
>>>> The proposed interface does not imply any preference order - it only states
>>>> which sizes the user wants the kernel to select from, so I think there is lots
>>>> of freedom to change this down the track if the kernel wants to start using the
>>>> buddy allocator's state as a signal to make its decisions.
>>>
>>> Yes.
>>>
>>> [..]
>>>
>>>>>> Jup, same opinion here. But again, I'm very happy to hear other
>>>>>> alternatives and why they are better.
>>>>>
>>>>> I'm not against David's proposal but I want to hear a lot more about
>>>>> "lots of flexibility for growth" before I'm fully convinced.
>>>>
>>>> My point was that in an abstract sense, there are properties a user may wish to
>>>> apply individually to a size, which is catered for by having a per-size
>>>> directory into which we can add more files if/when requirements for new per-size
>>>> properties arise. There are also properties that may be applied globally, for
>>>> which we have the top-level transparent_hugepage directory where properties can
>>>> be extended or added.
>>>
>>> Exactly, well said.
>>>
>>>>
>>>> For your case around tighter integration with the buddy allocator, I could
>>>> imagine a per-size file allowing the user to specify if the kernel should allow
>>>> splitting a higher order to make a THP of that size (I'm not suggesting that's a
>>>> good idea, I'm just pointing out that this sort of thing is possible with the
>>>> interface). And we have discussed how the global enabled property could be
>>>> extended to support "auto" [1].
>>>>
>>>> But perhaps what we really need are lots more ideas for future directions for
>>>> small-sized THP to allow us to evaluate this interface more widely.
>>>
>>> David R. motivated a future size-aware setting of the defrag option. As
>>> discussed we might want something similar to shmem_enable. What will happen with
>>> khugepaged, nobody knows yet :)
>>>
>>> I could imagine exposing per-size boolean read-only properties like
>>> "native-hw-size" (PMD, cont-pte). But these things require much more thought.
>>
>> FWIW, the reason I opted for the "recommend" special case in the v5 posting was
>> because that felt like an easy thing to also add to the command line in future.
>> Having a separate file, native-hw-size, that the user has to read then enable
>> through another file is not very command-line friendly, if you want the
>> hw-preferred size(s) enabled from boot.
>
> Jup. I strongly suspect distributions will just have their setup script to
> handle such things, though.

OK fair enough.
>
>>
>> Maybe the wider observation is "how does the proposed interface translate to the
>> kernel command line if needed in future?".
>
> I guess in the distant future, "auto" is what we want.

Looks like hugetlb solves this with a magic tuple, where hugepagesz sets the
"directory" for the following properties. So if we did need to support per-size
properties on the command line, we have precedent to follow:

hugepagesz=X hugepages=Y

>
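
To make the tuple semantics concrete, here is a minimal userspace sketch in
Python of how a size selector can scope the properties that follow it. This is
only an illustration, not the kernel's parser: the function name
parse_per_size and the "default" bucket for properties seen before any
selector are hypothetical names chosen for this example.

```python
def parse_per_size(cmdline, selector="hugepagesz"):
    """Group key=value tokens under the most recent selector token.

    Illustrative only: mimics the "selector sets the directory for the
    following properties" idea, not the kernel's actual parsing code.
    """
    sizes = {}
    current = "default"  # hypothetical bucket used before any selector is seen
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # bare flags are ignored in this sketch
        if key == selector:
            # A selector switches the "directory" for subsequent properties.
            current = value
            sizes.setdefault(current, {})
        else:
            # Any other key=value pair is recorded under the current size.
            sizes.setdefault(current, {})[key] = value
    return sizes


print(parse_per_size("hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=512"))
```

A per-size THP property on the command line could, in principle, be grouped
the same way, with a size selector followed by whatever per-size keys end up
existing.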