On 26/10/2023 16:19, David Hildenbrand wrote:
> [...]
>
>>>> Hi,
>>>>
>>>> I wanted to remind people in the THP cabal meeting, but that either
>>>> didn't happen or Zoom decided to not let me join :)
>>
>> I didn't make it yesterday either - was having to juggle child care.
>
> I think it didn't happen, or started quite late (>20 min).
>
>>
>>>>
>>>>>
>>>>> It's been a week since the mm alignment meeting discussion we had around
>>>>> prerequisites and the ABI. I haven't heard any further feedback on the ABI
>>>>> proposal, so I'm going to be optimistic and assume that nobody has found any
>>>>> fatal flaws in it :).
>>>>
>>>> After saying in the call probably 10 times that people should comment
>>>> here if there are reasonable alternatives worth discussing, call me
>>>> "optimistic" as well; but, it's only been a week and people might still
>>>> be thinking about this.
>>>>
>>>> There were two things discussed in the call:
>>>>
>>>> * Yu brought up "lists" so we can have priorities. As briefly discussed
>>>>   in the call, this (a) might not be needed right now in an initial
>>>>   version; (b) the kernel might be able to handle that (or many cases)
>>>>   automatically, TBD. Adding lists now would kind-of set the semantics
>>>>   of that interface in stone. As you describe below, the approach
>>>>   discussed here could easily be extended to cover priorities, if need
>>>>   be.
>>>
>>> I want to expand on this: the argument that "if you could allocate a
>>> higher order you should use it" is too simplistic. There are many
>>> reasons in addition to the one above that we want to "fall back" to
>>> higher orders, e.g., those higher orders are not on PCP or from the
>>> local node. When we consider the sequence of orders to try, user
>>> preference is just one of the parameters to the cost function. The
>>> bottom line is that I think we should all agree that there needs to be
>>> a cost function down the road, whatever it looks like.
>>> Otherwise I don't know how we can make "auto" happen.
>
> I agree that there needs to be a cost function, and as pagecache showed that's
> independent of initial enablement.
>
>>
>> I don't dispute that this sounds like it could be beneficial, but I see it as
>> research to happen further down the road (as you say), and we don't know what
>> that research might conclude. Also, I think the scope of this is bigger than
>> anonymous memory - you would also likely want to look at the policy for page
>> cache folio order too, since today that's based solely on readahead. So I see it
>> as an optimization that is somewhat orthogonal to small-sized THP.
>
> Exactly my thoughts.
>
> The important thing is that we should plan ahead that we still have the option
> to let the admin configure if we cannot make this work automatically in the kernel.
>
> What we'll need, nobody knows. Maybe it's a per-size priority, maybe it's a
> single global toggle.
>
>>
>> The proposed interface does not imply any preference order - it only states
>> which sizes the user wants the kernel to select from, so I think there is lots
>> of freedom to change this down the track if the kernel wants to start using the
>> buddy allocator's state as a signal to make its decisions.
>
> Yes.
>
> [..]
>
>>>> Jup, same opinion here. But again, I'm very happy to hear other
>>>> alternatives and why they are better.
>>>
>>> I'm not against David's proposal but I want to hear a lot more about
>>> "lots of flexibility for growth" before I'm fully convinced.
>>
>> My point was that in an abstract sense, there are properties a user may wish to
>> apply individually to a size, which is catered for by having a per-size
>> directory into which we can add more files if/when requirements for new per-size
>> properties arise. There are also properties that may be applied globally, for
>> which we have the top-level transparent_hugepage directory where properties can
>> be extended or added.
>
> Exactly, well said.
>
>>
>> For your case around tighter integration with the buddy allocator, I could
>> imagine a per-size file allowing the user to specify if the kernel should allow
>> splitting a higher order to make a THP of that size (I'm not suggesting that's a
>> good idea, I'm just pointing out that this sort of thing is possible with the
>> interface). And we have discussed how the global enabled property could be
>> extended to support "auto" [1].
>>
>> But perhaps what we really need are lots more ideas for future directions for
>> small-sized THP to allow us to evaluate this interface more widely.
>
> David R. motivated a future size-aware setting of the defrag option. As
> discussed we might want something similar to shmem_enable. What will happen with
> khugepaged, nobody knows yet :)
>
> I could imagine exposing per-size boolean read-only properties like
> "native-hw-size" (PMD, cont-pte). But these things require much more thought.

FWIW, the reason I opted for the "recommend" special case in the v5 posting was
because that felt like an easy thing to also add to the command line in future.
Having a separate file, native-hw-size, that the user has to read and then
enable through another file is not very command-line friendly if you want the
hw-preferred size(s) enabled from boot. Maybe the wider observation is "how does
the proposed interface translate to the kernel command line if needed in
future?".
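
To make the command-line question concrete, here is a rough, purely
illustrative sketch of how a boot parameter value might map onto per-size
settings. Everything in it is an assumption for discussion only: the
"<size>:<policy>" syntax, the function name parse_thp_cmdline, and the idea
that "recommend" expands to whatever sizes the hardware prefers. None of this
is an existing or proposed kernel interface.

```python
# Hypothetical sketch only: the parameter syntax and names are invented
# for illustration and do not correspond to any real kernel interface.

VALID_POLICIES = {"always", "madvise", "never"}


def parse_thp_cmdline(value, hw_preferred):
    """Map a hypothetical boot parameter value to per-size settings.

    value:        comma-separated "<size>:<policy>" entries,
                  e.g. "16K:madvise,recommend:always".
    hw_preferred: list of sizes that the special size "recommend"
                  expands to (assumed to be supplied by the architecture).
    Returns a dict of size -> policy; later entries override earlier ones.
    """
    settings = {}
    for entry in value.split(","):
        size, sep, policy = entry.partition(":")
        if not sep or not size or policy not in VALID_POLICIES:
            raise ValueError(f"bad entry: {entry!r}")
        # "recommend" is a shorthand, so no separate read-then-write step
        # is needed to enable the hw-preferred sizes from boot.
        sizes = hw_preferred if size == "recommend" else [size]
        for s in sizes:
            settings[s] = policy
    return settings
```

With a shorthand like this, the hw-preferred sizes can be enabled in one step
from boot, which a read-only native-hw-size file cannot offer on its own.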