On 05/12/2023 04:21, Barry Song wrote:
> On Mon, Dec 4, 2023 at 11:21 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> In preparation for adding support for anonymous multi-size THP,
>> introduce new sysfs structure that will be used to control the new
>> behaviours. A new directory is added under transparent_hugepage for each
>> supported THP size, and contains an `enabled` file, which can be set to
>> "inherit" (to inherit the global setting), "always", "madvise" or
>> "never". For now, the kernel still only supports PMD-sized anonymous
>> THP, so only 1 directory is populated.
>>
>> The first half of the change converts transhuge_vma_suitable() and
>> hugepage_vma_check() so that they take a bitfield of orders for which
>> the user wants to determine support, and the functions filter out all
>> the orders that can't be supported, given the current sysfs
>> configuration and the VMA dimensions. If there is only 1 order set in
>> the input then the output can continue to be treated like a boolean;
>> this is the case for most call sites. The resulting functions are
>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>> respectively.
>>
>> The second half of the change implements the new sysfs interface. It has
>> been done so that each supported THP size has a `struct thpsize`, which
>> describes the relevant metadata and is itself a kobject. This is pretty
>> minimal for now, but should make it easy to add new per-thpsize files to
>> the interface if needed in future (e.g. per-size defrag). Rather than
>> keep the `enabled` state directly in the struct thpsize, I've elected to
>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>> bitfields since this reduces the amount of work required in
>> thp_vma_allowable_orders() which is called for every page fault.
>>
>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>> commit, for details of how the new sysfs interface works.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>
> Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>

Thanks!

>
>> -khugepaged will be automatically started when
>> -transparent_hugepage/enabled is set to "always" or "madvise, and it'll
>> -be automatically shutdown if it's set to "never".
>> +khugepaged will be automatically started when one or more hugepage
>> +sizes are enabled (either by directly setting "always" or "madvise",
>> +or by setting "inherit" while the top-level enabled is set to "always"
>> +or "madvise"), and it'll be automatically shutdown when the last
>> +hugepage size is disabled (either by directly setting "never", or by
>> +setting "inherit" while the top-level enabled is set to "never").
>>
>> Khugepaged controls
>> -------------------
>>
>> +.. note::
>> +   khugepaged currently only searches for opportunities to collapse to
>> +   PMD-sized THP and no attempt is made to collapse to other THP
>> +   sizes.
>
> For small-size THP, collapse is probably a bad idea. we like a one-shot
> try in Android especially we are using a 64KB and less large folio size. if
> PF succeeds in getting large folios, we map large folios, otherwise we
> give up as those memories can be quite unstably swapped-out, swapped-in
> and madvised to be DONTNEED.
>
> too many compactions will increase power consumption and decrease UI
> response.

Understood; that's very useful information for the Android context. Multiple
people have made comments about eventually needing khugepaged (or something
similar) support in the server context though to async collapse to contpte size.
Actually one suggestion was a user space daemon that scans and collapses with
MADV_COLLAPSE. I suspect the key will be to ensure whatever solution we go for
is flexible and can be enabled/disabled/configured for the different environments.

>
> Thanks
> Barry