On 05/12/2023 09:57, David Hildenbrand wrote:
> On 05.12.23 10:50, Ryan Roberts wrote:
>> On 05/12/2023 04:21, Barry Song wrote:
>>> On Mon, Dec 4, 2023 at 11:21 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> In preparation for adding support for anonymous multi-size THP,
>>>> introduce new sysfs structure that will be used to control the new
>>>> behaviours. A new directory is added under transparent_hugepage for each
>>>> supported THP size, and contains an `enabled` file, which can be set to
>>>> "inherit" (to inherit the global setting), "always", "madvise" or
>>>> "never". For now, the kernel still only supports PMD-sized anonymous
>>>> THP, so only 1 directory is populated.
>>>>
>>>> The first half of the change converts transhuge_vma_suitable() and
>>>> hugepage_vma_check() so that they take a bitfield of orders for which
>>>> the user wants to determine support, and the functions filter out all
>>>> the orders that can't be supported, given the current sysfs
>>>> configuration and the VMA dimensions. If there is only 1 order set in
>>>> the input then the output can continue to be treated like a boolean;
>>>> this is the case for most call sites. The resulting functions are
>>>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>>>> respectively.
>>>>
>>>> The second half of the change implements the new sysfs interface. It has
>>>> been done so that each supported THP size has a `struct thpsize`, which
>>>> describes the relevant metadata and is itself a kobject. This is pretty
>>>> minimal for now, but should make it easy to add new per-thpsize files to
>>>> the interface if needed in future (e.g. per-size defrag). Rather than
>>>> keep the `enabled` state directly in the struct thpsize, I've elected to
>>>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>>>> bitfields since this reduces the amount of work required in
>>>> thp_vma_allowable_orders() which is called for every page fault.
>>>>
>>>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>>>> commit, for details of how the new sysfs interface works.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>
>>> Reviewed-by: Barry Song <v-songbaohua@xxxxxxxx>
>>
>> Thanks!
>>
>>>
>>>> -khugepaged will be automatically started when
>>>> -transparent_hugepage/enabled is set to "always" or "madvise, and it'll
>>>> -be automatically shutdown if it's set to "never".
>>>> +khugepaged will be automatically started when one or more hugepage
>>>> +sizes are enabled (either by directly setting "always" or "madvise",
>>>> +or by setting "inherit" while the top-level enabled is set to "always"
>>>> +or "madvise"), and it'll be automatically shutdown when the last
>>>> +hugepage size is disabled (either by directly setting "never", or by
>>>> +setting "inherit" while the top-level enabled is set to "never").
>>>>
>>>> Khugepaged controls
>>>> -------------------
>>>>
>>>> +.. note::
>>>> + khugepaged currently only searches for opportunities to collapse to
>>>> + PMD-sized THP and no attempt is made to collapse to other THP
>>>> + sizes.
>>>
>>> For small-size THP, collapse is probably a bad idea. we like a one-shot
>>> try in Android especially we are using a 64KB and less large folio size. if
>>> PF succeeds in getting large folios, we map large folios, otherwise we
>>> give up as those memories can be quite unstably swapped-out, swapped-in
>>> and madvised to be DONTNEED.
>>>
>>> too many compactions will increase power consumption and decrease UI
>>> response.
>>
>> Understood; that's very useful information for the Android context. Multiple
>> people have made comments about eventually needing khugepaged (or something
>> similar) support in the server context though to async collapse to contpte size.
>> Actually one suggestion was a user space daemon that scans and collapses with
>> MADV_COLLAPSE.
>> I suspect the key will be to ensure whatever solution we go for
>> is flexible and can be enabled/disabled/configured for the different
>> environments.
>
> There certainly is interest for 2 MiB THP on arm64 64k where the THP size would
> normally be 512 MiB. In that scenario, khugepaged makes perfect sense.

Indeed
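
For anyone following the thread, the order filtering described in the quoted commit message (a caller passes a bitfield of orders, and the orders not permitted by the sysfs configuration or the VMA dimensions are masked out) can be modelled in user space roughly as below. This is only a sketch under stated assumptions: `struct thp_config`, `struct vma_model`, `enabled_orders()`, `range_fits()` and `allowable_orders()` are hypothetical names chosen for illustration, the 4 KiB base page and order-9 PMD are assumed values, and none of this is the actual kernel code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; all names are illustrative only. */
#define PAGE_SIZE  4096UL          /* assumption: 4 KiB base pages */
#define PMD_ORDER  9               /* assumption: 2 MiB PMD-sized THP */
#define BIT(n)     (1UL << (n))

/* Models the top-level transparent_hugepage/enabled setting. */
enum global_mode { GLOBAL_NEVER, GLOBAL_MADVISE, GLOBAL_ALWAYS };

/* One bit per order, mirroring the always/madvise/inherit bitfields. */
struct thp_config {
    unsigned long orders_always;
    unsigned long orders_madvise;
    unsigned long orders_inherit;
    enum global_mode global;
};

struct vma_model {
    unsigned long start;   /* inclusive */
    unsigned long end;     /* exclusive */
    bool madvised;         /* region was marked with MADV_HUGEPAGE */
};

/* Orders enabled by the per-size settings, resolving "inherit". */
static unsigned long enabled_orders(const struct thp_config *cfg, bool madvised)
{
    unsigned long mask = cfg->orders_always;

    if (madvised)
        mask |= cfg->orders_madvise;
    if (cfg->global == GLOBAL_ALWAYS ||
        (cfg->global == GLOBAL_MADVISE && madvised))
        mask |= cfg->orders_inherit;
    return mask;
}

/* True if the naturally aligned block of this order around addr fits. */
static bool range_fits(const struct vma_model *vma, unsigned long addr, int order)
{
    unsigned long size = PAGE_SIZE << order;
    unsigned long base = addr & ~(size - 1);

    return base >= vma->start && base + size <= vma->end;
}

/* Filter the requested orders down to those that could be used at addr. */
unsigned long allowable_orders(const struct thp_config *cfg,
                               const struct vma_model *vma,
                               unsigned long addr, unsigned long orders)
{
    assert(vma->start < vma->end);

    orders &= BIT(PMD_ORDER + 1) - 1;              /* ignore orders above PMD */
    orders &= enabled_orders(cfg, vma->madvised);  /* sysfs configuration */

    for (int order = 0; order <= PMD_ORDER; order++) {
        if ((orders & BIT(order)) && !range_fits(vma, addr, order))
            orders &= ~BIT(order);                 /* VMA dimensions */
    }
    return orders;
}
```

When a caller passes a single order, such as `BIT(PMD_ORDER)`, the return value is either zero or that same bit, so it can still be used as a boolean, which matches the behaviour the commit message describes for most call sites.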