Hi All,

Thanks for participating in the discussion yesterday - it finally feels like we are converging on the MVP feature set. Below are my notes from the call and a modified proposal for the controls and stats. It would be great if we could continue to review and refine these over email. I'm also planning to post an implementation within the next couple of weeks, which I hope will also accelerate convergence.

Roadmap
-------

Stage 1 (MVP): Propose to add minimal runtime controls and stats, as outlined below. Nobody on the call felt this feature set was either too small or too big for the initial submission.

Stage 2: Focus on decreasing memory wastage. Plan A will attempt to do this automatically within the kernel (I highlighted some ideas in the slide pack which we didn't get time to cover). Plan B is to add more fine-grained controls to tune things at the memcg/process/VMA level (TBD). I'm not covering this stage in this email.

Naming
------

We may add large folio support to shmem in future, which may need some separate controls (TBD). As a result, the consensus was to use the generic name "large folio", specialized for anon memory for now; in future it could also be specialized for shmem. I'm going to reflect this in the kernel naming by changing LARGE_ANON_FOLIO to ANON_LARGE_FOLIO, so that "LARGE_FOLIO" stays greppable.

I'm also reflecting this in the sysfs controls. I'll create a directory '/sys/kernel/mm/large_folio' as the root. Within that there are 2 main options:

- Put shared controls directly in this directory, and add a sub-directory 'anon' for anon-specific controls (and in future 'shmem', ...).
- Put all controls in the root directory and prefix the filenames of anon-specific controls with 'anon' (e.g. anon_enabled).

Given I don't think there will be many anon-specific controls (1 for now), and THP already uses the latter scheme, I'm proposing to go with the latter.
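To make the sysfs side concrete, here is a rough sketch of how anon_enabled could be registered under /sys/kernel/mm/large_folio using the standard kobject/sysfs helpers. All identifiers here (laf_anon_enabled, large_folio_sysfs_init, etc.) are illustrative placeholders, not final naming:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/sysfs.h>

/* Illustrative tristate, mirroring THP's enabled= semantics. */
enum laf_enabled { LAF_NEVER, LAF_MADVISE, LAF_ALWAYS };
static enum laf_enabled laf_anon_enabled = LAF_MADVISE;

static ssize_t anon_enabled_show(struct kobject *kobj,
                                 struct kobj_attribute *attr, char *buf)
{
        static const char * const output[] = {
                [LAF_NEVER]   = "always [never] madvise\n",
                [LAF_MADVISE] = "always never [madvise]\n",
                [LAF_ALWAYS]  = "[always] never madvise\n",
        };

        return sysfs_emit(buf, "%s", output[laf_anon_enabled]);
}

static ssize_t anon_enabled_store(struct kobject *kobj,
                                  struct kobj_attribute *attr,
                                  const char *buf, size_t count)
{
        if (sysfs_streq(buf, "always"))
                laf_anon_enabled = LAF_ALWAYS;
        else if (sysfs_streq(buf, "madvise"))
                laf_anon_enabled = LAF_MADVISE;
        else if (sysfs_streq(buf, "never"))
                laf_anon_enabled = LAF_NEVER;
        else
                return -EINVAL;

        return count;
}

static struct kobj_attribute anon_enabled_attr =
        __ATTR(anon_enabled, 0644, anon_enabled_show, anon_enabled_store);

static struct attribute *large_folio_attrs[] = {
        &anon_enabled_attr.attr,
        /* The shared 'defrag' control would sit here too, unprefixed. */
        NULL,
};

static const struct attribute_group large_folio_attr_group = {
        .attrs = large_folio_attrs,
};

static int __init large_folio_sysfs_init(void)
{
        /* Creates /sys/kernel/mm/large_folio/ under the existing mm kobject. */
        struct kobject *kobj = kobject_create_and_add("large_folio", mm_kobj);

        if (!kobj)
                return -ENOMEM;

        return sysfs_create_group(kobj, &large_folio_attr_group);
}
subsys_initcall(large_folio_sysfs_init);

With the flat layout, any future shmem-shared controls would simply be additional, unprefixed entries in the same attribute group.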
Controls
--------

Modified proposal, after the discussion yesterday:

- boot_param: anon_large_folio
  - =always|never|madvise
  - Sets the boot-up default for large_folio/anon_enabled

- sysfs: /sys/kernel/mm/large_folio/anon_enabled
  - =always|never|madvise

- sysfs: /sys/kernel/mm/large_folio/defrag
  - =always|defer|defer+madvise|madvise|never
  - I anticipate this would be shared between anon and shmem if shmem support is added
    - This is already true for THP
  - Kirill suggested dropping it and hardcoding to "never" (GFP_TRANSHUGE_LIGHT)
    - Yu previously commented that GFP_TRANSHUGE_LIGHT isn't always ideal [1]
    - So the current series hooks THP's defrag setting
    - Given we want to separate THP and LAF, I'm proposing to keep it

- debugfs: /sys/kernel/debug/mm/large_folio/anon_orders
  - Comma-separated, descending list of orders to try
  - Default: arch_wants_pte_order(),PAGE_ALLOC_COSTLY_ORDER
  - 0 is always implicitly appended to the end
  - Max allowed is PMD_ORDER-1
  - Intended for developers to experiment with
  - debugfs means we can change it, remove it, or promote it to sysfs later

- MADV_NOHUGEPAGE is honored; LAF is disabled for these VMAs
  - Required for correctness of existing use cases (e.g. live migration post-copy)

- New MADV_LARGEFOLIO madvise opcode
  - Like MADV_HUGEPAGE but for large folios (see the rough userspace illustration at the end of this mail)

Optional: DavidR suggested adding the ability to set a VMA-specific LAF order, using process_madvise():

- Optionally accept a LAF order through the flags param of process_madvise(MADV_LARGEFOLIO)
- When no LAF order is passed (or when called via madvise()), use the global LAF order

Personally, I would prefer to avoid a VMA-specific LAF order for the initial submission and instead defer the addition until a clear need is identified. Thoughts?

Stats
-----

meminfo:AnonHugePages, smaps:AnonHugePages and memory.stat:anon_thp will continue to account THP only. I plan to add meminfo:AnonLargeFolio, smaps:AnonLargeFolio and memory.stat:anon_large_folio to account LAFs.

Do I need to add counters to vmstat as well (e.g. large_folio_fault_alloc, large_folio_fault_fallback, etc)? If so, I would need to think about which counters we want and what exactly they mean; the fault-path sketch at the end of this mail shows where such counters might be bumped.

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/CAOUHufYWtsAU4PvKpVhzJUeQb9cd+BifY9KzgceBXHp2F2dDRg@xxxxxxxxxxxxxx/
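P.S. To make the anon_orders semantics and the potential vmstat counters more concrete, here is a rough sketch of how the anon fault path might walk a descending order list and fall back on allocation failure. All names here (laf_anon_orders[], laf_alloc_anon_folio(), the commented-out counter events) are hypothetical; in practice the list would be seeded from arch_wants_pte_order(),PAGE_ALLOC_COSTLY_ORDER and parsed from the debugfs file.

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/*
 * Hypothetical descending order list; 4 stands in for arch_wants_pte_order()
 * (e.g. order-4 == 64K with 4K pages), and 0 is the implicitly-appended
 * single-page fallback. Real code would populate this from debugfs.
 */
static int laf_anon_orders[] = { 4, PAGE_ALLOC_COSTLY_ORDER, 0 };

static struct folio *laf_alloc_anon_folio(struct vm_area_struct *vma,
                                          unsigned long addr, gfp_t gfp)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(laf_anon_orders); i++) {
                int order = laf_anon_orders[i];
                unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE << order);
                struct folio *folio;

                /* Skip orders whose naturally aligned range spills outside the VMA. */
                if (start < vma->vm_start ||
                    start + (PAGE_SIZE << order) > vma->vm_end)
                        continue;

                folio = vma_alloc_folio(gfp, order, vma, start, /* hugepage = */ true);
                if (folio) {
                        /* e.g. count_vm_event(LAF_FAULT_ALLOC); */
                        return folio;
                }

                /* e.g. count_vm_event(LAF_FAULT_FALLBACK); */
        }

        return NULL;    /* caller falls back to the existing single-page path */
}

The VMA-fit check is only there to show why smaller orders need to stay in the list; a real implementation would also need to honor MADV_NOHUGEPAGE and the anon_enabled setting discussed above, which I've omitted here.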
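And a rough userspace illustration of the proposed MADV_LARGEFOLIO hint. The opcode value below is a made-up placeholder purely so the example compiles; the real value would be assigned in the uapi headers as part of the series:

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_LARGEFOLIO
#define MADV_LARGEFOLIO 30      /* placeholder value, not final */
#endif

int main(void)
{
        size_t len = 32UL * 1024 * 1024;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Opt this range in to large anon folios, like MADV_HUGEPAGE for THP. */
        if (madvise(buf, len, MADV_LARGEFOLIO))
                perror("madvise(MADV_LARGEFOLIO)");

        munmap(buf, len);
        return 0;
}

The optional VMA-specific LAF order would then arrive via process_madvise()'s currently-unused flags argument, per DavidR's suggestion, if we decide we need it.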