ANON_LARGE_FOLIOS meeting follow-up & refined proposal

Hi All,

Thanks for participating in the discussion yesterday - it finally feels like we
are converging on the MVP feature set. Below are my notes from the call and a
modified proposal for controls and stats. It would be great if we can continue
to review and refine over email. I'm also planning to post an implementation
within the next couple of weeks, which I hope will also accelerate convergence.


Roadmap
-------

Stage 1: (MVP) Propose to add minimal runtime controls and stats (as outlined
below). Nobody on the call objected that this feature set was either too small
or too large for the initial submission.

Stage 2: Focus on decreasing memory wastage. Plan A will attempt to do this
automatically within the kernel (I highlighted some ideas in the slide pack
which we didn't get time to cover). Plan B is to add finer-grained controls
to tune things at the memcg/process/VMA level (TBD). I'm not covering this
stage in this email.


Naming
------

We may add large folio support to shmem in future, which may need some separate
controls (TBD). As a result, the consensus was to have a generic name, "large
folio", which is specialized for anon memory. In future it could also be
specialized for shmem.

I'm going to reflect this in the kernel naming by changing LARGE_ANON_FOLIO to
ANON_LARGE_FOLIO, which makes "LARGE_FOLIO" greppable.

I'm also reflecting this in the sysfs controls. I'll create a directory
'/sys/kernel/mm/large_folio' as the root. There are 2 main options for laying
out the controls within it:

- Put shared controls directly in this directory. Add a sub-directory 'anon' for
  anon-specific controls (and in future 'shmem'...)
- Put all controls in the root directory and prefix the filename for
  anon-specific controls with 'anon' (e.g. anon_enabled).

Given I don't think there will be many anon-specific controls (1 for now), and
THP already uses the latter scheme, I'm proposing to go with the latter.


Controls
--------

Modified proposal, after discussion yesterday:

- boot_param: anon_large_folio
    - =always|never|madvise
    - sets boot-up default for large_folio/anon_enabled
- sysfs: /sys/kernel/mm/large_folio/anon_enabled
    - =always|never|madvise
- sysfs: /sys/kernel/mm/large_folio/defrag
    - =always|defer|defer+madvise|madvise|never
    - Expected to be shared between anon and shmem if shmem support is added
        - this is already true for THP
    - Kirill suggested dropping it and hardcoding to "never" (GFP_TRANSHUGE_LIGHT)
        - Yu previously commented that GFP_TRANSHUGE_LIGHT isn't always ideal [1]
        - So the current series hooks THP's defrag setting
        - Given we want to separate THP and LAF (large anon folios), I'm
          proposing to keep it
- debugfs: /sys/kernel/debug/mm/large_folio/anon_orders
    - Comma-separated, descending list of orders to try
    - Default: arch_wants_pte_order(),PAGE_ALLOC_COSTLY_ORDER
    - 0 always implicitly appended to end
    - Max allowed is PMD_ORDER-1
    - intended for developers to experiment
    - debugfs means we can change/remove it or promote it to sysfs later
- MADV_NOHUGEPAGE is honored; LAF disabled for these VMAs
    - Required for correctness of existing use cases (live migration post copy)
- New MADV_LARGEFOLIO madvise opcode
    - Like MADV_HUGEPAGE but for large folio
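
To make the anon_orders rules above concrete, here is a minimal userspace
sketch of how such a list might be validated. It is not code from the series:
the names parse_anon_orders(), SKETCH_PMD_ORDER and SKETCH_MAX_ORDERS are
hypothetical, PMD_ORDER is assumed to be 9 (4K base pages), and treating
"descending" as strictly descending is my assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Assumption: PMD_ORDER is 9 (4K pages, 2M PMD); the real value is arch-specific. */
#define SKETCH_PMD_ORDER  9
/* Orders PMD_ORDER-1 .. 0 give at most PMD_ORDER entries; one spare slot. */
#define SKETCH_MAX_ORDERS (SKETCH_PMD_ORDER + 1)

/*
 * Parse a comma-separated, strictly descending list of orders, e.g. "4,3".
 * Every order must lie in [0, PMD_ORDER-1]. Order 0 is appended unless the
 * list already ends with it. A single trailing newline is tolerated.
 * Returns the number of entries written to out[], or -EINVAL.
 */
static int parse_anon_orders(const char *s, int out[SKETCH_MAX_ORDERS])
{
	long prev = SKETCH_PMD_ORDER;	/* first entry must be below PMD_ORDER */
	int n = 0;

	assert(s != NULL && out != NULL);

	while (*s != '\0' && *s != '\n') {
		char *end;
		long order;

		errno = 0;
		order = strtol(s, &end, 10);
		if (end == s || errno != 0)
			return -EINVAL;		/* not a number */
		if (order < 0 || order >= prev)
			return -EINVAL;		/* out of range or not descending */

		out[n++] = (int)order;
		prev = order;

		if (*end == ',') {
			s = end + 1;
			if (*s == '\0' || *s == '\n')
				return -EINVAL;	/* trailing comma */
		} else if (*end == '\0' || *end == '\n') {
			break;
		} else {
			return -EINVAL;		/* unexpected separator */
		}
	}

	/* "0 always implicitly appended to end" */
	if (n == 0 || out[n - 1] != 0)
		out[n++] = 0;

	return n;
}
```

With these assumptions, "4,3" yields the fallback sequence 4, 3, 0, while "9"
(not below PMD_ORDER) and "3,4" (not descending) are both rejected.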

Optional:

DavidR suggested adding the ability to set a VMA-specific LAF order, using
process_madvise():
    - Optionally accept the LAF order through the flags param of
      process_madvise(MADV_LARGEFOLIO)
    - When no LAF order is passed (or when called with madvise()), use the
      global LAF order

Personally, I would prefer to avoid a VMA-specific LAF order for an initial
submission and instead defer the addition until a clear need is identified.
Thoughts?


Stats
-----

meminfo:AnonHugePages, smaps:AnonHugePages and memory.stat:anon_thp will
continue to account THP only.

I plan to add meminfo:AnonLargeFolio, smaps:AnonLargeFolio and
memory.stat:anon_large_folio to account LAFs.
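
For anyone wanting to see what consuming the split counters might look like
from userspace, here is a small sketch that extracts one field from
/proc/meminfo-style text. The helper name meminfo_field_kb() is hypothetical,
and AnonLargeFolio is only the field proposed above, so it won't appear in any
existing kernel's output. The sketch assumes the new field would use the same
"Name:   value kB" layout as AnonHugePages.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find "<name>:   <value> kB" in /proc/meminfo-style text.
 * The name must match a whole field name, so "Anon" does not match
 * "AnonHugePages". Returns the value in kB, or -1 if the field is absent
 * or has no numeric value.
 */
static long meminfo_field_kb(const char *text, const char *name)
{
	const char *line = text;
	size_t len;

	assert(text != NULL && name != NULL);
	len = strlen(name);

	while (*line != '\0') {
		const char *next = strchr(line, '\n');

		if (strncmp(line, name, len) == 0 && line[len] == ':') {
			long kb;

			/* %ld skips the padding spaces after the colon. */
			if (sscanf(line + len + 1, "%ld", &kb) == 1)
				return kb;
			return -1;
		}
		if (next == NULL)
			break;
		line = next + 1;
	}
	return -1;
}
```

A caller would read /proc/meminfo into a buffer and query "AnonHugePages" and
"AnonLargeFolio" separately, treating -1 for the latter as "kernel does not
report LAF usage".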

Do I need to add counters to vmstat also? (e.g. large_folio_fault_alloc,
large_folio_fault_fallback, etc) - would need to think about which counters and
what they mean if so.


Thanks,
Ryan


[1] https://lore.kernel.org/linux-mm/CAOUHufYWtsAU4PvKpVhzJUeQb9cd+BifY9KzgceBXHp2F2dDRg@xxxxxxxxxxxxxx/



