Re: ANON_LARGE_FOLIOS meeting follow-up & refined proposal

On 14/09/2023 09:16, Ryan Roberts wrote:
> Hi All,
> 
> Thanks for participating in the discussion yesterday - it finally feels like we
> are converging on the MVP feature set. Below are my notes from the call and a
> modified proposal for controls and stats. It would be great if we can continue
> to review and refine over email. I'm also planning to post an implementation
> within the next couple of weeks, which I hope will also accelerate convergence.

I never received any feedback on the below; I'm not sure whether that means
everyone is happy or nobody read it.

I've got an implementation for all of the below ready to go, with a few tweaks
to the details (the main change is that anon_orders is now a bitfield where each
set bit represents an order in the set, rather than the originally proposed
comma-separated list of orders).
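
To make the bitfield encoding concrete, here is a minimal Python model. It is
illustrative only and is not the kernel implementation. PMD_ORDER = 9 assumes
4K base pages with 2M PMD mappings, and both helper names are hypothetical.

```python
# Illustrative model of the anon_orders encoding; not kernel code.
PMD_ORDER = 9  # assumption: 4K base pages, 2M PMD mappings


def orders_to_bitfield(orders):
    """Encode a collection of orders as a bitfield, one set bit per order."""
    mask = 0
    for order in orders:
        # For now, PMD_ORDER is the highest order that may be set.
        if not 0 <= order <= PMD_ORDER:
            raise ValueError(f"order {order} is outside 0..{PMD_ORDER}")
        mask |= 1 << order
    return mask


def bitfield_to_orders(mask):
    """Decode a bitfield into its orders, highest order first."""
    return [o for o in range(mask.bit_length() - 1, -1, -1) if mask >> o & 1]


# Equivalent of the old comma-separated list "9,3".
thp_and_order3 = orders_to_bitfield([PMD_ORDER, 3])
print(hex(thp_and_order3), bitfield_to_orders(thp_and_order3))
```

Decoding highest-first mirrors the idea that larger orders are tried before
smaller ones; the bitfield itself carries no ordering.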

BUT I've had yet another idea on the controls front, which would expose this to
user space as an extension to transparent_hugepage, while continuing to support
THP as is and still allowing THP and ALF (anon large folio) usage to be
controlled independently. On reflection, I think it is cleaner to do it this way
for a couple of reasons:

  - We don't have to introduce a whole new feature (ALF) to the user. Most of
    the concepts and controls overlap a lot with THP anyway, so if we can make
    it look like an extension, I think it would be easier to communicate.

  - The approach I have in mind would make it easy to extend to orders greater
    than PMD_ORDER in future if that's a direction we want to eventually go.
    Because >PMD_ORDER implies multiple PMD entries, it would half belong to THP
    and half belong to ALF in the current proposal, which is nasty.

I'll lay out the new proposal now, but I suspect this will ultimately warrant
another mm alignment meeting...


Add 2 controls to sysfs:

/sys/kernel/mm/transparent_hugepage/anon_orders
  - bitfield where set bits are orders that will be tried during allocation
  - defaults to 1<<PMD_ORDER, which gives current THP behaviour with no ALF
  - For now, 1<<PMD_ORDER is highest settable bit, but easy to expand in future
  - To enable ALF, set the appropriate lower bits
  - To disable THP, clear 1<<PMD_ORDER
  - (In future we could add an "auto" option too)

/sys/kernel/mm/transparent_hugepage/anon_always_mask
  - orders in (anon_orders & anon_always_mask) are not subject to madvise
  - so when enabled=madvise, still try (anon_orders & anon_always_mask) orders
    as if enabled=always
  - defaults to 0 (all subject to madvise)
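
The interaction of the two controls with the existing enabled setting and the
madvise hints can be sketched as a small Python model. This is only my reading
of the rules above, not the kernel implementation: the function name, the
string values and the PMD_ORDER value (9, i.e. 4K pages with 2M PMDs) are
assumptions made for illustration.

```python
# Illustrative model of the proposed policy; not kernel code.
PMD_ORDER = 9  # assumption: 4K base pages, 2M PMD mappings


def effective_orders(anon_orders, anon_always_mask, enabled,
                     hint=None, prctl_disabled=False):
    """Return the bitfield of orders that would be tried for one fault.

    enabled: "never", "madvise" or "always" (the THP sysfs setting)
    hint:    None, "MADV_HUGEPAGE" or "MADV_NOHUGEPAGE" (per-VMA advice)
    A result of 0 means only order-0 pages are used.
    """
    if enabled not in ("never", "madvise", "always"):
        raise ValueError(f"unknown enabled value: {enabled!r}")
    if prctl_disabled or enabled == "never" or hint == "MADV_NOHUGEPAGE":
        return 0
    if enabled == "always" or hint == "MADV_HUGEPAGE":
        return anon_orders
    # enabled == "madvise" without a hint: only the "always" orders apply.
    return anon_orders & anon_always_mask


# Defaults give legacy THP behaviour.
default_orders, default_mask = 1 << PMD_ORDER, 0
print(effective_orders(default_orders, default_mask, "madvise"))
print(effective_orders(default_orders, default_mask, "madvise", "MADV_HUGEPAGE"))
```

With the defaults, the first call yields 0 (no large allocations without a
hint) and the second yields 1<<PMD_ORDER, matching current THP behaviour.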


The defaults for those controls give you "legacy THP". But you can modify the
controls to generate policies like this:


THP only - existing behaviour (default):
----------------------------------------

anon_orders = 1<<PMD_ORDER
anon_always_mask = 0

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | S         | THP>S
MADV_HUGEPAGE         | S         | S         | THP>S     | THP>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


ALF only:
---------

anon_orders = 1<<3 (order-3 - example)
anon_always_mask = 0

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | S         | ALF>S
MADV_HUGEPAGE         | S         | S         | ALF>S     | ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


THP and ALF:
------------

anon_orders = 1<<PMD_ORDER | 1<<3
anon_always_mask = 0 (default)

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | S         | THP>ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


THP and ALF, with THP=always, ALF=madvise:
------------------------------------------

anon_orders = 1<<PMD_ORDER | 1<<3
anon_always_mask = 1<<PMD_ORDER

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | THP>S     | THP>ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S


THP and ALF, with THP=madvise, ALF=always:
------------------------------------------

anon_orders = 1<<PMD_ORDER | 1<<3
anon_always_mask = 1<<3

thp prctl:            | dis       | ena       | ena       | ena
thp sysfs:            | X         | never     | madvise   | always
----------------------|-----------|-----------|-----------|-------------
no hint               | S         | S         | ALF>S     | THP>ALF>S
MADV_HUGEPAGE         | S         | S         | THP>ALF>S | THP>ALF>S
MADV_NOHUGEPAGE       | S         | S         | S         | S
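
As a cross-check, the table cells above can be regenerated from the two
controls with a short Python model. It is a sketch under assumptions, not
kernel code: fallback_chain is a hypothetical name, PMD_ORDER = 9 assumes 4K
pages with 2M PMDs, and "S" is read as the order-0 (small page) fallback.

```python
# Illustrative model of the policy tables; not kernel code.
PMD_ORDER = 9  # assumption: 4K base pages, 2M PMD mappings
THP = 1 << PMD_ORDER


def fallback_chain(anon_orders, anon_always_mask, enabled, hint=None,
                   prctl_disabled=False):
    """Return a table cell such as 'THP>ALF>S' for one fault scenario."""
    if prctl_disabled or enabled == "never" or hint == "MADV_NOHUGEPAGE":
        orders = 0
    elif enabled == "always" or hint == "MADV_HUGEPAGE":
        orders = anon_orders
    else:  # enabled == "madvise" and no hint
        orders = anon_orders & anon_always_mask
    chain = []
    if orders & THP:
        chain.append("THP")
    if orders & (THP - 1):  # any order below PMD_ORDER counts as ALF
        chain.append("ALF")
    chain.append("S")  # order-0 is always the final fallback
    return ">".join(chain)


# Regenerate the last table: THP=madvise, ALF=always.
anon_orders, always_mask = THP | 1 << 3, 1 << 3
for hint in (None, "MADV_HUGEPAGE", "MADV_NOHUGEPAGE"):
    row = [fallback_chain(anon_orders, always_mask, "never", hint,
                          prctl_disabled=True)]
    row += [fallback_chain(anon_orders, always_mask, e, hint)
            for e in ("never", "madvise", "always")]
    print(f"{hint or 'no hint':<21} | " + " | ".join(f"{c:<9}" for c in row))
```

Changing anon_orders and always_mask to the values from the other headings
should reproduce the corresponding tables.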


This approach does have the disadvantage that ALF is tied to MADV_HUGEPAGE,
whereas the original proposal (quoted below) introduces a new, independent
MADV_LARGEFOLIO. But personally I don't see that as a major issue. And we could
solve it in future by extending MADV_HUGEPAGE to accept a vma-specific set of
orders via the process_madvise flags.

Thoughts?

I'll hold off posting the implementation of the below for now, while we decide
if it's better to head in this direction.

Thanks,
Ryan



> 
> 
> Roadmap
> -------
> 
> Stage 1: (MVP) Propose to add minimal runtime controls and stats (as outlined
> below). There were no disagreements on the call about this feature set being
> either too little or too big for the initial submission.
> 
> Stage 2: Focus on decreasing memory wastage. Plan A will attempt to do this
> automatically within the kernel (I highlighted some ideas in the slide pack
> which we didn't get time to cover). Plan B is to add more fine-grained controls
> to fine tune things at memcg/process/vma level (TBD). I'm not covering this
> stage in this email.
> 
> 
> Naming
> ------
> 
> We may add large folio support to shmem in future, which may need some separate
> controls (TBD). As a result, consensus was to have a generic name "large folio",
> which is specialized for anon memory. Then in future it could also be
> specialized for shmem.
> 
> I'm going to reflect this in the kernel naming by changing LARGE_ANON_FOLIO to
> ANON_LARGE_FOLIO, that way it makes "LARGE_FOLIO" grepable.
> 
> I'm also reflecting this in the sysfs controls. I'll create a directory
> '/sys/kernel/mm/large_folio' as the root. Within that there are 2 main options:
> 
> - Put shared controls directly in this directory. Add a sub-directory 'anon' for
>   anon-specific controls (and in future 'shmem'...)
> - Put all controls in the root directory and prefix the filename for
>   anon-specific controls with 'anon' (e.g. anon_enabled).
> 
> Given I don't think there will be many anon-specific controls (1 for now), and
> THP already uses the latter scheme, I'm proposing to go with the latter.
> 
> 
> Controls
> --------
> 
> Modified proposal, after discussion yesterday:
> 
> - boot_param: anon_large_folio
>     - =always|never|madvise
>     - sets boot-up default for large_folio/anon_enabled
> - sysfs: /sys/kernel/mm/large_folio/anon_enabled
>     - =always|never|madvise
> - sysfs: /sys/kernel/mm/large_folio/defrag
>     - =always|defer|defer+madvise|madvise|never
>     - Anticipate would be shared between anon and shmem if shmem added
>         - this is already true for THP
>     - Kirill suggested to drop and hardcode to "never" (GFP_TRANSHUGE_LIGHT)
>         - Yu previously commented GFP_TRANSHUGE_LIGHT isn't always ideal [1]
>         - So current series is hooking THP's defrag setting
>         - Given we want to separate THP and LAF, I'm proposing to keep it
> - debugfs: /sys/kernel/debug/mm/large_folio/anon_orders
>     - Comma-separated, descending list of orders to try
>     - Default: arch_wants_pte_order(),PAGE_ALLOC_COSTLY_ORDER
>     - 0 always implicitly appended to end
>     - Max allowed is PMD_ORDER-1
>     - intended for developers to experiment
>     - debugfs means we can change/remove it or promote it to sysfs later
> - MADV_NOHUGEPAGE is honored; LAF disabled for these VMAs
>     - Required for correctness of existing use cases (live migration post copy)
> - New MADV_LARGEFOLIO madvise opcode
>     - Like MADV_HUGEPAGE but for large folio
> 
> Optional:
> 
> DavidR suggested adding ability to set a VMA-specific LAF order, using
> process_madvise():
>     - Optionally accept LAF order through flags param of 
>       process_madvise(MADV_LARGEFOLIO)
>     - When no LAF order passed, (or called with madvise()) use global LAF order
> 
> Personally, I would prefer to avoid vma-specific LAF order for an initial
> submission and instead defer the addition until a clear need is identified.
> Thoughts?
> 
> 
> Stats
> -----
> 
> meminfo:AnonHugePages, smaps:AnonHugePages and memory.stat:anon_thp will
> continue to account THP only.
> 
> I plan to add meminfo:AnonLargeFolio, smaps:AnonLargeFolio and
> memory.stat:anon_large_folio to account LAFs.
> 
> Do I need to add counters to vmstat also? (e.g. large_folio_fault_alloc,
> large_folio_fault_fallback, etc) - would need to think about which counters and
> what they mean if so.
> 
> 
> Thanks,
> Ryan
> 
> 
> [1] https://lore.kernel.org/linux-mm/CAOUHufYWtsAU4PvKpVhzJUeQb9cd+BifY9KzgceBXHp2F2dDRg@xxxxxxxxxxxxxx/




