Hi, Gregory,

Thanks for working on this!

Gregory Price <gourry@xxxxxxxxxx> writes:

> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
>   1) The page is fully swapped out and re-faulted
>   2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
> Patches 1-3
>   allow NULL as valid input to migration prep interfaces
>   for vmf/vma - which are not present for unmapped folios.
> Patch 4
>   adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
>   adds the promotion mechanism, along with a sysfs
>   extension which defaults the behavior to off.
>   /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> Functional test showed that we are able to reclaim some performance
> in canned scenarios (a file gets demoted and becomes hot with
> relatively little contention).  See test/overhead section below.
>
> v2
> - cleanup first commit to be accurate and take Ying's feedback
> - cleanup NUMA_HINT_ define usage
> - add NUMA_HINT_ type selection macro to keep code clean
> - mild comment updates
>
> Open Questions:
> ======
> 1) Should we also add a limit to how much can be forced onto
>    a single task's promotion list at any one time?  This might
>    piggy-back on the existing TPP promotion limit (256MB?) and
>    would simply add something like task->promo_count.
>
>    Technically we are limited by the batch read-rate before a
>    TASK_RESUME occurs.
>
> 2) Should we exempt certain forms of folios, or add additional
>    knobs/levers to deal with things like large folios?
>
> 3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
>    so we could validate the behavior works as intended.  Should
>    we just call this a NUMA_HINT_FAULT and not add a new hint?
>
> 4) Benchmark suggestions that can pressure 1TB memory.  This is
>    not my typical wheelhouse, so if folks know of a useful
>    benchmark that can pressure my 1TB (768 DRAM / 256 CXL) setup,
>    I'd like to add additional measurements here.
>
> Development Notes
> =================
>
> During development, we explored the following proposals:
>
> 1) directly promoting within folio_mark_accessed (FMA)
>    Originally suggested by Johannes Weiner
>    https://lore.kernel.org/all/20240803094715.23900-1-gourry@xxxxxxxxxx/
>
>    This caused deadlocks due to the fact that the PTL was held in a
>    variety of cases - in particular during task exit.  It is also
>    incredibly inflexible and causes promotion-on-fault.  It was
>    discussed that a deferral mechanism was preferred.
>
> 2) promoting in filemap.c locations (calls of FMA)
>    Originally proposed by Feng Tang and Ying Huang
>    https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>
>    First, we saw this as less problematic than directly hooking FMA,
>    but we realized this has the potential to miss data in a variety
>    of locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.
>
>    Second, we discovered that the lock state of pages is very subtle,
>    and that these locations in filemap.c can be called in an atomic
>    context.  Prototypes led to a variety of stalls and lockups.
>
> 3) a new LRU - originally proposed by Keith Busch
>    https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
>
>    There are two issues with this approach: PG_promotable and reclaim.
>
>    First - PG_promotable has generally been discouraged.
>
>    Second - attaching this mechanism to an LRU is both backwards and
>    counter-intuitive.  A promotable list is better served by a MOST
>    recently used list, and since LRUs are generally only shrunk when
>    exposed to pressure, it would require implementing a new promotion
>    list shrinker that runs separately from the existing reclaim logic.
>
> 4) Adding a separate kthread - suggested by many
>
>    This is - to an extent - a more general version of the LRU proposal.
>    We still have to track the folios - which likely requires the
>    addition of a page flag.  Additionally, this method would contend
>    pretty heavily with LRU behavior - i.e. we'd want to throttle
>    addition to the promotion candidate list in some scenarios.
>
> 5) Doing it in task work
>
>    This seemed to be the most realistic after considering the above.
>
>    We observe the following:
>    - FMA is an ideal hook for this, and isolation is safe here
>    - the new promotion_candidate function is an ideal hook for new
>      filter logic (throttling, fairness, etc.)
>    - isolated folios are either promoted or put back on task resume,
>      so there are no additional concurrency mechanics to worry about
>    - the mechanism can be made optional via a sysfs hook to avoid
>      overhead in degenerate scenarios (thrashing)
>
>    We also piggy-backed on the numa_hint_fault_latency timestamp to
>    further throttle promotions, which helps avoid promoting pages
>    that are only accessed once or twice.
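
Just to make sure that I understand the task-work flow correctly (and
for other reviewers), the deferred path looks roughly like the sketch
below.  promotion_candidate is your name from above; the task_struct
fields (promo_list, promo_work) and the pagecache_promotion_enabled
variable are just my guesses, not necessarily what the patches use.

        /* Called from the folio_mark_accessed() path. */
        static void promotion_candidate(struct folio *folio)
        {
                struct task_struct *p = current;
                int nid = numa_node_id();  /* top-tier node of this CPU */

                /* Only slow-tier folios, and only if the knob is on. */
                if (!pagecache_promotion_enabled ||
                    node_is_toptier(folio_nid(folio)))
                        return;

                /* Isolate now; patch 1 allows a NULL VMA here. */
                if (migrate_misplaced_folio_prepare(folio, NULL, nid))
                        return;

                /* Arm task work on the empty->non-empty transition.
                 * promo_work is assumed to have been initialized to
                 * promo_task_work() at fork time. */
                if (list_empty(&p->promo_list))
                        task_work_add(p, &p->promo_work, TWA_RESUME);
                list_add_tail(&folio->lru, &p->promo_list);
        }

        /* Runs via task work when the task resumes to user space. */
        static void promo_task_work(struct callback_head *head)
        {
                struct task_struct *p = current;
                struct folio *folio, *tmp;
                int nid = numa_node_id();

                list_for_each_entry_safe(folio, tmp, &p->promo_list, lru) {
                        list_del_init(&folio->lru);
                        /* Promotes, or puts the folio back on failure. */
                        migrate_misplaced_folio(folio, NULL, nid);
                }
        }

If that matches the implementation, then the concurrency story is
indeed simple: the list is strictly per-task and is only drained by
the owning task.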
>
> Test:
> ======
>
> Environment:
> 1.5-3.7GHz CPU, ~4000 BogoMIPS,
> 1TB machine with 768GB DRAM and 256GB CXL
> A 64GB file being linearly read by 6-7 Python processes
>
> Goal:
> Generate promotions.  Demonstrate stability and measure overhead.
>
> System Settings:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
> echo 2 > /proc/sys/kernel/numa_balancing
>
> Each process took up ~128GB, with anonymous memory growing and
> shrinking as python filled and released buffers with the 64GB data.
> This causes DRAM pressure to generate demotions, and file pages to
> "become hot" - and therefore be selected for promotion.
>
> First we ran with promotion disabled to show the consistent overhead
> of forcing a file out to CXL memory.  We first ran a single reader to
> see uncontended performance, launched many readers to force demotions,
> then dropped back to a single reader to observe.
>
> Single-reader DRAM:                  ~16.0-16.4s
> Single-reader CXL (after demotion):  ~16.8-17s

The difference is trivial.  This makes me wonder whether we need this
patchset at all.

> Next we turned promotion on with only a single reader running.
>
> Before promotions:
> Node 0 MemFree:      636478112 kB
> Node 0 FilePages:     59009156 kB
> Node 1 MemFree:      250336004 kB
> Node 1 FilePages:     14979628 kB

Why are there so many file pages on node 1 even though there is a lot
of free memory on node 0?  Did you move some file pages from node 0 to
node 1?

> After promotions:
> Node 0 MemFree:      632267268 kB
> Node 0 FilePages:     72204968 kB
> Node 1 MemFree:      262567056 kB
> Node 1 FilePages:      2918768 kB
>
> Single-reader (after promotion):     ~16.5s
>
> Turning the promotion mechanism on when nothing had been demoted
> produced no appreciable overhead (memory allocation noise overpowers
> it).
>
> Read time did not change after turning promotion off once promotion
> had occurred, which implies that the additional overhead is not coming
> from the promotion system itself - but likely from other pages still
> trapped on the low tier.
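
Beyond that sysctl, your open question 1 (a per-task cap) could serve
as a simple first-level throttle in promotion_candidate() too.  A rough
sketch, with made-up names (promo_count, PROMO_CAP_PAGES), mirroring
the ~256MB limit you mention:

        /* Cap the pages one task may queue between two drains. */
        #define PROMO_CAP_PAGES         (SZ_256M >> PAGE_SHIFT)

        static bool promo_throttled(struct task_struct *p,
                                    struct folio *folio)
        {
                long nr = folio_nr_pages(folio);

                if (p->promo_count + nr > PROMO_CAP_PAGES)
                        return true;    /* skip, leave folio in place */

                p->promo_count += nr;
                return false;
        }

promotion_candidate() would check this before isolating a folio, and
the task-work drain would reset promo_count to 0 when it finishes.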
> Either way, this at least demonstrates the mechanism is not
> particularly harmful when there are no pages to promote - and the
> mechanism is valuable when a file actually is quite hot.
>
> Notably, it takes some time for the average read loop to come back
> down, and some unpromoted file pages still remain trapped in the page
> cache.  This isn't entirely unexpected: many files may have been
> demoted, and they may not be very hot.
>
> Overhead
> ======
>
> When promotion was turned on, we saw a temporary increase in the loop
> runtime:
>
> before: 16.8s
> during:
>     17.606216192245483
>     17.375206470489502
>     17.722095489501953
>     18.230552434921265
>     18.20712447166443
>     18.008254528045654
>     17.008427381515503
>     16.851454257965088
>     16.715774059295654
> stable: ~16.5s
>
> We measured overhead with a separate patch that simply measured the
> rdtsc value before/after calls in promotion_candidate and task work.
>
> e.g.:
> +       start = rdtsc();
>         list_for_each_entry_safe(folio, tmp, promo_list, lru) {
>                 list_del_init(&folio->lru);
>                 migrate_misplaced_folio(folio, NULL, nid);
> +               count++;
>         }
> +       atomic_long_add(rdtsc() - start, &promo_time);
> +       atomic_long_add(count, &promo_count);
>
>                                  avg cycles   total cycles      calls
> numa_migrate_prep:                       93     3969867917   42576860
> migrate_misplaced_folio_prepare:        491     3433174319    6985523
> migrate_misplaced_folio:               1635    11426529980    6985523
>
> Thoughts on a good throttling heuristic would be appreciated here.

We do have a throttle mechanism already.  For example, you can use

  $ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps

to limit the promotion throughput to below 100 MB/s for each DRAM node.

> Suggested-by: Huang Ying <ying.huang@xxxxxxxxxxxxxxxxx>
> Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Suggested-by: Keith Busch <kbusch@xxxxxxxx>
> Suggested-by: Feng Tang <feng.tang@xxxxxxxxx>
> Signed-off-by: Gregory Price <gourry@xxxxxxxxxx>
>
> Gregory Price (5):
>   migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
>   memory: move conditionally defined enums use inside ifdef tags
>   memory: allow non-fault migration in numa_migrate_check path
>   vmstat: add page-cache numa hints
>   migrate,sysfs: add pagecache promotion
>
>  .../ABI/testing/sysfs-kernel-mm-numa  | 20 ++++++
>  include/linux/memory-tiers.h          |  2 +
>  include/linux/migrate.h               |  2 +
>  include/linux/sched.h                 |  3 +
>  include/linux/sched/numa_balancing.h  |  5 ++
>  include/linux/vm_event_item.h         |  8 +++
>  init/init_task.c                      |  1 +
>  kernel/sched/fair.c                   | 26 +++++++-
>  mm/memory-tiers.c                     | 27 ++++++++
>  mm/memory.c                           | 32 +++++-----
>  mm/mempolicy.c                        | 25 +++++---
>  mm/migrate.c                          | 61 ++++++++++++++++++-
>  mm/swap.c                             |  3 +
>  mm/vmstat.c                           |  2 +
>  14 files changed, 193 insertions(+), 24 deletions(-)

---
Best Regards,
Huang, Ying