Re: [PATCH 0/3] mm,TPP: Enable promotion of unmapped pagecache

Gregory Price <gourry@xxxxxxxxxx> · Mon, 4 Nov 2024 13:12:57 -0500

On Mon, Sep 02, 2024 at 02:53:26PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@xxxxxxxxxx> writes:
> 
> > On Mon, Aug 19, 2024 at 03:46:00PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@xxxxxxxxxx> writes:
> >> 
> >> > Unmapped pagecache pages can be demoted to low-tier memory, but 
> >> > they can only be promoted if a process maps the pages into the
> >> > memory space (so that NUMA hint faults can be caught).  This can
> >> > cause significant performance degradation as the pagecache ages
> >> > and unmapped, cached files are accessed.
> >> >
> >> > This patch series enables the pagecache to request a promotion of
> >> > a folio when it is accessed via the pagecache.
> >> >
> >> > We add a new `numa_hint_page_cache` counter in vmstat to capture
> >> > information on when these migrations occur.
> >> 
> >> It appears that you will promote page cache page on the second access.
> >> Do you have some better way to identify hot pages from the not-so-hot
> >> pages?  How to balance between unmapped and mapped pages?  We have hot
> >> page selection for hot pages.
> >> 
> >> [snip]
> >> 
> >
> > I've since explored moving this down under a (referenced && active) check.
> >
> > This would be more like promotion on third access within an LRU shrink
> > round (the LRU should, in theory, hack off the active bits on some decent
> > time interval when the system is pressured).
> >
> > Barring adding new counters to folios to track hits, I don't see a clear
> > and obvious way to track hotness.  The primary observation here is
> > that pagecache is un-mapped, and so cannot use numa-fault hints.
> >
> > This is more complicated with MGLRU, but I'm saving that for after I
> > figure out the plan for plain old LRU.
> 
> Several years ago, we have tried to use the access time tracking
> mechanism of NUMA balancing to track the access time latency of unmapped
> file cache folios.  The original implementation is as follows,
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
> 
> What do you think about this?
> 

Coming back around to explore this topic a bit more: I dug into this old
patch and the LRU patch by Keith, and I'm struggling to find a good option
that doesn't over-complicate things or propose something contentious.

I browsed through lore and did not see any discussion on this patch
or on Keith's LRU patch, so I presume discussion on this happened largely
off-list.  If you have any context as to why this wasn't RFC'd officially,
I would like more information.

My observations across these 3 proposals:

- The page-lock state is complex while trying to interpose in mark_folio_accessed,
  meaning inline promotion inside that interface is a non-starter.

  We found one deadlock during task exit due to the PTL being held. 

  This worries me more generally, but we did find some success changing certain
  calls to mark_folio_accessed to mark_folio_accessed_and_promote - rather than
  modifying mark_folio_accessed. This ends up changing code in similar places
  to your hook - but catches more of the conditions that mark a page accessed.

- For Keith's proposal, promotion via LRU requires memory pressure on the lower
  tier to cause a shrink and therefore promotions. I'm not well versed in LRU
  semantics, but it seems we could try proactive reclaim here.

  Doing promote-reclaim and demote/swap/evict reclaim on the same triggers
  seems counter-intuitive.

- Doing promotions inline with access creates overhead.  I've seen some research
  suggesting 60us+ per migration - so aggressiveness could harm performance.

  Doing it async would alleviate inline access overheads - but it could also make
  promotion pointless if time-to-promote is too far from the liveness of the pages.

- Doing async-promotion may also require something like PG_PROMOTABLE (as proposed
  by Keith's patch), which will obviously be a very contentious topic.
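
To make the mark_folio_accessed_and_promote idea above a bit more concrete,
here is a toy userspace model, not kernel code.  All toy_* names are made up
for illustration, and the marking logic is only my simplified reading of how
the referenced/active bits advance on access.  It assumes the promotion check
runs after the marking and uses the (referenced && active) condition mentioned
upthread, which is what makes it fire on the third access.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the LRU bits on a folio; not the kernel struct. */
struct toy_folio {
	bool referenced;
	bool active;
	bool on_lower_tier;	/* assumption: folio currently sits on a slow node */
};

/*
 * Simplified model of access marking: an access with referenced clear sets
 * it; an access with referenced set on an inactive folio activates the folio
 * and clears referenced.
 */
static void toy_mark_accessed(struct toy_folio *f)
{
	if (!f->referenced) {
		f->referenced = true;
	} else if (!f->active) {
		f->active = true;
		f->referenced = false;
	}
}

/*
 * Hypothetical wrapper: mark the folio accessed, then report whether the
 * caller should request a promotion.  The actual migration is left to the
 * caller so it can happen outside of whatever locks are held here.
 */
static bool toy_mark_accessed_and_promote(struct toy_folio *f)
{
	toy_mark_accessed(f);
	return f->on_lower_tier && f->referenced && f->active;
}
```

Starting from a folio with both bits clear, the first two calls return false
and the third returns true, assuming nothing clears the bits in between.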

tl;dr: I'm leaning towards a solution like you have here, but we may need to
make a sysfs switch similar to demotion_enabled in case of poor performance due
to heuristically degenerate access patterns, and we may need to expose some
form of adjustable aggressiveness value to make it tunable.
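
As a rough sketch of what I mean by a switch plus an aggressiveness value,
here is another toy userspace model.  The struct, the field names and the
per-window budget are all hypothetical; nothing like this exists today, and a
real version would presumably live behind sysfs next to demotion_enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical control block; every field name here is an assumption. */
struct toy_promo_ctl {
	bool enabled;			/* on/off switch, like demotion_enabled */
	uint32_t max_per_window;	/* aggressiveness: promotions per window */
	uint32_t used;			/* promotions charged in this window */
};

/* Charge one promotion against the budget; false means skip the promotion. */
static bool toy_promo_try_charge(struct toy_promo_ctl *c)
{
	if (!c->enabled || c->used >= c->max_per_window)
		return false;
	c->used++;
	return true;
}

/* Called when a new accounting window starts (e.g. from a timer). */
static void toy_promo_new_window(struct toy_promo_ctl *c)
{
	c->used = 0;
}

/* Upper bound on inline cost per window for a fixed per-migration cost. */
static uint64_t toy_promo_worst_case_us(const struct toy_promo_ctl *c,
					uint64_t us_per_migration)
{
	return c->enabled ? (uint64_t)c->max_per_window * us_per_migration : 0;
}
```

With the ~60us per migration figure from above, a budget of 1000 promotions
per window would bound the inline cost at roughly 60ms per window, which gives
at least a crude way to reason about how aggressive is too aggressive.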

Reading more into the code surrounding this and other migration logic, I also
think we should explore an optimization to mempolicy that tries to aggressively
keep certain classes of memory on the local node (RX memory and stack for example).

Other areas of reclaim try to actively prevent demoting this type of memory, so we
should try not to allocate it there in the first place.

~Gregory

> --
> Best Regards,
> Huang, Ying