On Mon, Sep 02, 2024 at 02:53:26PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@xxxxxxxxxx> writes:
>
> > On Mon, Aug 19, 2024 at 03:46:00PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@xxxxxxxxxx> writes:
> >>
> >> > Unmapped pagecache pages can be demoted to low-tier memory, but
> >> > they can only be promoted if a process maps the pages into the
> >> > memory space (so that NUMA hint faults can be caught). This can
> >> > cause significant performance degradation as the pagecache ages
> >> > and unmapped, cached files are accessed.
> >> >
> >> > This patch series enables the pagecache to request a promotion of
> >> > a folio when it is accessed via the pagecache.
> >> >
> >> > We add a new `numa_hint_page_cache` counter in vmstat to capture
> >> > information on when these migrations occur.
> >>
> >> It appears that you will promote page cache page on the second access.
> >> Do you have some better way to identify hot pages from the not-so-hot
> >> pages? How to balance between unmapped and mapped pages? We have hot
> >> page selection for hot pages.
> >>
> >> [snip]
> >>
> >
> > I've since explored moving this down under a (referenced && active) check.
> >
> > This would be more like promotion on third access within an LRU shrink
> > round (the LRU should, in theory, hack off the active bits on some decent
> > time interval when the system is pressured).
> >
> > Barring adding new counters to folios to track hits, I don't see a clear
> > and obvious way way to track hotness. The primary observation here is
> > that pagecache is un-mapped, and so cannot use numa-fault hints.
> >
> > This is more complicated with MGLRU, but I'm saving that for after I
> > figure out the plan for plain old LRU.
>
> Several years ago, we have tried to use the access time tracking
> mechanism of NUMA balancing to track the access time latency of unmapped
> file cache folios.
> The original implementation is as follows,
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>
> What do you think about this?
>

Coming back around to explore this topic a bit more, I dug into this old
patch and the LRU patch by Keith - I'm struggling to find a good option
that doesn't over-complicate or propose something contentious.

I did a browse through lore and did not see any discussion on this patch
or on Keith's LRU patch, so I presume discussion on this happened largely
off-list. So if you have any context as to why this wasn't RFC'd
officially I would like more information.

My observations between these 3 proposals:

- The page-lock state is complex while trying to interpose in
  mark_folio_accessed, meaning inline promotion inside that interface
  is a non-starter. We found one deadlock during task exit due to the
  PTL being held.

  This worries me more generally, but we did find some success changing
  certain calls to mark_folio_accessed to
  mark_folio_accessed_and_promote, rather than modifying
  mark_folio_accessed itself. This ends up changing code in similar
  places to your hook, but catches more of the conditions that mark a
  page accessed.

- For Keith's proposal, promotion via LRU requires memory pressure on
  the lower tier to cause a shrink and therefore promotions. I'm not
  well versed in LRU semantics, but it seems we could try proactive
  reclaim here.

  Doing promote-reclaim and demote/swap/evict reclaim on the same
  triggers seems counter-intuitive.

- Doing promotions inline with access creates overhead. I've seen some
  research suggesting 60us+ per migration, so aggressiveness could
  harm performance.

  Doing it async would alleviate inline access overheads, but it could
  also make promotion pointless if time-to-promote is too far from the
  liveness of the pages.

- Doing async-promotion may also require something like PG_PROMOTABLE
  (as proposed by Keith's patch), which will obviously be a very
  contentious topic.

tl;dr: I'm leaning towards a solution like you have here, but we may
need to make a sysfs switch similar to demotion_enabled in case of poor
performance due to heuristically degenerate access patterns, and we may
need to expose some form of adjustable aggressiveness value to make it
tunable.

Reading more into the code surrounding this and other migration logic,
I also think we should explore an optimization to mempolicy that tries
to aggressively keep certain classes of memory on the local node (RX
memory and stack for example). Other areas of reclaim try to actively
prevent demoting this type of memory, so we should try not to allocate
it there in the first place.

~Gregory

> --
> Best Regards,
> Huang, Ying
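
For illustration only, the gating described in the tl;dr (an on/off
switch plus an aggressiveness knob) could be sketched in userspace C
roughly as below. Everything here is hypothetical: struct
promo_tunables, struct toy_folio, and toy_access_should_promote are
made-up names, not kernel interfaces, and the per-folio hit counter is
a stand-in for whatever hotness signal ends up being used (the thread
discusses reusing the referenced/active state precisely to avoid adding
such a counter).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tunables; neither knob exists in the kernel. */
struct promo_tunables {
	bool enabled;          /* on/off switch, in the spirit of demotion_enabled */
	unsigned int min_hits; /* accesses needed before promoting; higher = less aggressive */
};

/* Toy stand-in for folio state; real folios carry no such hit counter. */
struct toy_folio {
	bool on_lower_tier; /* currently resident on the slow tier */
	bool mapped;        /* mapped folios are left to NUMA hint faults */
	unsigned int hits;  /* assumption: simple access count */
};

/*
 * Record one pagecache access and report whether promotion should be
 * requested for this folio.
 */
static bool toy_access_should_promote(struct toy_folio *f,
				      const struct promo_tunables *t)
{
	assert(f && t);

	f->hits++;

	if (!t->enabled)
		return false;
	if (!f->on_lower_tier)
		return false;
	if (f->mapped)
		return false;

	return f->hits >= t->min_hits;
}
```

With min_hits set to 3 this mirrors the "promotion on third access"
behaviour mentioned earlier in the thread; raising it makes promotion
less aggressive, and clearing enabled turns it off entirely.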