Hi, Gregory,

Gregory Price <gourry@xxxxxxxxxx> writes:

> On Mon, Sep 02, 2024 at 02:53:26PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@xxxxxxxxxx> writes:
>>
>> > On Mon, Aug 19, 2024 at 03:46:00PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry@xxxxxxxxxx> writes:
>> >>
>> >> > Unmapped pagecache pages can be demoted to low-tier memory, but
>> >> > they can only be promoted if a process maps the pages into the
>> >> > memory space (so that NUMA hint faults can be caught). This can
>> >> > cause significant performance degradation as the pagecache ages
>> >> > and unmapped, cached files are accessed.
>> >> >
>> >> > This patch series enables the pagecache to request a promotion of
>> >> > a folio when it is accessed via the pagecache.
>> >> >
>> >> > We add a new `numa_hint_page_cache` counter in vmstat to capture
>> >> > information on when these migrations occur.
>> >>
>> >> It appears that you will promote page cache page on the second access.
>> >> Do you have some better way to identify hot pages from the not-so-hot
>> >> pages? How to balance between unmapped and mapped pages? We have hot
>> >> page selection for hot pages.
>> >>
>> >> [snip]
>> >>
>> >
>> > I've since explored moving this down under a (referenced && active) check.
>> >
>> > This would be more like promotion on third access within an LRU shrink
>> > round (the LRU should, in theory, hack off the active bits on some decent
>> > time interval when the system is pressured).
>> >
>> > Barring adding new counters to folios to track hits, I don't see a clear
>> > and obvious way to track hotness. The primary observation here is
>> > that pagecache is un-mapped, and so cannot use numa-fault hints.
>> >
>> > This is more complicated with MGLRU, but I'm saving that for after I
>> > figure out the plan for plain old LRU.
>>
>> Several years ago, we tried to use the access time tracking
>> mechanism of NUMA balancing to track the access time latency of unmapped
>> file cache folios. The original implementation is as follows,
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>>
>> What do you think about this?
>>
>
> Coming back around to explore this topic a bit more, dug into this old
> patch and the LRU patch by Keith - I'm struggling to find a good option
> that doesn't over-complicate or propose something contentious.
>
>
> I did a browse through lore and did not see any discussion on this patch
> or on Keith's LRU patch, so I presume discussion on this happened largely
> off-list. So if you have any context as to why this wasn't RFC'd officially
> I would like more information.

Thanks for doing this. There's not much discussion offline. We just
don't have enough time to work on the solution.

> My observations between these 3 proposals:
>
> - The page-lock state is complex while trying to interpose in mark_folio_accessed,
>   meaning inline promotion inside that interface is a non-starter.
>
>   We found one deadlock during task exit due to the PTL being held.
>
>   This worries me more generally, but we did find some success changing certain
>   calls to mark_folio_accessed to mark_folio_accessed_and_promote - rather than
>   modifying mark_folio_accessed. This ends up changing code in similar places
>   to your hook - but catches more conditions that mark a page accessed.
>
> - For Keith's proposal, promotions via LRU require memory pressure on the lower
>   tier to cause a shrink and therefore promotions. I'm not well versed in
>   LRU semantics, but it seems we could try proactive reclaim here.
>
>   Doing promote-reclaim and demote/swap/evict reclaim on the same triggers
>   seems counter-intuitive.

IIUC, in TPP paper (https://arxiv.org/abs/2206.02878), a similar method
is proposed for page promoting.
I guess that it works together with proactive reclaiming.

> - Doing promotions inline with access creates overhead. I've seen some research
>   suggesting 60us+ per migration - so aggressiveness could harm performance.
>
>   Doing it async would alleviate inline access overheads - but it could also make
>   promotion pointless if time-to-promote is too far from liveliness of the pages.

Async promotion needs to deal with the resource (CPU/memory) charging
too. You do some work for a task, so you need to charge the consumed
resource to the task.

> - Doing async-promotion may also require something like PG_PROMOTABLE (as proposed
>   by Keith's patch), which will obviously be a very contentious topic.

Some additional data structure can be used to record pages.

> tl;dr: I'm leaning towards a solution like you have here, but we may need to
> make a sysfs switch similar to demotion_enabled in case of poor performance due
> to heuristically degenerate access patterns, and we may need to expose some
> form of adjustable aggressiveness value to make it tunable.

Yes. We may need that, because the performance benefit may be lower
than the overhead introduced.

> Reading more into the code surrounding this and other migration logic, I also
> think we should explore an optimization to mempolicy that tries to aggressively
> keep certain classes of memory on the local node (RX memory and stack
> for example).
>
> Other areas of reclaim try to actively prevent demoting this type of memory, so we
> should try not to allocate it there in the first place.

We already use a DRAM-first allocation policy. So, we need to
measure its effect first.

--
Best Regards,
Huang, Ying