On Tue, Nov 28, 2023 at 10:32:50AM +0100, Michal Hocko wrote:
> On Mon 27-11-23 19:16:37, Dmitry Rokosov wrote:
> > On Mon, Nov 27, 2023 at 01:50:22PM +0100, Michal Hocko wrote:
> > > On Mon 27-11-23 14:36:44, Dmitry Rokosov wrote:
> > > > On Mon, Nov 27, 2023 at 10:33:49AM +0100, Michal Hocko wrote:
> > > > > On Thu 23-11-23 22:39:37, Dmitry Rokosov wrote:
> > > > > > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > > > > > Currently, it is not possible to trace this point from non-direct
> > > > > > reclaim paths. However, direct reclaim has its own tracepoint, so there
> > > > > > is no issue there. In certain cases, when debugging memcg pressure,
> > > > > > developers may need to identify all potential requests for memcg
> > > > > > reclamation including kswapd(). The patchset introduces the tracepoints
> > > > > > mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
> > > > > >
> > > > > > Example of output in the kswapd context (non-direct reclaim):
> > > > > > kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
> > > > >
> > > > > In the previous version I have asked why do we need this specific
> > > > > tracepoint when we already do have trace_mm_vmscan_lru_shrink_{in}active
> > > > > which already give you a very good insight. That includes the number of
> > > > > reclaimed pages but also more. I do see that we do not include memcg id
> > > > > of the reclaimed LRU, but that shouldn't be a big problem to add, no?
> > > >
> > > > From my point of view, memcg reclaim includes two points: LRU shrink and
> > > > slab shrink, as mentioned in the vmscan.c file.
> > > >
> > > > static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > > > ...
> > > > 		reclaimed = sc->nr_reclaimed;
> > > > 		scanned = sc->nr_scanned;
> > > >
> > > > 		shrink_lruvec(lruvec, sc);
> > > >
> > > > 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> > > > 			    sc->priority);
> > > > ...
> > > >
> > > > So, both of these operations are important for understanding whether
> > > > memcg reclaiming was successful or not, as well as its effectiveness. I
> > > > believe it would be beneficial to summarize them, which is why I have
> > > > created new tracepoints.
> > >
> > > This sounds like nice to have rather than must. Put it differently. If
> > > you make existing reclaim trace points memcg aware (print memcg id) then
> > > what prevents you from making analysis you need?
> >
> > You are right, nothing prevents me from making this analysis... but...
> >
> > This approach does have some disadvantages:
> > 1) It requires more changes to vmscan.
> > At the very least, the memcg
> > object should be forwarded to all subfunctions for LRU and SLAB
> > shrinkers.
>
> We should have lruvec or memcg available. lruvec_memcg() could be used
> to get memcg from the lruvec. It might be more places to add the id but
> arguably this would improve them to identify where the memory has been
> scanned/reclaimed from.
>

Oh, thank you, I hadn't seen this conversion function before...

> > 2) With this approach, we will not have the ability to trace a situation
> > where the kernel is requesting reclaim for a specific memcg, but due to
> > limits issues, we are unable to run it.
>
> I do not follow. Could you be more specific please?
>

I'm referring to a situation where kswapd() or other kernel mm code
requests reclaim of some pages from a memcg, but the memcg is skipped
because of the protection (limit) checks. This occurs in the
shrink_node_memcgs() function.

===
		mem_cgroup_calculate_protection(target_memcg, memcg);

		if (mem_cgroup_below_min(target_memcg, memcg)) {
			/*
			 * Hard protection.
			 * If there is no reclaimable memory, OOM.
			 */
			continue;
		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
			/*
			 * Soft protection.
			 * Respect the protection only as long as
			 * there is an unprotected supply
			 * of reclaimable memory from other cgroups.
			 */
			if (!sc->memcg_low_reclaim) {
				sc->memcg_low_skipped = 1;
				continue;
			}
			memcg_memory_event(memcg, MEMCG_LOW);
		}
===

With separate shrink begin()/end() tracepoints we can detect such a
problem.

> > 3) LRU and SLAB shrinkers are too common places to handle memcg-related
> > tasks. Additionally, memcg can be disabled in the kernel configuration.
>
> Right. This could be all hidden in the tracing code. You simply do not
> print memcg id when the controller is disabled. Or just simply print 0.
> I do not really see any major problems with that.
>
> I would really prefer to focus on that direction rather than adding
> another begin/end tracepoint which overlaps with existing begin/end
> traces and provides much more limited information because I would bet we
> will have somebody complaining that mere nr_reclaimed is not sufficient.

Okay, I will try to prepare a new patch version with memcg printing from
lruvec and slab tracepoints.

Then Andrew should drop the previous patchsets, I suppose. Please advise
on the correct workflow steps here.

-- 
Thank you,
Dmitry