On Thu, Apr 6, 2023 at 10:50 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 06.04.23 16:07, Yosry Ahmed wrote: > > Thanks for taking a look, David! > > > > On Thu, Apr 6, 2023 at 3:31 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > >> > >> On 05.04.23 20:54, Yosry Ahmed wrote: > >>> We keep track of different types of reclaimed pages through > >>> reclaim_state->reclaimed_slab, and we add them to the reported number > >>> of reclaimed pages. For non-memcg reclaim, this makes sense. For memcg > >>> reclaim, we have no clue if those pages are charged to the memcg under > >>> reclaim. > >>> > >>> Slab pages are shared by different memcgs, so a freed slab page may have > >>> only been partially charged to the memcg under reclaim. The same goes for > >>> clean file pages from pruned inodes (on highmem systems) or xfs buffer > >>> pages, there is no simple way to currently link them to the memcg under > >>> reclaim. > >>> > >>> Stop reporting those freed pages as reclaimed pages during memcg reclaim. > >>> This should make the return value of writing to memory.reclaim, and may > >>> help reduce unnecessary reclaim retries during memcg charging. Writing to > >>> memory.reclaim on the root memcg is considered as cgroup_reclaim(), but > >>> for this case we want to include any freed pages, so use the > >>> global_reclaim() check instead of !cgroup_reclaim(). > >>> > >>> Generally, this should make the return value of > >>> try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g. > >>> freed a slab page that was mostly charged to the memcg under reclaim), > >>> the return value of try_to_free_mem_cgroup_pages() can be underestimated, > >>> but this should be fine. The freed pages will be uncharged anyway, and we > >> > >> Can't we end up in extreme situations where > >> try_to_free_mem_cgroup_pages() returns close to 0 although a huge amount > >> of memory for that cgroup was freed up. > >> > >> Can you extend on why "this should be fine" ? > >> > >> I suspect that overestimation might be worse than underestimation. (see > >> my comment proposal below) > > > > In such extreme scenarios even though try_to_free_mem_cgroup_pages() > > would return an underestimated value, the freed memory for the cgroup > > will be uncharged. try_charge() (and most callers of > > try_to_free_mem_cgroup_pages()) do so in a retry loop, so even if > > try_to_free_mem_cgroup_pages() returns an underestimated value > > charging will succeed the next time around. > > > > The only case where this might be a problem is if it happens in the > > final retry, but I guess we need to be *really* unlucky for this > > extreme scenario to happen. One could argue that if we reach such a > > situation the cgroup will probably OOM soon anyway. > > > >> > >>> can charge the memcg the next time around as we usually do memcg reclaim > >>> in a retry loop. > >>> > >>> The next patch performs some cleanups around reclaim_state and adds an > >>> elaborate comment explaining this to the code. This patch is kept > >>> minimal for easy backporting. > >>> > >>> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > >>> Cc: stable@xxxxxxxxxxxxxxx > >> > >> Fixes: ? > >> > >> Otherwise it's hard to judge how far to backport this. > > > > It's hard to judge. The issue has been there for a while, but > > memory.reclaim just made it more user visible. I think we can > > attribute it to per-object slab accounting, because before that any > > freed slab pages in cgroup reclaim would be entirely charged to that > > cgroup. > > > > Although in all fairness, other types of freed pages that use > > reclaim_state->reclaimed_slab cannot be attributed to the cgroup under > > reclaim have been there before that. I guess slab is the most > > significant among them tho, so for the purposes of backporting I > > guess: > > > > Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects > > instead of pages") > > > >> > >>> --- > >>> > >>> global_reclaim(sc) does not exist in kernels before 6.3. It can be > >>> replaced with: > >>> !cgroup_reclaim(sc) || mem_cgroup_is_root(sc->target_mem_cgroup) > >>> > >>> --- > >>> mm/vmscan.c | 8 +++++--- > >>> 1 file changed, 5 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/mm/vmscan.c b/mm/vmscan.c > >>> index 9c1c5e8b24b8f..c82bd89f90364 100644 > >>> --- a/mm/vmscan.c > >>> +++ b/mm/vmscan.c > >>> @@ -5346,8 +5346,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc) > >>> vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned, > >>> sc->nr_reclaimed - reclaimed); > >>> > >>> - sc->nr_reclaimed += current->reclaim_state->reclaimed_slab; > >>> - current->reclaim_state->reclaimed_slab = 0; > >> > >> Worth adding a comment like > >> > >> /* > >> * Slab pages cannot universally be linked to a single memcg. So only > >> * account them as reclaimed during global reclaim. Note that we might > >> * underestimate the amount of memory reclaimed (but won't overestimate > >> * it). > >> */ > >> > >> but ... > >> > >>> + if (global_reclaim(sc)) { > >>> + sc->nr_reclaimed += current->reclaim_state->reclaimed_slab; > >>> + current->reclaim_state->reclaimed_slab = 0; > >>> + } > >>> > >>> return success ? MEMCG_LRU_YOUNG : 0; > >>> } > >>> @@ -6472,7 +6474,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > >>> > >>> shrink_node_memcgs(pgdat, sc); > >>> > >> > >> ... do we want to factor the add+clear into a simple helper such that we > >> can have above comment there? > >> > >> static void cond_account_reclaimed_slab(reclaim_state, sc) > >> { > >> /* > >> * Slab pages cannot universally be linked to a single memcg. So > >> * only account them as reclaimed during global reclaim. Note > >> * that we might underestimate the amount of memory reclaimed > >> * (but won't overestimate it). > >> */ > >> if (global_reclaim(sc)) { > >> sc->nr_reclaimed += reclaim_state->reclaimed_slab; > >> reclaim_state->reclaimed_slab = 0; > >> } > >> } > >> > >> Yes, effective a couple LOC more, but still straight-forward for a > >> stable backport > > > > The next patch in the series performs some refactoring and cleanups, > > among which we add a helper called flush_reclaim_state() that does > > exactly that and contains a sizable comment. I left this outside of > > this patch in v5 to make the effective change as small as possible for > > backporting. Looks like it can be confusing tho without the comment. > > > > How about I pull this part to this patch as well for v6? > > As long as it's a helper similar to what I proposed, I think that makes > a lot of sense (and doesn't particularly bloat this patch). Sounds good to me, I will do that and respin. Thanks David! > > -- > Thanks, > > David / dhildenb > >