On Fri, Feb 3, 2023 at 4:26 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote: > > On Fri, Feb 03, 2023 at 07:28:49AM -0800, Yosry Ahmed wrote: > > On Fri, Feb 3, 2023 at 7:11 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > > > > > On Thu, Feb 02, 2023 at 04:17:18PM -0800, Yosry Ahmed wrote: > > > > On Thu, Feb 2, 2023 at 4:01 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > > Patch 1 is just refactoring updating reclaim_state into a helper > > > > > > function, and renames reclaimed_slab to just reclaimed, with a comment > > > > > > describing its true purpose. > > > > > > > > > > > > Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim. > > > > > > > > > > > > The original draft was a little bit different. It also kept track of > > > > > > uncharged objcg pages, and reported them only in memcg reclaim and only > > > > > > if the uncharged memcg is in the subtree of the memcg under reclaim. > > > > > > This was an attempt to make reporting of memcg reclaim even more > > > > > > accurate, but was dropped due to questionable complexity vs benefit > > > > > > tradeoff. It can be revived if there is interest. > > > > > > > > > > > > Yosry Ahmed (2): > > > > > > mm: vmscan: refactor updating reclaimed pages in reclaim_state > > > > > > mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim > > > > > > > > > > > > fs/inode.c | 3 +-- > > > > > > > > > > Inodes and inode mapping pages are directly charged to the memcg > > > > > that allocated them and the shrinker is correctly marked as > > > > > SHRINKER_MEMCG_AWARE. Freeing the pages attached to the inode will > > > > > account them correctly to the related memcg, regardless of which > > > > > memcg is triggering the reclaim. Hence I'm not sure that skipping > > > > > the accounting of the reclaimed memory is even correct in this case; > > > > > > > > Please note that we are not skipping any accounting here. The pages > > > > are still uncharged from the memcgs they are charged to (the allocator > > > > memcgs as you pointed out). We just do not report them in the return > > > > value of try_to_free_mem_cgroup_pages(), to avoid over-reporting. > > > > > > I was wondering the same thing as Dave, reading through this. But > > > you're right, we'll catch the accounting during uncharge. Can you > > > please add a comment on the !cgroup_reclaim() explaining this? > > > > Sure! If we settle on this implementation I will send another version > > with a comment and fix the build problem in patch 2. > > > > > > > > There is one wrinkle with this, though. We have the following > > > (simplified) sequence during charging: > > > > > > nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, > > > gfp_mask, reclaim_options); > > > > > > if (mem_cgroup_margin(mem_over_limit) >= nr_pages) > > > goto retry; > > > > > > /* > > > * Even though the limit is exceeded at this point, reclaim > > > * may have been able to free some pages. Retry the charge > > > * before killing the task. > > > * > > > * Only for regular pages, though: huge pages are rather > > > * unlikely to succeed so close to the limit, and we fall back > > > * to regular pages anyway in case of failure. > > > */ > > > if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER)) > > > goto retry; > > > > > > So in the unlikely scenario where the first call doesn't make the > > > necessary headroom, and the shrinkers are the only thing that made > > > forward progress, we would OOM prematurely. > > > > > > Not that an OOM would seem that far away in that scenario, anyway. But I > > > remember long discussions with DavidR on probabilistic OOM regressions ;) > > > > > > > Above the if (nr_reclaimed...) check we have: > > > > if (gfp_mask & __GFP_NORETRY) > > goto nomem; > > > > , and below it we have: > > > > if (nr_retries--) > > goto retry; > > > > So IIUC we only prematurely OOM if we either have __GFP_NORETRY and > > cannot reclaim any LRU pages in the first try, or if the scenario > > where only shrinkers were successful to reclaim happens in the last > > retry. Right? > > > > We don't call oom-killer for __GFP_NORETRY. Also note that the retry > (from nr_retries) after the reclaim includes page_counter_try_charge(). > So, even if try_to_free_mem_cgroup_pages() have returned 0 after > reclaiming the slab memory of the memcg, the page_counter_try_charge() > should succeed if the reclaimed slab objects have created enough margin. > > > > > > I think the code should still be accounting for all pages that > > > > > belong to the memcg being scanned that are reclaimed, not ignoring > > > > > them altogether... > > > > > > > > 100% agree. Ideally I would want to: > > > > - For pruned inodes: report all freed pages for global reclaim, and > > > > only report pages charged to the memcg under reclaim for memcg > > > > reclaim. > > > > > > This only happens on highmem systems at this point, as elsewhere > > > populated inodes aren't on the shrinker LRUs anymore. We'd probably be > > > ok with a comment noting the inaccuracy in the proactive reclaim stats > > > for the time being, until somebody actually cares about that combination. > > > > Interesting, I did not realize this. I guess in this case we may get > > away with just ignoring non-LRU reclaimed pages in memcg reclaim > > completely, or go an extra bit and report uncharged objcg pages in > > memcg reclaim. See below. > > > > > > > > > - For slab: report all freed pages for global reclaim, and only report > > > > uncharged objcg pages from the memcg under reclaim for memcg reclaim. > > > > > > > > The only problem is that I thought people would think this is too much > > > > complexity and not worth it. If people agree this should be the > > > > approach to follow, I can prepare patches for this. I originally > > > > implemented this for slab pages, but held off on sending it. > > > > > > I'd be curious to see the code! > > > > I think it is small enough to paste here. Basically instead of just > > ignoring reclaim_state->reclaimed completely in patch 2, I counted > > uncharged objcg pages only in memcg reclaim instead of freed slab > > pages, and ignored pruned inode pages in memcg reclaim. So I guess we > > can go with either: > > - Just ignore freed slab pages and pages from pruned inodes in memcg > > reclaim (current RFC). > > - Ignore pruned inodes in memcg reclaim (as you explain above), and > > use the following diff instead of patch 2 for slab. > > - Use the following diff for slab AND properly report freed pages from > > pruned inodes if they are relevant to the memcg under reclaim. > > > > Let me know what you think is best. > > > > I would prefer the currect RFC instead of the other two options. Those > options are slowing down (and adding complexity) to the uncharge code > path for the accuracy which no one really need or should care about. > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index bc1d8b326453..37f799901dfb 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -162,6 +162,7 @@ struct reclaim_state { > > }; > > > > void report_freed_pages(unsigned long pages); > > +bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg); > > > > #ifdef __KERNEL__ > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index ab457f0394ab..a886ace70648 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -3080,6 +3080,13 @@ static void obj_cgroup_uncharge_pages(struct > > obj_cgroup *objcg, > > memcg_account_kmem(memcg, -nr_pages); > > refill_stock(memcg, nr_pages); > > > > + /* > > + * If undergoing memcg reclaim, report uncharged pages and drain local > > + * stock to update the memcg usage. > > + */ > > + if (report_uncharged_pages(nr_pages, memcg)) > > + drain_local_stock(NULL); > > + > > css_put(&memcg->css); > > } > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 207998b16e5f..d4eced2b884b 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -204,17 +204,54 @@ static void set_task_reclaim_state(struct > > task_struct *task, > > task->reclaim_state = rs; > > } > > > > +static bool cgroup_reclaim(struct scan_control *sc); > > + > > /* > > * reclaim_report_freed_pages: report pages freed outside of LRU-based reclaim > > * @pages: number of pages freed > > * > > - * If the current process is undergoing a reclaim operation, > > + * If the current process is undergoing a non-cgroup reclaim operation, > > * increment the number of reclaimed pages by @pages. > > */ > > void report_freed_pages(unsigned long pages) > > { > > - if (current->reclaim_state) > > - current->reclaim_state->reclaimed += pages; > > + struct reclaim_state *rs = current->reclaim_state; > > + struct scan_control *sc; > > + > > + if (!rs) > > + return; > > + > > + sc = container_of(rs, struct scan_control, reclaim_state); > > + if (!cgroup_reclaim(sc)) > > + rs->reclaimed += pages; > > +} > > + > > +/* > > + * report_uncharged_pages: report pages uncharged outside of LRU-based reclaim > > + * @pages: number of pages uncharged > > + * @memcg: memcg pages were uncharged from > > + * > > + * If the current process is undergoing a cgroup reclaim operation, increment > > + * the number of reclaimed pages by @pages, if the memcg under > > reclaim is @memcg > > + * or an ancestor of it. > > + * > > + * Returns true if an update was made. > > + */ > > +bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg) > > +{ > > + struct reclaim_state *rs = current->reclaim_state; > > + struct scan_control *sc; > > + > > + if (!rs) > > + return false; > > + > > + sc = container_of(rs, struct scan_control, reclaim_state); > > + if (cgroup_reclaim(sc) && > > + mem_cgroup_is_descendant(memcg, sc->target_mem_cgroup)) { > > + rs->reclaimed += pages; > > + return true; > > + } > > + return false; > > } > > > > LIST_HEAD(shrinker_list); Any further thoughts on this, whether to refresh the current RFC with added comments (based on Johannes's feedback) and exporting report_freed_pages() (based on Matthew's feedback), or to send a new version with the code above that accurately counts objcg uncharged pages in memcg reclaim?