On Fri, Oct 25, 2024 at 10:05 AM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
>
>
> On 10/25/24 12:40 AM, Yosry Ahmed wrote:
> > On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >> On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
> >>> On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
> >>>> Make use of the flush tracepoint within memcontrol.
> >>>>
> >>>> Signed-off-by: JP Kobryn <inwardvessel@xxxxxxxxx>
> >>> Is the intention to use tools like bpftrace to analyze where we flush
> >>> the most? In this case, why can't we just attach to the fentry of
> >>> do_flush_stats() and use the stack trace to find the path?
> >>>
> >>> We can also attach to mem_cgroup_flush_stats(), and the difference in
> >>> counts between the two will be the number of skipped flushes.
> >>>
> >> All these functions can get inlined and then we can not really attach
> >> easily. We can somehow find the offset in the inlined places and try to
> >> use kprobe but it is prohibitive when have to do for multiple kernels
> >> built with fdo/bolt.
> >>
> >> Please note that tracepoints are not really API, so we can remove them
> >> in future if we see no usage for them.
> > That's fair, but can we just add two tracepoints? This seems enough to
> > collect necessary data, and prevent proliferation of tracepoints and
> > the addition of the enum.
> >
> > I am thinking one in mem_cgroup_flush_stats() and one in
> > do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
> > trace_do_flush_stats(). Although the name of the latter is too
> > generic, maybe we should rename the function first to add mem_cgroup_*
> > or memcg_*.
> >
> > WDYT?
>
> Hmmm, I think if we did that we wouldn't get accurate info on when the
> flush was skipped. Comparing the number of hits between
> mem_cgroup_flush_stats() and do_flush_stats() to determine the number of
> skips doesn't seem reliable because of the places where do_flush_stats()
> is called outside of mem_cgroup_flush_stats(). There would be situations
> where a skip occurs, but meanwhile each call to do_flush_stats() outside
> of mem_cgroup_flush_stats() would effectively subtract that skip, making
> it appear that a skip did not occur.

You're underestimating the power of BPF, my friend :)

We can count the number of flushes in task local storages, in which
case we can get a very accurate representation because the counters
are per-task, so we know exactly when we skipped, but..

>
> Maybe as a middle ground we could remove the trace calls for the zswap
> and periodic cases, since no skips can occur there. We could then just
> leave one trace call in mem_cgroup_flush_stats() and instead of an enum
> we can pass a bool saying skipped or not. Something like this:
>
> mem_cgroup_flush_stats()
> {
> 	bool needs_flush = memcg_vmstats_needs_flush(...);
>
> 	trace_memcg_flush_stats(memcg, needs_flush);
>
> 	if (needs_flush)
> 		do_flush_stats(...);
> }
>
>
> Yosry/Shakeel, do you have any thoughts on whether we should keep the
> trace calls for obj_cgroup_may_zswap() and periodic workqueue cases?

..with that being said, I do like having a single tracepoint. I think
with some refactoring we can end up with a single tracepoint and more
data. We can even capture the cases where we force a flush but we
don't really need to flush. We can even add vmstats->stats_updates to
the tracepoint to know exactly how many updates we have when we flush.

What about the following:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c570..be0e7f52ad11a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -584,8 +584,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(struct mem_cgroup *memcg)
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
 {
+	bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
+
+	trace_memcg_flush_stats(memcg, needs_flush, force, ...);
+	if (!force && !needs_flush)
+		return;
+
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
@@ -609,8 +615,7 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
-	if (memcg_vmstats_needs_flush(memcg->vmstats))
-		do_flush_stats(memcg);
+	__mem_cgroup_flush_stats(memcg, false);
 }
 
 void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
@@ -626,7 +631,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
 	 * in latency-sensitive paths is as cheap as possible.
 	 */
-	do_flush_stats(root_mem_cgroup);
+	__mem_cgroup_flush_stats(root_mem_cgroup, true);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -5272,11 +5277,8 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		/*
-		 * mem_cgroup_flush_stats() ignores small changes. Use
-		 * do_flush_stats() directly to get accurate stats for charging.
-		 */
-		do_flush_stats(memcg);
+		/* Force a flush to get accurate stats for charging */
+		__mem_cgroup_flush_stats(memcg, true);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
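
For the trace_memcg_flush_stats() call above to build, a TRACE_EVENT
definition would also be needed somewhere, e.g. a new
include/trace/events/memcg.h. A minimal sketch of what that could look
like is below; the field set and the stats_updates argument (standing in
for the "..." at the call site) are only illustrative assumptions, not
part of this diff:

/*
 * Hypothetical include/trace/events/memcg.h -- sketch only. The event
 * name matches the call site above; the fields and their types are a
 * guess at what we would want to record.
 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM memcg

#if !defined(_TRACE_MEMCG_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MEMCG_H

#include <linux/memcontrol.h>
#include <linux/tracepoint.h>

TRACE_EVENT(memcg_flush_stats,

	/* argument order follows the call site in the diff above */
	TP_PROTO(struct mem_cgroup *memcg, bool needs_flush, bool force,
		 s64 stats_updates),

	TP_ARGS(memcg, needs_flush, force, stats_updates),

	TP_STRUCT__entry(
		__field(u64, id)
		__field(bool, needs_flush)
		__field(bool, force)
		__field(s64, stats_updates)
	),

	TP_fast_assign(
		/* identify the memcg by its cgroup inode number */
		__entry->id = cgroup_ino(memcg->css.cgroup);
		__entry->needs_flush = needs_flush;
		__entry->force = force;
		__entry->stats_updates = stats_updates;
	),

	TP_printk("memcg_id=%llu needs_flush=%d force=%d stats_updates=%lld",
		  (unsigned long long)__entry->id, __entry->needs_flush,
		  __entry->force, (long long)__entry->stats_updates)
);

#endif /* _TRACE_MEMCG_H */

/* This part must be outside the include guard */
#include <trace/define_trace.h>

With a single event like this, telling skipped flushes apart from forced
and genuinely needed ones is just a matter of aggregating on the
(force, needs_flush) pair, and stats_updates shows how much pending data
each flush actually had -- no fentry or kprobe tricks required.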