On Fri, Oct 25, 2024 at 10:05 AM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
>
>
> On 10/25/24 12:40 AM, Yosry Ahmed wrote:
> > On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >> On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
> >>> On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
> >>>> Make use of the flush tracepoint within memcontrol.
> >>>>
> >>>> Signed-off-by: JP Kobryn <inwardvessel@xxxxxxxxx>
> >>> Is the intention to use tools like bpftrace to analyze where we flush
> >>> the most? In this case, why can't we just attach to the fentry of
> >>> do_flush_stats() and use the stack trace to find the path?
> >>>
> >>> We can also attach to mem_cgroup_flush_stats(), and the difference in
> >>> counts between the two will be the number of skipped flushes.
> >>>
> >> All these functions can get inlined and then we can not really attach
> >> easily. We can somehow find the offset in the inlined places and try to
> >> use kprobe but it is prohibitive when have to do for multiple kernels
> >> built with fdo/bolt.
> >>
> >> Please note that tracepoints are not really API, so we can remove them
> >> in future if we see no usage for them.
> > That's fair, but can we just add two tracepoints? This seems enough to
> > collect necessary data, and prevent proliferation of tracepoints and
> > the addition of the enum.
> >
> > I am thinking one in mem_cgroup_flush_stats() and one in
> > do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
> > trace_do_flush_stats(). Although the name of the latter is too
> > generic, maybe we should rename the function first to add mem_cgroup_*
> > or memcg_*.
> >
> > WDYT?
>
> Hmmm, I think if we did that we wouldn't get accurate info on when the
> flush was skipped. Comparing the number of hits between
> mem_cgroup_flush_stats() and do_flush_stats() to determine the number of
> skips doesn't seem reliable because of the places where do_flush_stats()
> is called outside of mem_cgroup_flush_stats(). There would be situations
> where a skip occurs, but meanwhile each call to do_flush_stats() outside
> of mem_cgroup_flush_stats() would effectively subtract that skip, making
> it appear that a skip did not occur.

You're underestimating the power of BPF, my friend :)

We can count the number of flushes in task local storages, in which
case we can get a very accurate representation because the counters
are per-task, so we know exactly when we skipped, but..

>
> Maybe as a middle ground we could remove the trace calls for the zswap
> and periodic cases, since no skips can occur there. We could then just
> leave one trace call in mem_cgroup_flush_stats() and instead of an enum
> we can pass a bool saying skipped or not. Something like this:
>
> mem_cgroup_flush_stats()
> {
> 	bool needs_flush = memcg_vmstats_needs_flush(...);
>
> 	trace_memcg_flush_stats(memcg, needs_flush);
>
> 	if (needs_flush)
> 		do_flush_stats(...);
> }
>
>
> Yosry/Shakeel, do you have any thoughts on whether we should keep the
> trace calls for obj_cgroup_may_zswap() and periodic workqueue cases?

..with that being said, I do like having a single tracepoint. I think
with some refactoring we can end up with a single tracepoint and more
data. We can even capture the cases where we force a flush but we
don't really need to flush. We can even add vmstats->stats_updates to
the tracepoint to know exactly how many updates we have when we flush.

What about the following:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c570..be0e7f52ad11a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -584,8 +584,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(struct mem_cgroup *memcg)
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
 {
+	bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
+
+	trace_memcg_flush_stats(memcg, needs_flush, force, ...);
+	if (!force && !needs_flush)
+		return;
+
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
@@ -609,8 +615,7 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
-	if (memcg_vmstats_needs_flush(memcg->vmstats))
-		do_flush_stats(memcg);
+	__mem_cgroup_flush_stats(memcg, false);
 }
 
 void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
@@ -626,7 +631,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
 	 * in latency-sensitive paths is as cheap as possible.
 	 */
-	do_flush_stats(root_mem_cgroup);
+	__mem_cgroup_flush_stats(root_mem_cgroup, true);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -5272,11 +5277,8 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		/*
-		 * mem_cgroup_flush_stats() ignores small changes. Use
-		 * do_flush_stats() directly to get accurate stats for charging.
-		 */
-		do_flush_stats(memcg);
+		/* Force a flush to get accurate stats for charging */
+		__mem_cgroup_flush_stats(memcg, true);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
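
For the trace_memcg_flush_stats() call above to build, a TRACE_EVENT
definition would also be needed somewhere, e.g. a new
include/trace/events/memcg.h. A minimal sketch of what that could look
like is below; the field set and the stats_updates argument (standing in
for the "..." at the call site) are only illustrative assumptions, not
part of this diff:

/*
 * Hypothetical include/trace/events/memcg.h -- sketch only. The event
 * name matches the call site above; the fields and their types are a
 * guess at what we would want to record.
 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM memcg

#if !defined(_TRACE_MEMCG_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MEMCG_H

#include <linux/memcontrol.h>
#include <linux/tracepoint.h>

TRACE_EVENT(memcg_flush_stats,

	/* argument order follows the call site in the diff above */
	TP_PROTO(struct mem_cgroup *memcg, bool needs_flush, bool force,
		 s64 stats_updates),

	TP_ARGS(memcg, needs_flush, force, stats_updates),

	TP_STRUCT__entry(
		__field(u64, id)
		__field(bool, needs_flush)
		__field(bool, force)
		__field(s64, stats_updates)
	),

	TP_fast_assign(
		/* identify the memcg by its cgroup inode number */
		__entry->id = cgroup_ino(memcg->css.cgroup);
		__entry->needs_flush = needs_flush;
		__entry->force = force;
		__entry->stats_updates = stats_updates;
	),

	TP_printk("memcg_id=%llu needs_flush=%d force=%d stats_updates=%lld",
		  (unsigned long long)__entry->id, __entry->needs_flush,
		  __entry->force, (long long)__entry->stats_updates)
);

#endif /* _TRACE_MEMCG_H */

/* This part must be outside the include guard */
#include <trace/define_trace.h>

With a single event like this, telling skipped flushes apart from forced
and genuinely needed ones is just a matter of aggregating on the
(force, needs_flush) pair, and stats_updates shows how much pending data
each flush actually had -- no fentry or kprobe tricks required.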