On 10/25/24 10:53 AM, Yosry Ahmed wrote:
On Fri, Oct 25, 2024 at 10:05 AM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
On 10/25/24 12:40 AM, Yosry Ahmed wrote:
On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
Make use of the flush tracepoint within memcontrol.
Signed-off-by: JP Kobryn <inwardvessel@xxxxxxxxx>
Is the intention to use tools like bpftrace to analyze where we flush
the most? In this case, why can't we just attach to the fentry of
do_flush_stats() and use the stack trace to find the path?
We can also attach to mem_cgroup_flush_stats(), and the difference in
counts between the two will be the number of skipped flushes.
All these functions can get inlined and then we cannot really attach
easily. We could find the offsets of the inlined call sites and use
kprobes, but that is prohibitive when it has to be done for multiple
kernels built with FDO/BOLT.
Please note that tracepoints are not really an API, so we can remove them
in the future if we see no usage for them.
That's fair, but can we just add two tracepoints? That seems enough to
collect the necessary data, and it avoids the proliferation of tracepoints
and the addition of the enum.
I am thinking one in mem_cgroup_flush_stats() and one in
do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
trace_do_flush_stats(). Although the name of the latter is too
generic, maybe we should rename the function first to add mem_cgroup_*
or memcg_*.
WDYT?
Hmmm, I think if we did that we wouldn't get accurate info on when a
flush was skipped. Comparing the hit counts of mem_cgroup_flush_stats()
and do_flush_stats() to determine the number of skips doesn't seem
reliable, because do_flush_stats() is also called outside of
mem_cgroup_flush_stats(). A skip could occur, but each call to
do_flush_stats() from outside mem_cgroup_flush_stats() would effectively
cancel that skip out, making it appear that the skip never happened.
You're underestimating the power of BPF, my friend :) We can count the
number of flushes in task local storage, in which case we get a very
accurate picture because the counters are per-task, so we know exactly
when we skipped, but...
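(For illustration, such per-task counting could look roughly like the
sketch below; the program and map names are made up, and it assumes
neither function gets inlined so the fentry attach points exist.)

/*
 * Sketch only: count flush attempts vs. actual flushes per task using
 * fentry probes and task local storage. Names are illustrative, and it
 * assumes mem_cgroup_flush_stats()/do_flush_stats() are not inlined.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct flush_counts {
        u64 requested;          /* mem_cgroup_flush_stats() entries */
        u64 flushed;            /* do_flush_stats() entries */
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct flush_counts);
} counts SEC(".maps");

static struct flush_counts *task_counts(void)
{
        return bpf_task_storage_get(&counts, bpf_get_current_task_btf(),
                                    NULL, BPF_LOCAL_STORAGE_GET_F_CREATE);
}

SEC("fentry/mem_cgroup_flush_stats")
int BPF_PROG(count_requested)
{
        struct flush_counts *c = task_counts();

        if (c)
                c->requested++;
        return 0;
}

SEC("fentry/do_flush_stats")
int BPF_PROG(count_flushed)
{
        struct flush_counts *c = task_counts();

        if (c)
                c->flushed++;
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

A user-space loader could then iterate the task storage map and compare
requested vs. flushed per task.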
Interesting. Thanks for explaining.
Maybe as a middle ground we could remove the trace calls for the zswap
and periodic cases, since no skips can occur there. We could then just
leave one trace call in mem_cgroup_flush_stats() and instead of an enum
we can pass a bool saying skipped or not. Something like this:
mem_cgroup_flush_stats()
{
        bool needs_flush = memcg_vmstats_needs_flush(...);

        trace_memcg_flush_stats(memcg, needs_flush);
        if (needs_flush)
                do_flush_stats(...);
}
Yosry/Shakeel, do you have any thoughts on whether we should keep the
trace calls for obj_cgroup_may_zswap() and periodic workqueue cases?
...with that being said, I do like having a single tracepoint. I think
with some refactoring we can end up with a single tracepoint and more
data. We can even capture the cases where we force a flush but don't
really need to flush, and we could also add vmstats->stats_updates to
the tracepoint to know exactly how many updates we have when we flush.
What about the following:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c570..be0e7f52ad11a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -584,8 +584,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
         }
 }

-static void do_flush_stats(struct mem_cgroup *memcg)
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
 {
+        bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
+
+        trace_memcg_flush_stats(memcg, needs_flush, force, ...);
+        if (!force && !needs_flush)
+                return;
+
         if (mem_cgroup_is_root(memcg))
                 WRITE_ONCE(flush_last_time, jiffies_64);

@@ -609,8 +615,7 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
         if (!memcg)
                 memcg = root_mem_cgroup;

-        if (memcg_vmstats_needs_flush(memcg->vmstats))
-                do_flush_stats(memcg);
+        __mem_cgroup_flush_stats(memcg, false);
 }

 void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
@@ -626,7 +631,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
          * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
          * in latency-sensitive paths is as cheap as possible.
          */
-        do_flush_stats(root_mem_cgroup);
+        __mem_cgroup_flush_stats(root_mem_cgroup, true);
         queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }

@@ -5272,11 +5277,8 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
                         break;
                 }

-                /*
-                 * mem_cgroup_flush_stats() ignores small changes. Use
-                 * do_flush_stats() directly to get accurate stats for charging.
-                 */
-                do_flush_stats(memcg);
+                /* Force a flush to get accurate stats for charging */
+                __mem_cgroup_flush_stats(memcg, true);
                 pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
                 if (pages < max)
                         continue;
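For reference, a rough sketch of what the single tracepoint's definition
could look like (the header location and field names below are my
assumptions, not something settled in this thread):

/*
 * Sketch only: one possible memcg_flush_stats tracepoint, e.g. in a new
 * include/trace/events/memcg.h. Field names are illustrative.
 */
TRACE_EVENT(memcg_flush_stats,

        TP_PROTO(struct mem_cgroup *memcg, bool needs_flush, bool force,
                 s64 stats_updates),

        TP_ARGS(memcg, needs_flush, force, stats_updates),

        TP_STRUCT__entry(
                __field(u64, cgrp_id)
                __field(bool, needs_flush)
                __field(bool, force)
                __field(s64, stats_updates)
        ),

        TP_fast_assign(
                __entry->cgrp_id = cgroup_id(memcg->css.cgroup);
                __entry->needs_flush = needs_flush;
                __entry->force = force;
                __entry->stats_updates = stats_updates;
        ),

        TP_printk("cgrp_id=%llu needs_flush=%d force=%d stats_updates=%lld",
                  __entry->cgrp_id, __entry->needs_flush,
                  __entry->force, __entry->stats_updates)
);

The "..." in the call above could then be something like
atomic64_read(&memcg->vmstats->stats_updates), matching the idea of
exposing vmstats->stats_updates in the tracepoint.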
I like the additional info. I'll incorporate this into v2.