Re: [PATCH] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()

Waiman Long <longman@xxxxxxxxxx> · Wed, 1 Nov 2023 18:03:46 -0400

On 11/1/23 15:11, Yosry Ahmed wrote:
On Wed, Nov 1, 2023 at 9:09 AM Waiman Long <longman@xxxxxxxxxx> wrote:
When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When
both are called concurrently, the cgroup_rstat_updated() run time
can spike to a pretty high value due to high cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task calling
cgroup_rstat_updated() is a realtime task running on an isolated CPU
with a strict latency requirement. The cgroup_rstat_updated() call can
happens when there is a page fault even though the task is running in
user space most of the time.

The percpu cpu_lock is used to protect the update tree -
updated_next and updated_children. This protection is only needed
when cgroup_rstat_cpu_pop_updated() is being called. The subsequent
flushing operation which can take a much longer time does not need
that protection.

To reduce the cpu_lock hold time, we need to perform all the
cgroup_rstat_cpu_pop_updated() calls up front with the lock
released afterward before doing any flushing. This patch adds a new
cgroup_rstat_flush_list() function to do just that and return a singly
linked list of cgroup_rstat_cpu structures to be flushed.

By adding some instrumentation code to measure the maximum elapsed times
of the new cgroup_rstat_flush_list() function and each cpu iteration
of cgroup_rstat_flush_locked() around the old cpu_lock lock/unlock pair
on a 2-socket x86-64 server running parallel kernel build, the maximum
elapsed times are 31us and 118us respectively. The maximum cpu_lock
hold time is now reduced to about 1/4 of the original.
This sounds promising. It's smart to empty the whole tree while
holding the lock, then do the flush only under cgroup_rstat_lock.
Thanks for doing this.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
  include/linux/cgroup-defs.h |  7 +++++
  kernel/cgroup/rstat.c       | 57 +++++++++++++++++++++++++++----------
  2 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 265da00a1a8b..22adb94ebb74 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -368,6 +368,13 @@ struct cgroup_rstat_cpu {
          */
         struct cgroup *updated_children;        /* terminated by self cgroup */
         struct cgroup *updated_next;            /* NULL iff not on the list */
+
+       /*
+        * A singly-linked list of cgroup_rstat_cpu structures to be flushed.
+        * Protected by cgroup_rstat_lock.
+        */
+       struct cgroup_rstat_cpu *flush_next;
+       struct cgroup *cgroup;                  /* Cgroup back pointer */
Why are we linking struct cgroup_rstat_cpu instead of directly linking
struct cgroup? AFAICT we only ever use the cgroup back pointer during
flushing anyway, right?
You are right.
Given that only one cpu can be flushed at a time, I think it should be
okay to run the list directly through struct cgroup, and have all cpus
reuse it. That pointer would essentially be scratch space for the
flushing code to use. This should also save a bit of memory:
O(cgroups) vs O(cgroups * cpus). It's not a lot either way though.

I think this may also simplify the code a bit.
Moving the flush_next pointer to struct cgroup does save a bit of 
memory. Thanks for the suggestion. I will do that in the next version.

  };

  struct cgroup_freezer_state {
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..93ef2795a68d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -145,6 +145,42 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
         return pos;
  }

+/*
+ * Return a list of cgroup_rstat_cpu structures to be flushed
+ */
+static struct cgroup_rstat_cpu *cgroup_rstat_flush_list(struct cgroup *root,
nit: the name sounds like the function will flush a list, rather than
return a list of cgroups to be flushed. What about
cgroup_rstat_updated_list?

I am not good at naming. cgroup_rstat_updated_list looks good to me.

Cheers,
Longman