Re: [PATCH 2/2] sched/psi: iterate through cgroups directly

On Fri, Feb 10, 2023 at 00:08, Kairui Song <ryncsn@xxxxxxxxx> wrote:
> On Thu, Feb 9, 2023 at 03:20, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > On Wed, Feb 08, 2023 at 06:29:56PM +0100, Michal Koutný wrote:
> > > On Thu, Feb 09, 2023 at 12:16:54AM +0800, Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > > Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> > > > Signed-off-by: Kairui Song <ryncsn@xxxxxxxxx>
> > >
> > > Typo?
> > >
> > > > -static inline struct psi_group *task_psi_group(struct task_struct *task)
> > > > +static inline struct psi_group *psi_iter_first(struct task_struct *task, void **iter)
> > > >  {
> > > >  #ifdef CONFIG_CGROUPS
> > > > -   if (static_branch_likely(&psi_cgroups_enabled))
> > > > -           return cgroup_psi(task_dfl_cgroup(task));
> > > > +   if (static_branch_likely(&psi_cgroups_enabled)) {
> > > > +           struct cgroup *cgroup = task_dfl_cgroup(task);
> > > > +
> > > > +           *iter = cgroup_parent(cgroup);
> > >
> > > This seems to skip a cgroup level -- maybe that's the observed
> > > performance gain?
> >
> > Hm, I don't think it does. It sets up *iter to point to the parent for
> > the _next() call, but it returns task_dfl_cgroup()->psi. The next call
> > does the same: cgroup = *iter, *iter = parent, return cgroup->psi.
> >
> > It could be a bit more readable to have *iter always point to the
> > current cgroup - but no strong preference either way from me:
> >
> > psi_groups_first(task, iter)
> > {
> > #ifdef CONFIG_CGROUPS
> >         if (static_branch_likely(&psi_cgroups_enabled)) {
> >                 struct cgroup *cgroup = task_dfl_cgroup(task);
> >
> >                 *iter = cgroup;
> >                 return cgroup_psi(cgroup);
> >         }
> > #endif
> >         return &psi_system;
> > }
> >
> > psi_groups_next(iter)
> > {
> > #ifdef CONFIG_CGROUPS
> >         if (static_branch_likely(&psi_cgroups_enabled)) {
> >                 struct cgroup *cgroup = *iter;
> >
> >                 if (cgroup) {
> >                         *iter = cgroup_parent(cgroup);
> >                         return cgroup_psi(cgroup);
> >                 }
> >         }
> >         return NULL;
> > #endif
> > }
>
> It should be like this, right? For psi_groups_next, retrieving the
> cgroup parent should be done before the "if (cgroup)" check; otherwise
> _next returns the same cgroup that _first (or the previous _next) just
> returned.
> + psi_groups_next(iter)
> + {
> + #ifdef CONFIG_CGROUPS
> +         if (static_branch_likely(&psi_cgroups_enabled)) {
> +                 struct cgroup *cgroup = *iter;
> +
> +                 cgroup = cgroup_parent(cgroup);
> +                 if (cgroup) {
> +                         *iter = cgroup;
> +                         return cgroup_psi(cgroup);
> +                 }
> +         }
> + #endif
> +         return NULL;
> + }
>
> Thanks for the suggestion!
>
> I think your style is better indeed.
>
> I tried to re-benchmark the code just in case, and it seems my
> previous benchmark results are no longer accurate enough: some results
> changed after I rebased to the latest master, or maybe just 100 runs
> of perfpipe is not enough to get a stable result.
>
> After a few retests, the final conclusion of the commit message is
> still valid; I will post a V2 after more testing.

Hi, I just ran the perf bench test for a few days on a machine to get a
stable average for the latest rebased patch, and found it is indeed
faster when the cgroup hierarchy is not too deep:

With rootcg:
55718.9 op/s (unpatched) compared to 55862.2 op/s (patched)
With 5 levels:
49975.5 op/s (unpatched) compared to 50778.5 op/s (patched)
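
For reference, the loop these numbers exercise walks the hierarchy with
the two helpers discussed above. A minimal sketch, assuming the
psi_iter_first()/psi_iter_next() names from the patch; the
psi_group_change() arguments here are illustrative, not the exact
call site:

	void *iter;
	struct psi_group *group;

	/* Visit each psi_group from the task's cgroup up to the root. */
	for (group = psi_iter_first(task, &iter); group;
	     group = psi_iter_next(&iter))
		psi_group_change(group, cpu, clear, set, true);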

The previous tests were a bit biased since I only ran the test 100 * 3
times, or maybe it is sensitive to some unrelated kernel structure
changes.

But I missed one important thing in my previous tests: performance
actually drops heavily with deeper levels of cgroups:
With 8 levels:
48902.4 op/s (unpatched) compared to 47476.6 op/s (patched)
With 50 levels of cgroup:
25605.7375 op/s (unpatched) compared to 20417.275 op/s (patched)

That's a terrible regression :( My guess is that when there are too
many cgroup structures, the CPU cache can't hold them all; the
unpatched version wins because the *parent pointer shares a cacheline
with the rest of the psi_group struct (like the bool enabled, which is
accessed on every state change), while the patched iteration has to
touch a cold struct cgroup at every level.
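
For reference, a minimal sketch of the relevant layout, assuming the
field order of include/linux/psi_types.h around that time (abbreviated,
not the full struct):

	struct psi_group {
		struct psi_group *parent;	/* chased by the unpatched walk */
		bool enabled;			/* checked on every state change */

		/* averaging and trigger machinery follows ... */
	};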

Sorry for missing such a critical point in v1; let's drop this series
until there is a better way to optimize it.



