Re: [PATCH V8 1/2] cgroup/rstat: Avoid flushing if there is an ongoing overlapping flush

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Tue, 30 Jul 2024 11:54:04 -0700

[..]
> >> +static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop,
> >> +                                      bool already_contended)
> >>          __acquires(&cgroup_rstat_lock)
> >>   {
> >> -       bool contended;
> >> +       bool locked = false;
> >>
> >> -       contended = !spin_trylock_irq(&cgroup_rstat_lock);
> >> -       if (contended) {
> >> -               trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
> >> +       if (already_contended) /* Skip trylock if already contended */
> >> +               locked = __cgroup_rstat_trylock(cgrp, cpu_in_loop);
> >
> > Should this be the other way around?
> >
>
> I think it is correct, but I used it wrong in once place, in
> cgroup_rstat_flush_hold(), as cgroup_rstat_trylock_flusher() returning
> false doesn't mean it was already_contended, but that ongoing flusher
> "skipped" (and waited for) a flush.  I need to correct this.

Something isn't adding up here as well. The comment says skip trylock
if already contended, then if already_contended is true we do a
trylock. Am I confusing myself here? :)

>
>
> >> +
> >> +       if (!locked) {
> >>                  spin_lock_irq(&cgroup_rstat_lock);
> >> +               trace_cgroup_rstat_locked(cgrp, cpu_in_loop, true);
> >>          }
> >> -       trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
> >>   }
> >>
> >>   static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> >> @@ -299,6 +316,72 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> >>          spin_unlock_irq(&cgroup_rstat_lock);
> >>   }
> >>
> >> +#define MAX_WAIT       msecs_to_jiffies(100)
> >> +/**
> >> + * cgroup_rstat_trylock_flusher - Trylock that checks for on ongoing flusher
> >> + * @cgrp: target cgroup
> >> + *
> >> + * Function return value follow trylock semantics. Returning true when lock is
> >> + * obtained. Returning false when not locked and it detected flushing can be
> >> + * skipped as another ongoing flusher took care of the flush.
> >> + */
> >> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
> >> +{
> >> +       struct cgroup *ongoing;
> >> +       bool locked;
> >> +
> >> +       /*
> >> +        * Check if ongoing flusher is already taking care of this, if
> >> +        * we are a descendant skip work, but wait for ongoing flusher
> >> +        * to complete work.
> >> +        */
> >> +retry:
> >> +       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> >> +       if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
> >
> > The discussion about cgrp_rstat_ongoing_flusher possibly going away in
> > parallel never reached a conclusion AFAICT. Should we use
> > cgroup_tryget() here to get a ref on 'ongoing' until wait completes?
>
> I like the cgroup_tryget() idea, but I though we needed to do this on
> the 'cgrp' that will be waiting in the completion queue on 'ongoing'?

I think we need to protect the 'ongoing' cgroup because it could go
away between reading cgrp_rstat_ongoing_flusher and using it to check
if we are a descendant and wait for its completion.

The lifetime of 'cgrp' should be handled by the caller, right?

>
> > This shouldn't add much complexity AFAICT.
>
> >
> > I think just using RCU here wouldn't be enough as we can flush rstat
> > after the RCU grace period when a cgroup is being freed.
> >
> >> +               wait_for_completion_interruptible_timeout(
> >> +                       &ongoing->flush_done, MAX_WAIT);
> >> +               /* TODO: Add tracepoint here */
> >> +               return false;
> >> +       }
> >> +
> >> +       locked = __cgroup_rstat_trylock(cgrp, -1);
> >> +       if (!locked) {
> >> +               /* Contended: Handle losing race for ongoing flusher */
> >> +               if (!ongoing && READ_ONCE(cgrp_rstat_ongoing_flusher))
> >> +                       goto retry;
> >> +
> >> +               __cgroup_rstat_lock(cgrp, -1, true);
> >> +       }
> >> +       /*
> >> +        * Obtained lock, record this cgrp as the ongoing flusher.
> >> +        * Due to lock yielding, we might obtain lock while another
> >> +        * ongoing flusher (that isn't a parent) owns ongoing_flusher.
> >> +        */
> >> +       ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> >> +       if (!ongoing) {
> >
> > I think we don't need protection here since we never dereference
> > 'cgrp_rstat_ongoing_flusher', but I think it may be clearer to
> > directly check it to make this obvious:
> >
> > if (!READ_ONCE(cgrp_rstat_ongoing_flusher)) {
> >
>
> Makes sense, but I use the 'ongoing' variable in the next patch in a
> tracepoint to diagnose this happening.
>
> > Perhaps we can also explicitly mention in the comment why we do not
> > need any protection here, but I am not sure how helpful that will be.
> >
> >> +               /*
> >> +                * Limit to top-level as lock yielding allows others to obtain
> >> +                * lock without being ongoing_flusher. Leading to cgroup that
> >> +                * isn't descendant to obtain lock via yielding. So, prefer
> >> +                * ongoing_flusher with many descendants.
> >> +                */
> >> +               if (cgrp->level < 2) {
> >
> > This covers roots and top-level cgroups under them, right? Did them
> > improve the numbers you were observing?
> >
>
> The numbers from prod improved significantly, then cadvisor and kswapd
> collide.

Do you mean that the numbers improved compared to without this patch,
or compared with allowing all cgroups to be the ongoing flusher?

The latter was my question. For the former, I definitely agree this
patch improves things based on the data.

> But cadvisor still flush a couple of level 1 cgroups, and can still
> cause lock contention for level 0 and other non-decendent cgroups.
>
> 11:30:08 @ongoing_flusher_yield[0]: 68
> @ongoing_flusher_cnt[kswapd11]: 4
> @ongoing_flusher_cnt[kswapd2]: 4
> @ongoing_flusher_cnt[kswapd5]: 4
> @ongoing_flusher_cnt[handled_race]: 4
> @ongoing_flusher_cnt[kswapd9]: 5
> @ongoing_flusher_cnt[kswapd7]: 5
> @ongoing_flusher_cnt[kswapd4]: 5
> @ongoing_flusher_cnt[kswapd6]: 5
> @ongoing_flusher_cnt[kswapd1]: 5
> @ongoing_flusher_cnt[kswapd8]: 6
> @ongoing_flusher_cnt[kswapd10]: 6
> @ongoing_flusher_cnt[kswapd3]: 6
> @ongoing_flusher_cnt[kswapd0]: 6
> @ongoing_flusher_cnt[cadvisor]: 8
> @ongoing_flusher_cnt[all]: 65
> @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_yield]: 4
> @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 26
> @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 60
> @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 475
> @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 1953
> @lock_contended[normal, 1]: 2
> @lock_contended[normal, 3]: 8
> @lock_contended[normal, 0]: 16
>
> We see that level 0 observe lock_contended 16 times, but we should
> subtract 4 as that was "handled_race" cases. So 12 times, the root-cgrp
> was spinning on the lock. In total (26-4) 22 times flushers contented on
> the lock.  Given 475 flushes happen within this 1 sec period, every 2.1
> ms, then I do call it it a significant reduction for lock contention.

Agreed :)

>
>
> > AFAICT, we can remove this restriction completely if/when we use a
> > mutex and support a single ongoing flusher. If so, let's explicitly
> > mention this, perhaps:
> >
>
> Well... I'm still not convinced that it makes sense to have level >= 2
> be the ongoing flusher.
>
> E.g. if a level 2 cgroup becomes ongoing flusher, and kswapd starts 12
> NUMA flushes at the same time, then the code will have these 12 kswapd
> threads spin on the lock, until ongoing flusher finishes. That is likely
> what happened above (for a level 1).  These 12 spinning (root) flushers
> will not recheck ongoing_flusher and will all flush the root
> (unnecessarily 11 times).

Hmm regardless of whether or not the level-2 cgroup becomes the
ongoing flusher, the kswapd threads will all spin on the lock anyway
since none of them can be the ongoing flusher until the level-2 cgroup
finishes. Right?

Is the scenario you have in mind that the level-2 cgroup starts
flushing at the same time as kswapd, so there is a race on who gets to
be the ongoing flusher? In this case as well, whoever gets the lock
will be the ongoing flusher anyway.

Not allowing whoever is holding the lock to be the ongoing flusher
based on level is only useful when we can have multiple ongoing
flushers (with lock yielding). Right?

Perhaps I am missing something here.

>
> So, I don't think it is a good idea to have anything else that the root
> as the ongoing flusher.
>
> Can you explain/convince me why having sub-cgroups as ongoing flusher is
> an advantage?

I just don't see the benefit of the special casing here as I mentioned
above. If I missed something please let me know.

>
> --Jesper
>
> > XXX: Remove this restriction if/when lock yielding is removed
> >
[..]