On Wed, May 24, 2023 at 7:50 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 5/24/23 22:04, Ming Lei wrote:
> > On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
> >> On 5/24/23 11:43, Waiman Long wrote:
> >>> On 5/24/23 00:26, Ming Lei wrote:
> >>>> On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
> >>>>> On 5/23/23 23:51, Ming Lei wrote:
> >>>>>> As noted by Michal, the blkg_iostat_set's in the lockless list hold
> >>>>>> reference to blkg's to protect against their removal. Those blkg's
> >>>>>> hold reference to blkcg. When a cgroup is being destroyed,
> >>>>>> cgroup_rstat_flush() is only called at css_release_work_fn() which
> >>>>>> is called when the blkcg reference count reaches 0. This circular
> >>>>>> dependency will prevent blkcg and some blkgs from being freed after
> >>>>>> they are made offline.
> >>>>>>
> >>>>>> It is less a problem if the cgroup to be destroyed also has other
> >>>>>> controllers like memory that will call cgroup_rstat_flush() which will
> >>>>>> clean up the reference count. If block is the only controller that uses
> >>>>>> rstat, these offline blkcg and blkgs may never be freed leaking more
> >>>>>> and more memory over time.
> >>>>>>
> >>>>>> To prevent this potential memory leak:
> >>>>>>
> >>>>>> - flush blkcg per-cpu stats list in __blkg_release(), when no new stat
> >>>>>>   can be added
> >>>>>>
> >>>>>> - don't grab bio->bi_blkg reference when adding the stats into blkcg's
> >>>>>>   per-cpu stat list since all stats are guaranteed to be consumed before
> >>>>>>   releasing blkg instance, and grabbing blkg reference for stats was the
> >>>>>>   most fragile part of original patch
> >>>>>>
> >>>>>> Based on Waiman's patch:
> >>>>>>
> >>>>>> https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/
> >>>>>>
> >>>>>> Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> >>>>>> Cc: Waiman Long <longman@xxxxxxxxxx>
> >>>>>> Cc: Tejun Heo <tj@xxxxxxxxxx>
> >>>>>> Cc: mkoutny@xxxxxxxx
> >>>>>> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> >>>>>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> >>>>>> ---
> >>>>>> V2:
> >>>>>>     - remove kernel/cgroup change, and call blkcg_rstat_flush()
> >>>>>>       to flush stat directly
> >>>>>>
> >>>>>>  block/blk-cgroup.c | 29 +++++++++++++++++++++--------
> >>>>>>  1 file changed, 21 insertions(+), 8 deletions(-)
> >>>>>>
> >>>>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> >>>>>> index 0ce64dd73cfe..ed0eb8896972 100644
> >>>>>> --- a/block/blk-cgroup.c
> >>>>>> +++ b/block/blk-cgroup.c
> >>>>>> @@ -34,6 +34,8 @@
> >>>>>>  #include "blk-ioprio.h"
> >>>>>>  #include "blk-throttle.h"
> >>>>>> +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> >>>>>> +
> >>>>>>  /*
> >>>>>>   * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> >>>>>>   * blkcg_pol_register_mutex nests outside of it and synchronizes entire
> >>>>>> @@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
> >>>>>>  static void __blkg_release(struct rcu_head *rcu)
> >>>>>>  {
> >>>>>>      struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
> >>>>>> +    struct blkcg *blkcg = blkg->blkcg;
> >>>>>> +    int cpu;
> >>>>>>  #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
> >>>>>>      WARN_ON(!bio_list_empty(&blkg->async_bios));
> >>>>>>  #endif
> >>>>>> +    /*
> >>>>>> +     * Flush all the non-empty percpu lockless lists before releasing
> >>>>>> +     * us, given these stat belongs to us.
> >>>>>> +     *
> >>>>>> +     * cgroup locks aren't needed here since __blkcg_rstat_flush just
> >>>>>> +     * propagates delta into blkg parent, which is live now.
> >>>>>> +     */
> >>>>>> +    for_each_possible_cpu(cpu)
> >>>>>> +        __blkcg_rstat_flush(blkcg, cpu);
> >>>>>>      /* release the blkcg and parent blkg refs this blkg has been holding */
> >>>>>>      css_put(&blkg->blkcg->css);
> >>>>>> @@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
> >>>>>>      u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
> >>>>>>  }
> >>>>>> -static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>> +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
> >>>>>>  {
> >>>>>> -    struct blkcg *blkcg = css_to_blkcg(css);
> >>>>>>      struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> >>>>>>      struct llist_node *lnode;
> >>>>>>      struct blkg_iostat_set *bisc, *next_bisc;
> >>>>>> -    /* Root-level stats are sourced from system-wide IO stats */
> >>>>>> -    if (!cgroup_parent(css->cgroup))
> >>>>>> -        return;
> >>>>>> -
> >>>>>>      rcu_read_lock();
> >>>>>>      lnode = llist_del_all(lhead);
> >>>>>> @@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>>          if (parent && parent->parent)
> >>>>>>              blkcg_iostat_update(parent, &blkg->iostat.cur,
> >>>>>>                          &blkg->iostat.last);
> >>>>>> -        percpu_ref_put(&blkg->refcnt);
> >>>>>>      }
> >>>>>>  out:
> >>>>>>      rcu_read_unlock();
> >>>>>>  }
> >>>>>> +static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>> +{
> >>>>>> +    /* Root-level stats are sourced from system-wide IO stats */
> >>>>>> +    if (cgroup_parent(css->cgroup))
> >>>>>> +        __blkcg_rstat_flush(css_to_blkcg(css), cpu);
> >>>>>> +}
> >>>>>> +
> >>>>> I think it may not be safe to call __blkcg_rstat_flush() directly without
> >>>>> taking the cgroup_rstat_cpu_lock. That is why I added a helper to
> >>>>> kernel/cgroup/rstat.c in my patch to meet the locking requirement.
> >>>> All stats are removed from llist_del_all(), and the local list is
> >>>> iterated, then each blkg & its parent is touched in __blkcg_rstat_flush(),
> >>>> so can you explain a bit why cgroup locks are needed? For protecting
> >>>> what?
> >>> You are right. The llist_del_all() call in blkcg_rstat_flush() is
> >>> atomic, so it is safe for concurrent execution, which is what the
> >>> cgroup_rstat_cpu_lock protects against. That may not be the case for
> >>> rstat callbacks of other controllers. So I suggest you add a comment to
> >>> clarify that point. Other than that, your patch looks good to me.
> >>>
> >>> Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
> >> After some more thought, I need to retract my reviewed-by tag for now.
> >> There is a slight possibility that blkcg_iostat_update() in
> >> blkcg_rstat_flush() can happen concurrently, which will corrupt the
> >> sequence count.
> > llist_del_all() moves all 'bis' into one local list, and a bis is a percpu
> > variable of the blkg, so in theory the same bis won't be flushed at the
> > same time. And one bis should be touched in just one of the stat flush
> > code paths because of llist_del_all().
> >
> > So a 'bis' can still be thought of as being flushed in a serialized way.
> >
> > However, blk_cgroup_bio_start() runs concurrently with blkcg_rstat_flush(),
> > so once bis->lqueued is cleared in blkcg_rstat_flush(), this same bis
> > could be added to the percpu llist again, and the __blkcg_rstat_flush()
> > from blkg_release() follows. This should be the only chance for
> > concurrent stats update.
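
(If I follow the race being described, the interleaving is roughly the
following; this is only an illustration of the paragraph above, with the
blk_cgroup_bio_start() enqueue details paraphrased:)

    CPU A: blkcg_rstat_flush()            CPU B
    --------------------------            -----------------------------------
    llist_del_all()
    WRITE_ONCE(bisc->lqueued, false)
                                          blk_cgroup_bio_start():
                                            sees lqueued == false, adds the
                                            same bisc back to the per-cpu
                                            llist, lqueued = true
                                          __blkg_release() (RCU callback):
                                            __blkcg_rstat_flush() consumes the
                                            re-added bisc and updates the
                                            blkg / parent iostat
    blkcg_iostat_update() on the same
    blkg / parent, concurrently with
    CPU B's update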
>
> That is what I have in mind. One __blkcg_rstat_flush() can come from
> blkg_release() and another one from the regular cgroup_rstat_flush*().
>
> > But, blkg_release() is run in an RCU callback, so the previous flush has
> > been done, but a new flush can come, and another blkg's stat could be
> > added with the same story as above.
> >
> >> One way to avoid that is to synchronize it by cgroup_rstat_cpu_lock.
> >> Another way is to use the bisc->lqueued for synchronization.
> > I'd avoid the external cgroup lock here.
> >
> >> In that case, you need to move WRITE_ONCE(bisc->lqueued, false) in
> >> blkcg_rstat_flush() to the end, after all the blkcg_iostat_update()
> >> calls, with smp_store_release(), and replace the READ_ONCE(bis->lqueued)
> >> check in blk_cgroup_bio_start() with smp_load_acquire().
> > This way looks doable, but I guess it still can't avoid a concurrent
> > update on the parent stat, such as when the __blkcg_rstat_flush() from
> > blkg_release() is in progress, another sibling blkg's bis is added, and
> > meanwhile blkcg_rstat_flush() is called.
> I realized that the use of cgroup_rstat_cpu_lock or the alternative was
> not safe enough for preventing concurrent parent blkg rstat updates.
>
> > Another way is to add blkcg->stat_lock for covering __blkcg_rstat_flush();
> > what do you think of this way?
>
> I am thinking of adding a raw spinlock into blkg and taking it when doing
> blkcg_iostat_update(). This can guarantee no concurrent update to the rstat
> data. It has to be a raw spinlock as it will be taken under the
> cgroup_rstat_cpu_lock raw spinlock.

Hi Waiman,

I don't have context about blkcg, but isn't this exactly what
cgroup_rstat_lock does? Is it too expensive to just call
cgroup_rstat_flush() here?

> The use of u64_stats_fetch_begin/u64_stats_fetch_retry in retrieving the
> rstat data from bisc can happen concurrently with an update without further
> synchronization. However, there can be at most one update at any time.
>
> Cheers,
> Longman
>
> >
> > Thanks,
> > Ming
> >
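
For reference, a rough sketch of the ->lqueued ordering change Waiman
describes above (clear the flag only after the updates, with a
release/acquire pair). The surrounding code is paraphrased from
blk-cgroup.c and is only meant to show where the barriers would sit; it
is not a tested patch:

    /* flush side, per-bisc work in __blkcg_rstat_flush() */
    llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
        struct blkcg_gq *blkg = bisc->blkg;
        struct blkcg_gq *parent = blkg->parent;

        /*
         * ... snapshot bisc->cur and run blkcg_iostat_update() on
         * blkg and, if it has a non-root parent, on the parent ...
         */

        /*
         * Clear lqueued only after the updates above are done; pairs
         * with the smp_load_acquire() in the enqueue path below.
         */
        smp_store_release(&bisc->lqueued, false);
    }

    /* enqueue side, in blk_cgroup_bio_start() */
    if (!smp_load_acquire(&bis->lqueued)) {
        llist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));
        WRITE_ONCE(bis->lqueued, true);
    }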
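
And a minimal sketch of the per-blkg raw spinlock Waiman mentions, taken
around the stat propagation in blkcg_iostat_update(). The iostat_lock
field is hypothetical (it would have to be added to struct blkcg_gq and
initialized when the blkg is allocated); the body otherwise mirrors what
blkcg_iostat_update() roughly does today:

    static void blkcg_iostat_update(struct blkcg_gq *blkg,
                                    struct blkg_iostat *cur,
                                    struct blkg_iostat *last)
    {
        struct blkg_iostat delta;
        unsigned long flags;

        /*
         * Serialize the two possible flushers (a regular
         * cgroup_rstat_flush() and the flush from __blkg_release())
         * that may update the same blkg or its parent. It has to be a
         * raw spinlock because the rstat flush path already holds the
         * raw cgroup_rstat_cpu_lock around this callback.
         */
        raw_spin_lock_irqsave(&blkg->iostat_lock, flags);  /* hypothetical field */
        u64_stats_update_begin(&blkg->iostat.sync);

        /* propagate the per-cpu delta into the cumulative counters */
        blkg_iostat_set(&delta, cur);
        blkg_iostat_sub(&delta, last);
        blkg_iostat_add(&blkg->iostat.cur, &delta);
        blkg_iostat_add(last, &delta);

        u64_stats_update_end(&blkg->iostat.sync);
        raw_spin_unlock_irqrestore(&blkg->iostat_lock, flags);
    }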