Re: [PATCH V2] blk-cgroup: Flush stats before releasing blkcg_gq

Ming Lei <ming.lei@xxxxxxxxxx> · Thu, 25 May 2023 10:04:22 +0800

On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
> On 5/24/23 11:43, Waiman Long wrote:
> > On 5/24/23 00:26, Ming Lei wrote:
> > > On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
> > > > On 5/23/23 23:51, Ming Lei wrote:
> > > > > As noted by Michal, the blkg_iostat_set's in the lockless list hold
> > > > > reference to blkg's to protect against their removal. Those blkg's
> > > > > hold reference to blkcg. When a cgroup is being destroyed,
> > > > > cgroup_rstat_flush() is only called at css_release_work_fn() which
> > > > > is called when the blkcg reference count reaches 0. This circular
> > > > > dependency will prevent blkcg and some blkgs from being freed after
> > > > > they are made offline.
> > > > > 
> > > > > It is less a problem if the cgroup to be destroyed also has other
> > > > > controllers like memory that will call cgroup_rstat_flush() which will
> > > > > clean up the reference count. If block is the only
> > > > > controller that uses
> > > > > rstat, these offline blkcg and blkgs may never be freed leaking more
> > > > > and more memory over time.
> > > > > 
> > > > > To prevent this potential memory leak:
> > > > > 
> > > > > - flush blkcg per-cpu stats list in __blkg_release(), when no new stat
> > > > > can be added
> > > > > 
> > > > > - don't grab bio->bi_blkg reference when adding the stats into blkcg's
> > > > > per-cpu stat list since all stats are guaranteed to be consumed before
> > > > > releasing blkg instance, and grabbing blkg reference for stats was the
> > > > > most fragile part of original patch
> > > > > 
> > > > > Based on Waiman's patch:
> > > > > 
> > > > > https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/
> > > > > 
> > > > > 
> > > > > Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> > > > > Cc: Waiman Long <longman@xxxxxxxxxx>
> > > > > Cc: Tejun Heo <tj@xxxxxxxxxx>
> > > > > Cc: mkoutny@xxxxxxxx
> > > > > Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > > > ---
> > > > > V2:
> > > > >     - remove kernel/cgroup change, and call blkcg_rstat_flush()
> > > > >     to flush stat directly
> > > > > 
> > > > >    block/blk-cgroup.c | 29 +++++++++++++++++++++--------
> > > > >    1 file changed, 21 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > > > > index 0ce64dd73cfe..ed0eb8896972 100644
> > > > > --- a/block/blk-cgroup.c
> > > > > +++ b/block/blk-cgroup.c
> > > > > @@ -34,6 +34,8 @@
> > > > >    #include "blk-ioprio.h"
> > > > >    #include "blk-throttle.h"
> > > > > +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> > > > > +
> > > > >    /*
> > > > >     * blkcg_pol_mutex protects blkcg_policy[] and policy
> > > > > [de]activation.
> > > > >     * blkcg_pol_register_mutex nests outside of it and
> > > > > synchronizes entire
> > > > > @@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
> > > > >    static void __blkg_release(struct rcu_head *rcu)
> > > > >    {
> > > > >        struct blkcg_gq *blkg = container_of(rcu, struct
> > > > > blkcg_gq, rcu_head);
> > > > > +    struct blkcg *blkcg = blkg->blkcg;
> > > > > +    int cpu;
> > > > >    #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
> > > > >        WARN_ON(!bio_list_empty(&blkg->async_bios));
> > > > >    #endif
> > > > > +    /*
> > > > > +     * Flush all the non-empty percpu lockless lists before releasing
> > > > > +     * us, given these stat belongs to us.
> > > > > +     *
> > > > > +     * cgroup locks aren't needed here since __blkcg_rstat_flush just
> > > > > +     * propagates delta into blkg parent, which is live now.
> > > > > +     */
> > > > > +    for_each_possible_cpu(cpu)
> > > > > +        __blkcg_rstat_flush(blkcg, cpu);
> > > > >        /* release the blkcg and parent blkg refs this blkg
> > > > > has been holding */
> > > > >        css_put(&blkg->blkcg->css);
> > > > > @@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct
> > > > > blkcg_gq *blkg, struct blkg_iostat *cur,
> > > > > u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
> > > > >    }
> > > > > -static void blkcg_rstat_flush(struct cgroup_subsys_state
> > > > > *css, int cpu)
> > > > > +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
> > > > >    {
> > > > > -    struct blkcg *blkcg = css_to_blkcg(css);
> > > > >        struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > > > >        struct llist_node *lnode;
> > > > >        struct blkg_iostat_set *bisc, *next_bisc;
> > > > > -    /* Root-level stats are sourced from system-wide IO stats */
> > > > > -    if (!cgroup_parent(css->cgroup))
> > > > > -        return;
> > > > > -
> > > > >        rcu_read_lock();
> > > > >        lnode = llist_del_all(lhead);
> > > > > @@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct
> > > > > cgroup_subsys_state *css, int cpu)
> > > > >            if (parent && parent->parent)
> > > > >                blkcg_iostat_update(parent, &blkg->iostat.cur,
> > > > >                            &blkg->iostat.last);
> > > > > -        percpu_ref_put(&blkg->refcnt);
> > > > >        }
> > > > >    out:
> > > > >        rcu_read_unlock();
> > > > >    }
> > > > > +static void blkcg_rstat_flush(struct cgroup_subsys_state
> > > > > *css, int cpu)
> > > > > +{
> > > > > +    /* Root-level stats are sourced from system-wide IO stats */
> > > > > +    if (cgroup_parent(css->cgroup))
> > > > > +        __blkcg_rstat_flush(css_to_blkcg(css), cpu);
> > > > > +}
> > > > > +
> > > > I think it may not safe to call __blkcg_rstat_flus() directly
> > > > without taking
> > > > the cgroup_rstat_cpu_lock. That is why I added a helper to
> > > > kernel/cgroup/rstat.c in my patch to meet the locking requirement.
> > > All stats are removed from llist_del_all(), and the local list is
> > > iterated, then each blkg & its parent is touched in
> > > __blkcg_rstat_flus(), so
> > > can you explain it a bit why cgroup locks are needed? For protecting
> > > what?
> > 
> > You are right. The llist_del_all() call in blkcg_rstat_flush() is
> > atomic, so it is safe for concurrent execution which is what the
> > cgroup_rstat_cpu_lock protects against. That may not be the case for
> > rstat callbacks of other controllers. So I will suggest you to add a
> > comment to clarify that point. Other than that, you patch looks good to
> > me.
> > 
> > Reviewed: Waiman Long <longman@xxxxxxxxxx>
> 
> After some more thought, I need to retract my reviewed-by tag for now. There
> is a slight possibility that blkcg_iostat_update() in blkcg_rstat_flush()
> can happen concurrently which will corrupt the sequence count.

llist_del_all() moves all 'bis' into one local list, and bis is one percpu
variable of blkg, so in theory same bis won't be flushed at the same
time. And one bis should be touched in just one of stat flush code path
because of llist_del_all().

So 'bis' still can be thought as being flushed in serialized way.

However, blk_cgroup_bio_start() runs concurrently with blkcg_rstat_flush(),
so once bis->lqueued is cleared in blkcg_rstat_flush(), this same bis
could be added to the percpu llist and __blkcg_rstat_flush() from blkg_release()
follows. This should be the only chance for concurrent stats update.

But, blkg_release() is run in RCU callback, so the previous flush has
been done, but new flush can come, and other blkg's stat could be added
with same story above.

> One way to
> avoid that is to synchronize it by cgroup_rstat_cpu_lock. Another way is to
> use the bisc->lqueued for synchronization.

I'd avoid the external cgroup lock here.

> In that case, you need to move
> WRITE_ONCE(bisc->lqueued, false) in blkcg_rstat_flush() to the end after all
> the  blkcg_iostat_update() call with smp_store_release() and replace the
> READ_ONCE(bis->lqueued) check in blk_cgroup_bio_start() with
> smp_load_acquire().

This way looks doable, but I guess it still can't avoid concurrent update on parent
stat, such as when  __blkcg_rstat_flush() from blkg_release() is
in-progress, another sibling blkg's bis is added, meantime
blkcg_rstat_flush() is called.

Another way is to add blkcg->stat_lock for covering __blkcg_rstat_flush(), what
do you think of this way?

Thanks,
Ming