On Wed, May 24, 2023 at 10:50:29PM -0400, Waiman Long wrote:
> On 5/24/23 22:04, Ming Lei wrote:
> > On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
> > > On 5/24/23 11:43, Waiman Long wrote:
> > > > On 5/24/23 00:26, Ming Lei wrote:
> > > > > On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
> > > > > > On 5/23/23 23:51, Ming Lei wrote:
> > > > > > > As noted by Michal, the blkg_iostat_set's in the lockless list hold
> > > > > > > reference to blkg's to protect against their removal. Those blkg's
> > > > > > > hold reference to blkcg. When a cgroup is being destroyed,
> > > > > > > cgroup_rstat_flush() is only called at css_release_work_fn(), which
> > > > > > > is called when the blkcg reference count reaches 0. This circular
> > > > > > > dependency will prevent blkcg and some blkgs from being freed after
> > > > > > > they are made offline.
> > > > > > >
> > > > > > > It is less of a problem if the cgroup to be destroyed also has other
> > > > > > > controllers, like memory, that will call cgroup_rstat_flush() and so
> > > > > > > clean up the reference count. If block is the only controller that
> > > > > > > uses rstat, these offline blkcgs and blkgs may never be freed, leaking
> > > > > > > more and more memory over time.
> > > > > > >
> > > > > > > To prevent this potential memory leak:
> > > > > > >
> > > > > > > - flush blkcg per-cpu stats list in __blkg_release(), when no new stat
> > > > > > >   can be added
> > > > > > >
> > > > > > > - don't grab bio->bi_blkg reference when adding the stats into blkcg's
> > > > > > >   per-cpu stat list since all stats are guaranteed to be consumed before
> > > > > > >   releasing the blkg instance, and grabbing a blkg reference for stats
> > > > > > >   was the most fragile part of the original patch
> > > > > > >
> > > > > > > Based on Waiman's patch:
> > > > > > >
> > > > > > > https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/
> > > > > > >
> > > > > > > Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> > > > > > > Cc: Waiman Long <longman@xxxxxxxxxx>
> > > > > > > Cc: Tejun Heo <tj@xxxxxxxxxx>
> > > > > > > Cc: mkoutny@xxxxxxxx
> > > > > > > Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > > > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > > > > > ---
> > > > > > > V2:
> > > > > > >     - remove kernel/cgroup change, and call blkcg_rstat_flush()
> > > > > > >       to flush stat directly
> > > > > > >
> > > > > > >  block/blk-cgroup.c | 29 +++++++++++++++++++++--------
> > > > > > >  1 file changed, 21 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > > > > > > index 0ce64dd73cfe..ed0eb8896972 100644
> > > > > > > --- a/block/blk-cgroup.c
> > > > > > > +++ b/block/blk-cgroup.c
> > > > > > > @@ -34,6 +34,8 @@
> > > > > > >  #include "blk-ioprio.h"
> > > > > > >  #include "blk-throttle.h"
> > > > > > > +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> > > > > > > +
> > > > > > >
> > > > > > >  /*
> > > > > > >   * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> > > > > > >   * blkcg_pol_register_mutex nests outside of it and synchronizes entire
> > > > > > > @@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
> > > > > > >  static void __blkg_release(struct rcu_head *rcu)
> > > > > > >  {
> > > > > > >  	struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
> > > > > > > +	struct blkcg *blkcg = blkg->blkcg;
> > > > > > > +	int cpu;
> > > > > > >
> > > > > > >  #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
> > > > > > >  	WARN_ON(!bio_list_empty(&blkg->async_bios));
> > > > > > >  #endif
> > > > > > > +	/*
> > > > > > > +	 * Flush all the non-empty percpu lockless lists before releasing
> > > > > > > +	 * us, given these stats belong to us.
> > > > > > > +	 *
> > > > > > > +	 * cgroup locks aren't needed here since __blkcg_rstat_flush just
> > > > > > > +	 * propagates delta into blkg parent, which is live now.
> > > > > > > +	 */
> > > > > > > +	for_each_possible_cpu(cpu)
> > > > > > > +		__blkcg_rstat_flush(blkcg, cpu);
> > > > > > >
> > > > > > >  	/* release the blkcg and parent blkg refs this blkg has been holding */
> > > > > > >  	css_put(&blkg->blkcg->css);
> > > > > > > @@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
> > > > > > >  	u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
> > > > > > >  }
> > > > > > >
> > > > > > > -static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > > > > > > +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
> > > > > > >  {
> > > > > > > -	struct blkcg *blkcg = css_to_blkcg(css);
> > > > > > >  	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > > > > > >  	struct llist_node *lnode;
> > > > > > >  	struct blkg_iostat_set *bisc, *next_bisc;
> > > > > > >
> > > > > > > -	/* Root-level stats are sourced from system-wide IO stats */
> > > > > > > -	if (!cgroup_parent(css->cgroup))
> > > > > > > -		return;
> > > > > > > -
> > > > > > >  	rcu_read_lock();
> > > > > > >
> > > > > > >  	lnode = llist_del_all(lhead);
> > > > > > > @@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > > > > > >  		if (parent && parent->parent)
> > > > > > >  			blkcg_iostat_update(parent, &blkg->iostat.cur,
> > > > > > >  					    &blkg->iostat.last);
> > > > > > > -		percpu_ref_put(&blkg->refcnt);
> > > > > > >  	}
> > > > > > >
> > > > > > >  out:
> > > > > > >  	rcu_read_unlock();
> > > > > > >  }
> > > > > > >
> > > > > > > +static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > > > > > > +{
> > > > > > > +	/* Root-level stats are sourced from system-wide IO stats */
> > > > > > > +	if (cgroup_parent(css->cgroup))
> > > > > > > +		__blkcg_rstat_flush(css_to_blkcg(css), cpu);
> > > > > > > +}
> > > > > > > +
> > > > > >
> > > > > > I think it may not be safe to call __blkcg_rstat_flush() directly without
> > > > > > taking the cgroup_rstat_cpu_lock. That is why I added a helper to
> > > > > > kernel/cgroup/rstat.c in my patch to meet the locking requirement.
> > > > >
> > > > > All stats are removed from llist_del_all(), and the local list is
> > > > > iterated, then each blkg & its parent is touched in __blkcg_rstat_flush(),
> > > > > so can you explain a bit why cgroup locks are needed? For protecting
> > > > > what?
> > > >
> > > > You are right. The llist_del_all() call in blkcg_rstat_flush() is
> > > > atomic, so it is safe for concurrent execution, which is what the
> > > > cgroup_rstat_cpu_lock protects against. That may not be the case for
> > > > rstat callbacks of other controllers.
> > > > So I would suggest you add a comment to clarify that point. Other than
> > > > that, your patch looks good to me.
> > > >
> > > > Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
> > >
> > > After some more thought, I need to retract my reviewed-by tag for now.
> > > There is a slight possibility that blkcg_iostat_update() in
> > > blkcg_rstat_flush() can happen concurrently, which will corrupt the
> > > sequence count.
> >
> > llist_del_all() moves all 'bis' into one local list, and bis is one percpu
> > variable of blkg, so in theory the same bis won't be flushed at the same
> > time. And one bis should be touched in just one stat flush code path
> > because of llist_del_all().
> >
> > So 'bis' can still be thought of as being flushed in a serialized way.
> >
> > However, blk_cgroup_bio_start() runs concurrently with blkcg_rstat_flush(),
> > so once bis->lqueued is cleared in blkcg_rstat_flush(), this same bis
> > could be added to the percpu llist again, and __blkcg_rstat_flush() from
> > blkg_release() follows. This should be the only chance for a concurrent
> > stats update.
>
> That is what I have in mind. A __blkcg_rstat_flush() can come from
> blkg_release() and another one from the regular cgroup_rstat_flush*().

But the two blkgs can't be the same because blkg_release() is run from an
RCU callback.

> >
> > But blkg_release() is run in an RCU callback, so the previous flush has
> > been done, but a new flush can come, and another blkg's stat could be
> > added with the same story as above.
> >
> > > One way to avoid that is to synchronize it by cgroup_rstat_cpu_lock.
> > > Another way is to use the bisc->lqueued for synchronization.
> >
> > I'd avoid the external cgroup lock here.
> >
> > > In that case, you need to move WRITE_ONCE(bisc->lqueued, false) in
> > > blkcg_rstat_flush() to the end after all the blkcg_iostat_update() calls
> > > with smp_store_release() and replace the READ_ONCE(bis->lqueued) check
> > > in blk_cgroup_bio_start() with smp_load_acquire().
> >
> > This way looks doable, but I guess it still can't avoid a concurrent update
> > on the parent stat, such as when __blkcg_rstat_flush() from blkg_release()
> > is in progress, another sibling blkg's bis is added, and meantime
> > blkcg_rstat_flush() is called.
>
> I realized that the use of cgroup_rstat_cpu_lock or the alternative was not
> safe enough for preventing concurrent parent blkg rstat updates.

Indeed.

> >
> > Another way is to add blkcg->stat_lock to cover __blkcg_rstat_flush();
> > what do you think of this way?
>
> I am thinking of adding a raw spinlock into blkg and taking it when doing
> blkcg_iostat_update(). This can guarantee no concurrent update to rstat
> data. It has to be a raw spinlock as it will be under the
> cgroup_rstat_cpu_lock raw spinlock.

We still need to be careful since both child and parent are updated in the
following blkcg_iostat_update():

		if (parent && parent->parent)
			blkcg_iostat_update(parent, &blkg->iostat.cur,
					    &blkg->iostat.last);

But the concurrent update on the parent blkg can be avoided by holding
parent->stat_lock only.

thanks,
Ming
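
For illustration, a minimal sketch of the per-blkg raw spinlock idea
discussed above. The stat_lock field, its initialization, and the reworked
update helper below are assumptions made only for this example; they are
not code from the posted patch:

    /* assumed new field in struct blkcg_gq */
    raw_spinlock_t stat_lock;

    /* assumed to be initialized once when the blkg is allocated */
    raw_spin_lock_init(&blkg->stat_lock);

    /*
     * Take the target blkg's own lock inside the update helper, so a
     * flush coming from __blkg_release() and a regular
     * cgroup_rstat_flush() can never update the same blkg (whether as
     * child or as parent) at the same time and corrupt the u64_stats
     * sequence count. A raw spinlock is needed because the regular
     * flush path already runs under the raw cgroup_rstat_cpu_lock.
     */
    static void blkcg_iostat_update(struct blkcg_gq *blkg,
                                    struct blkg_iostat *cur,
                                    struct blkg_iostat *last)
    {
            struct blkg_iostat delta;
            unsigned long flags;

            raw_spin_lock_irqsave(&blkg->stat_lock, flags);

            /*
             * Same folding as the existing helper: propagate cur - last
             * into blkg->iostat.cur. Interrupts are already disabled by
             * the irqsave above, so the plain begin/end variants of the
             * u64_stats sequence counter are enough here.
             */
            u64_stats_update_begin(&blkg->iostat.sync);
            blkg_iostat_set(&delta, cur);
            blkg_iostat_sub(&delta, last);
            blkg_iostat_add(&blkg->iostat.cur, &delta);
            blkg_iostat_set(last, cur);
            u64_stats_update_end(&blkg->iostat.sync);

            raw_spin_unlock_irqrestore(&blkg->stat_lock, flags);
    }

With the lock taken inside the helper, the parent update quoted above ends
up holding parent->stat_lock for the duration of the parent's update, which
is the "holding parent->stat_lock only" point made for the parent side.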