On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
> On 5/23/23 23:51, Ming Lei wrote:
> > As noted by Michal, the blkg_iostat_set's in the lockless list hold
> > reference to blkg's to protect against their removal. Those blkg's
> > hold reference to blkcg. When a cgroup is being destroyed,
> > cgroup_rstat_flush() is only called at css_release_work_fn() which
> > is called when the blkcg reference count reaches 0. This circular
> > dependency will prevent blkcg and some blkgs from being freed after
> > they are made offline.
> >
> > It is less a problem if the cgroup to be destroyed also has other
> > controllers like memory that will call cgroup_rstat_flush() which will
> > clean up the reference count. If block is the only controller that uses
> > rstat, these offline blkcg and blkgs may never be freed, leaking more
> > and more memory over time.
> >
> > To prevent this potential memory leak:
> >
> > - flush blkcg per-cpu stats list in __blkg_release(), when no new stat
> >   can be added
> >
> > - don't grab bio->bi_blkg reference when adding the stats into blkcg's
> >   per-cpu stat list since all stats are guaranteed to be consumed before
> >   releasing blkg instance, and grabbing blkg reference for stats was the
> >   most fragile part of original patch
> >
> > Based on Waiman's patch:
> >
> > https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/
> >
> > Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> > Cc: Waiman Long <longman@xxxxxxxxxx>
> > Cc: Tejun Heo <tj@xxxxxxxxxx>
> > Cc: mkoutny@xxxxxxxx
> > Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > ---
> > V2:
> > 	- remove kernel/cgroup change, and call blkcg_rstat_flush()
> > 	  to flush stat directly
> >
> >  block/blk-cgroup.c | 29 +++++++++++++++++++++--------
> >  1 file changed, 21 insertions(+), 8 deletions(-)
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 0ce64dd73cfe..ed0eb8896972 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -34,6 +34,8 @@
> >  #include "blk-ioprio.h"
> >  #include "blk-throttle.h"
> >
> > +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> > +
> >  /*
> >   * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> >   * blkcg_pol_register_mutex nests outside of it and synchronizes entire
> > @@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
> >  static void __blkg_release(struct rcu_head *rcu)
> >  {
> >  	struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
> > +	struct blkcg *blkcg = blkg->blkcg;
> > +	int cpu;
> >
> >  #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
> >  	WARN_ON(!bio_list_empty(&blkg->async_bios));
> >  #endif
> > +	/*
> > +	 * Flush all the non-empty percpu lockless lists before releasing
> > +	 * us, given these stat belongs to us.
> > +	 *
> > +	 * cgroup locks aren't needed here since __blkcg_rstat_flush just
> > +	 * propagates delta into blkg parent, which is live now.
> > +	 */
> > +	for_each_possible_cpu(cpu)
> > +		__blkcg_rstat_flush(blkcg, cpu);
> >
> >  	/* release the blkcg and parent blkg refs this blkg has been holding */
> >  	css_put(&blkg->blkcg->css);
> > @@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
> >  	u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
> >  }
> >
> > -static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
> >  {
> > -	struct blkcg *blkcg = css_to_blkcg(css);
> >  	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> >  	struct llist_node *lnode;
> >  	struct blkg_iostat_set *bisc, *next_bisc;
> >
> > -	/* Root-level stats are sourced from system-wide IO stats */
> > -	if (!cgroup_parent(css->cgroup))
> > -		return;
> > -
> >  	rcu_read_lock();
> >
> >  	lnode = llist_del_all(lhead);
> > @@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >  		if (parent && parent->parent)
> >  			blkcg_iostat_update(parent, &blkg->iostat.cur,
> >  					    &blkg->iostat.last);
> > -		percpu_ref_put(&blkg->refcnt);
> >  	}
> >
> >  out:
> >  	rcu_read_unlock();
> >  }
> >
> > +static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > +{
> > +	/* Root-level stats are sourced from system-wide IO stats */
> > +	if (cgroup_parent(css->cgroup))
> > +		__blkcg_rstat_flush(css_to_blkcg(css), cpu);
> > +}
> > +
>
> I think it may not be safe to call __blkcg_rstat_flush() directly without
> taking the cgroup_rstat_cpu_lock. That is why I added a helper to
> kernel/cgroup/rstat.c in my patch to meet the locking requirement.

All stats are removed via llist_del_all(), the detached local list is then
iterated, and each blkg and its parent are updated in __blkcg_rstat_flush(),
so can you explain a bit why cgroup locks are needed here? What would they
protect?

Thanks,
Ming