On Wed, May 24, 2023 at 10:50:29PM -0400, Waiman Long wrote:
> On 5/24/23 22:04, Ming Lei wrote:
> > On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
> > > On 5/24/23 11:43, Waiman Long wrote:
> > > > On 5/24/23 00:26, Ming Lei wrote:
> > > > > On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
> > > > > > On 5/23/23 23:51, Ming Lei wrote:
> > > > > > > As noted by Michal, the blkg_iostat_set's in the lockless list hold
> > > > > > > reference to blkg's to protect against their removal. Those blkg's
> > > > > > > hold reference to blkcg. When a cgroup is being destroyed,
> > > > > > > cgroup_rstat_flush() is only called at css_release_work_fn(), which
> > > > > > > is called when the blkcg reference count reaches 0. This circular
> > > > > > > dependency will prevent blkcg and some blkgs from being freed after
> > > > > > > they are made offline.
> > > > > > >
> > > > > > > It is less of a problem if the cgroup to be destroyed also has other
> > > > > > > controllers, like memory, that will call cgroup_rstat_flush() and so
> > > > > > > clean up the reference count. If block is the only controller that
> > > > > > > uses rstat, these offline blkcgs and blkgs may never be freed, leaking
> > > > > > > more and more memory over time.
> > > > > > >
> > > > > > > To prevent this potential memory leak:
> > > > > > >
> > > > > > > - flush blkcg per-cpu stats list in __blkg_release(), when no new stat
> > > > > > >   can be added
> > > > > > >
> > > > > > > - don't grab bio->bi_blkg reference when adding the stats into blkcg's
> > > > > > >   per-cpu stat list since all stats are guaranteed to be consumed before
> > > > > > >   releasing the blkg instance, and grabbing a blkg reference for stats
> > > > > > >   was the most fragile part of the original patch
> > > > > > >
> > > > > > > Based on Waiman's patch:
> > > > > > >
> > > > > > > https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/
> > > > > > >
> > > > > > > Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> > > > > > > Cc: Waiman Long <longman@xxxxxxxxxx>
> > > > > > > Cc: Tejun Heo <tj@xxxxxxxxxx>
> > > > > > > Cc: mkoutny@xxxxxxxx
> > > > > > > Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > > > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > > > > > ---
> > > > > > > V2:
> > > > > > >     - remove kernel/cgroup change, and call blkcg_rstat_flush()
> > > > > > >       to flush stat directly
> > > > > > >
> > > > > > >  block/blk-cgroup.c | 29 +++++++++++++++++++++--------
> > > > > > >  1 file changed, 21 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > > > > > > index 0ce64dd73cfe..ed0eb8896972 100644
> > > > > > > --- a/block/blk-cgroup.c
> > > > > > > +++ b/block/blk-cgroup.c
> > > > > > > @@ -34,6 +34,8 @@
> > > > > > >  #include "blk-ioprio.h"
> > > > > > >  #include "blk-throttle.h"
> > > > > > > +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> > > > > > > +
> > > > > > >
> > > > > > >  /*
> > > > > > >   * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> > > > > > >   * blkcg_pol_register_mutex nests outside of it and synchronizes entire
> > > > > > > @@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
> > > > > > >  static void __blkg_release(struct rcu_head *rcu)
> > > > > > >  {
> > > > > > >  	struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
> > > > > > > +	struct blkcg *blkcg = blkg->blkcg;
> > > > > > > +	int cpu;
> > > > > > >
> > > > > > >  #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
> > > > > > >  	WARN_ON(!bio_list_empty(&blkg->async_bios));
> > > > > > >  #endif
> > > > > > > +	/*
> > > > > > > +	 * Flush all the non-empty percpu lockless lists before releasing
> > > > > > > +	 * us, given these stats belong to us.
> > > > > > > +	 *
> > > > > > > +	 * cgroup locks aren't needed here since __blkcg_rstat_flush just
> > > > > > > +	 * propagates delta into blkg parent, which is live now.
> > > > > > > +	 */
> > > > > > > +	for_each_possible_cpu(cpu)
> > > > > > > +		__blkcg_rstat_flush(blkcg, cpu);
> > > > > > >
> > > > > > >  	/* release the blkcg and parent blkg refs this blkg has been holding */
> > > > > > >  	css_put(&blkg->blkcg->css);
> > > > > > > @@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
> > > > > > >  	u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
> > > > > > >  }
> > > > > > >
> > > > > > > -static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > > > > > > +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
> > > > > > >  {
> > > > > > > -	struct blkcg *blkcg = css_to_blkcg(css);
> > > > > > >  	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > > > > > >  	struct llist_node *lnode;
> > > > > > >  	struct blkg_iostat_set *bisc, *next_bisc;
> > > > > > >
> > > > > > > -	/* Root-level stats are sourced from system-wide IO stats */
> > > > > > > -	if (!cgroup_parent(css->cgroup))
> > > > > > > -		return;
> > > > > > > -
> > > > > > >  	rcu_read_lock();
> > > > > > >
> > > > > > >  	lnode = llist_del_all(lhead);
> > > > > > > @@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > > > > > >  		if (parent && parent->parent)
> > > > > > >  			blkcg_iostat_update(parent, &blkg->iostat.cur,
> > > > > > >  					    &blkg->iostat.last);
> > > > > > > -		percpu_ref_put(&blkg->refcnt);
> > > > > > >  	}
> > > > > > >
> > > > > > >  out:
> > > > > > >  	rcu_read_unlock();
> > > > > > >  }
> > > > > > >
> > > > > > > +static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > > > > > > +{
> > > > > > > +	/* Root-level stats are sourced from system-wide IO stats */
> > > > > > > +	if (cgroup_parent(css->cgroup))
> > > > > > > +		__blkcg_rstat_flush(css_to_blkcg(css), cpu);
> > > > > > > +}
> > > > > > > +
> > > > > >
> > > > > > I think it may not be safe to call __blkcg_rstat_flush() directly without
> > > > > > taking the cgroup_rstat_cpu_lock. That is why I added a helper to
> > > > > > kernel/cgroup/rstat.c in my patch to meet the locking requirement.
> > > > >
> > > > > All stats are removed from llist_del_all(), and the local list is
> > > > > iterated, then each blkg & its parent is touched in __blkcg_rstat_flush(),
> > > > > so can you explain a bit why cgroup locks are needed? For protecting
> > > > > what?
> > > >
> > > > You are right. The llist_del_all() call in blkcg_rstat_flush() is
> > > > atomic, so it is safe for concurrent execution, which is what the
> > > > cgroup_rstat_cpu_lock protects against. That may not be the case for
> > > > rstat callbacks of other controllers.
> > > > So I would suggest you add a comment to clarify that point. Other than
> > > > that, your patch looks good to me.
> > > >
> > > > Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
> > >
> > > After some more thought, I need to retract my reviewed-by tag for now.
> > > There is a slight possibility that blkcg_iostat_update() in
> > > blkcg_rstat_flush() can happen concurrently, which will corrupt the
> > > sequence count.
> >
> > llist_del_all() moves all 'bis' into one local list, and bis is one percpu
> > variable of blkg, so in theory the same bis won't be flushed at the same
> > time. And one bis should be touched in just one stat flush code path
> > because of llist_del_all().
> >
> > So 'bis' can still be thought of as being flushed in a serialized way.
> >
> > However, blk_cgroup_bio_start() runs concurrently with blkcg_rstat_flush(),
> > so once bis->lqueued is cleared in blkcg_rstat_flush(), this same bis
> > could be added to the percpu llist again, and __blkcg_rstat_flush() from
> > blkg_release() follows. This should be the only chance for a concurrent
> > stats update.
>
> That is what I have in mind. A __blkcg_rstat_flush() can come from
> blkg_release() and another one from the regular cgroup_rstat_flush*().

But the two blkgs can't be the same because blkg_release() is run from an
RCU callback.

> >
> > But blkg_release() is run in an RCU callback, so the previous flush has
> > been done, but a new flush can come, and another blkg's stat could be
> > added with the same story as above.
> >
> > > One way to avoid that is to synchronize it by cgroup_rstat_cpu_lock.
> > > Another way is to use the bisc->lqueued for synchronization.
> >
> > I'd avoid the external cgroup lock here.
> >
> > > In that case, you need to move WRITE_ONCE(bisc->lqueued, false) in
> > > blkcg_rstat_flush() to the end after all the blkcg_iostat_update() calls
> > > with smp_store_release() and replace the READ_ONCE(bis->lqueued) check
> > > in blk_cgroup_bio_start() with smp_load_acquire().
> >
> > This way looks doable, but I guess it still can't avoid a concurrent update
> > on the parent stat, such as when __blkcg_rstat_flush() from blkg_release()
> > is in progress, another sibling blkg's bis is added, and meantime
> > blkcg_rstat_flush() is called.
>
> I realized that the use of cgroup_rstat_cpu_lock or the alternative was not
> safe enough for preventing concurrent parent blkg rstat updates.

Indeed.

> >
> > Another way is to add blkcg->stat_lock to cover __blkcg_rstat_flush();
> > what do you think of this way?
>
> I am thinking of adding a raw spinlock into blkg and taking it when doing
> blkcg_iostat_update(). This can guarantee no concurrent update to rstat
> data. It has to be a raw spinlock as it will be under the
> cgroup_rstat_cpu_lock raw spinlock.

We still need to be careful since both child and parent are updated in the
following blkcg_iostat_update():

		if (parent && parent->parent)
			blkcg_iostat_update(parent, &blkg->iostat.cur,
					    &blkg->iostat.last);

But the concurrent update on the parent blkg can be avoided by holding
parent->stat_lock only.

thanks,
Ming
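
For illustration, a minimal sketch of the per-blkg raw spinlock idea
discussed above. The stat_lock field, its initialization, and the reworked
update helper below are assumptions made only for this example; they are
not code from the posted patch:

    /* assumed new field in struct blkcg_gq */
    raw_spinlock_t stat_lock;

    /* assumed to be initialized once when the blkg is allocated */
    raw_spin_lock_init(&blkg->stat_lock);

    /*
     * Take the target blkg's own lock inside the update helper, so a
     * flush coming from __blkg_release() and a regular
     * cgroup_rstat_flush() can never update the same blkg (whether as
     * child or as parent) at the same time and corrupt the u64_stats
     * sequence count. A raw spinlock is needed because the regular
     * flush path already runs under the raw cgroup_rstat_cpu_lock.
     */
    static void blkcg_iostat_update(struct blkcg_gq *blkg,
                                    struct blkg_iostat *cur,
                                    struct blkg_iostat *last)
    {
            struct blkg_iostat delta;
            unsigned long flags;

            raw_spin_lock_irqsave(&blkg->stat_lock, flags);

            /*
             * Same folding as the existing helper: propagate cur - last
             * into blkg->iostat.cur. Interrupts are already disabled by
             * the irqsave above, so the plain begin/end variants of the
             * u64_stats sequence counter are enough here.
             */
            u64_stats_update_begin(&blkg->iostat.sync);
            blkg_iostat_set(&delta, cur);
            blkg_iostat_sub(&delta, last);
            blkg_iostat_add(&blkg->iostat.cur, &delta);
            blkg_iostat_set(last, cur);
            u64_stats_update_end(&blkg->iostat.sync);

            raw_spin_unlock_irqrestore(&blkg->stat_lock, flags);
    }

With the lock taken inside the helper, the parent update quoted above ends
up holding parent->stat_lock for the duration of the parent's update, which
is the "holding parent->stat_lock only" point made for the parent side.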