On Wed, May 24, 2023 at 7:50 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 5/24/23 22:04, Ming Lei wrote:
> > On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
> >> On 5/24/23 11:43, Waiman Long wrote:
> >>> On 5/24/23 00:26, Ming Lei wrote:
> >>>> On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
> >>>>> On 5/23/23 23:51, Ming Lei wrote:
> >>>>>> As noted by Michal, the blkg_iostat_set's in the lockless list hold
> >>>>>> reference to blkg's to protect against their removal. Those blkg's
> >>>>>> hold reference to blkcg. When a cgroup is being destroyed,
> >>>>>> cgroup_rstat_flush() is only called at css_release_work_fn() which
> >>>>>> is called when the blkcg reference count reaches 0. This circular
> >>>>>> dependency will prevent blkcg and some blkgs from being freed after
> >>>>>> they are made offline.
> >>>>>>
> >>>>>> It is less a problem if the cgroup to be destroyed also has other
> >>>>>> controllers like memory that will call cgroup_rstat_flush() which will
> >>>>>> clean up the reference count. If block is the only controller that uses
> >>>>>> rstat, these offline blkcg and blkgs may never be freed leaking more
> >>>>>> and more memory over time.
> >>>>>>
> >>>>>> To prevent this potential memory leak:
> >>>>>>
> >>>>>> - flush blkcg per-cpu stats list in __blkg_release(), when no new stat
> >>>>>>   can be added
> >>>>>>
> >>>>>> - don't grab bio->bi_blkg reference when adding the stats into blkcg's
> >>>>>>   per-cpu stat list since all stats are guaranteed to be consumed before
> >>>>>>   releasing blkg instance, and grabbing blkg reference for stats was the
> >>>>>>   most fragile part of original patch
> >>>>>>
> >>>>>> Based on Waiman's patch:
> >>>>>>
> >>>>>> https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/
> >>>>>>
> >>>>>> Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> >>>>>> Cc: Waiman Long <longman@xxxxxxxxxx>
> >>>>>> Cc: Tejun Heo <tj@xxxxxxxxxx>
> >>>>>> Cc: mkoutny@xxxxxxxx
> >>>>>> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> >>>>>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> >>>>>> ---
> >>>>>> V2:
> >>>>>>     - remove kernel/cgroup change, and call blkcg_rstat_flush()
> >>>>>>       to flush stat directly
> >>>>>>
> >>>>>>  block/blk-cgroup.c | 29 +++++++++++++++++++++--------
> >>>>>>  1 file changed, 21 insertions(+), 8 deletions(-)
> >>>>>>
> >>>>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> >>>>>> index 0ce64dd73cfe..ed0eb8896972 100644
> >>>>>> --- a/block/blk-cgroup.c
> >>>>>> +++ b/block/blk-cgroup.c
> >>>>>> @@ -34,6 +34,8 @@
> >>>>>>  #include "blk-ioprio.h"
> >>>>>>  #include "blk-throttle.h"
> >>>>>> +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> >>>>>> +
> >>>>>>  /*
> >>>>>>   * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> >>>>>>   * blkcg_pol_register_mutex nests outside of it and synchronizes entire
> >>>>>> @@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
> >>>>>>  static void __blkg_release(struct rcu_head *rcu)
> >>>>>>  {
> >>>>>>      struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
> >>>>>> +    struct blkcg *blkcg = blkg->blkcg;
> >>>>>> +    int cpu;
> >>>>>>  #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
> >>>>>>      WARN_ON(!bio_list_empty(&blkg->async_bios));
> >>>>>>  #endif
> >>>>>> +    /*
> >>>>>> +     * Flush all the non-empty percpu lockless lists before releasing
> >>>>>> +     * us, given these stat belongs to us.
> >>>>>> +     *
> >>>>>> +     * cgroup locks aren't needed here since __blkcg_rstat_flush just
> >>>>>> +     * propagates delta into blkg parent, which is live now.
> >>>>>> +     */
> >>>>>> +    for_each_possible_cpu(cpu)
> >>>>>> +        __blkcg_rstat_flush(blkcg, cpu);
> >>>>>>      /* release the blkcg and parent blkg refs this blkg has been holding */
> >>>>>>      css_put(&blkg->blkcg->css);
> >>>>>> @@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
> >>>>>>      u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
> >>>>>>  }
> >>>>>> -static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>> +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
> >>>>>>  {
> >>>>>> -    struct blkcg *blkcg = css_to_blkcg(css);
> >>>>>>      struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> >>>>>>      struct llist_node *lnode;
> >>>>>>      struct blkg_iostat_set *bisc, *next_bisc;
> >>>>>> -    /* Root-level stats are sourced from system-wide IO stats */
> >>>>>> -    if (!cgroup_parent(css->cgroup))
> >>>>>> -        return;
> >>>>>> -
> >>>>>>      rcu_read_lock();
> >>>>>>      lnode = llist_del_all(lhead);
> >>>>>> @@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>>          if (parent && parent->parent)
> >>>>>>              blkcg_iostat_update(parent, &blkg->iostat.cur,
> >>>>>>                          &blkg->iostat.last);
> >>>>>> -        percpu_ref_put(&blkg->refcnt);
> >>>>>>      }
> >>>>>>  out:
> >>>>>>      rcu_read_unlock();
> >>>>>>  }
> >>>>>> +static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>> +{
> >>>>>> +    /* Root-level stats are sourced from system-wide IO stats */
> >>>>>> +    if (cgroup_parent(css->cgroup))
> >>>>>> +        __blkcg_rstat_flush(css_to_blkcg(css), cpu);
> >>>>>> +}
> >>>>>> +
> >>>>> I think it may not be safe to call __blkcg_rstat_flush() directly without
> >>>>> taking the cgroup_rstat_cpu_lock. That is why I added a helper to
> >>>>> kernel/cgroup/rstat.c in my patch to meet the locking requirement.
> >>>> All stats are removed from llist_del_all(), and the local list is
> >>>> iterated, then each blkg & its parent is touched in __blkcg_rstat_flush(),
> >>>> so can you explain a bit why cgroup locks are needed? For protecting
> >>>> what?
> >>> You are right. The llist_del_all() call in blkcg_rstat_flush() is
> >>> atomic, so it is safe for concurrent execution, which is what the
> >>> cgroup_rstat_cpu_lock protects against. That may not be the case for
> >>> rstat callbacks of other controllers. So I suggest you add a comment to
> >>> clarify that point. Other than that, your patch looks good to me.
> >>>
> >>> Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
> >> After some more thought, I need to retract my reviewed-by tag for now.
> >> There is a slight possibility that blkcg_iostat_update() in
> >> blkcg_rstat_flush() can happen concurrently, which will corrupt the
> >> sequence count.
> > llist_del_all() moves all 'bis' into one local list, and a bis is a percpu
> > variable of the blkg, so in theory the same bis won't be flushed at the
> > same time. And one bis should be touched in just one of the stat flush
> > code paths because of llist_del_all().
> >
> > So a 'bis' can still be thought of as being flushed in a serialized way.
> >
> > However, blk_cgroup_bio_start() runs concurrently with blkcg_rstat_flush(),
> > so once bis->lqueued is cleared in blkcg_rstat_flush(), this same bis
> > could be added to the percpu llist again, and the __blkcg_rstat_flush()
> > from blkg_release() follows. This should be the only chance for
> > concurrent stats update.
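
(If I follow the race being described, the interleaving is roughly the
following; this is only an illustration of the paragraph above, with the
blk_cgroup_bio_start() enqueue details paraphrased:)

    CPU A: blkcg_rstat_flush()            CPU B
    --------------------------            -----------------------------------
    llist_del_all()
    WRITE_ONCE(bisc->lqueued, false)
                                          blk_cgroup_bio_start():
                                            sees lqueued == false, adds the
                                            same bisc back to the per-cpu
                                            llist, lqueued = true
                                          __blkg_release() (RCU callback):
                                            __blkcg_rstat_flush() consumes the
                                            re-added bisc and updates the
                                            blkg / parent iostat
    blkcg_iostat_update() on the same
    blkg / parent, concurrently with
    CPU B's update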
>
> That is what I have in mind. One __blkcg_rstat_flush() can come from
> blkg_release() and another one from the regular cgroup_rstat_flush*().
>
> > But, blkg_release() is run in an RCU callback, so the previous flush has
> > been done, but a new flush can come, and another blkg's stat could be
> > added with the same story as above.
> >
> >> One way to avoid that is to synchronize it by cgroup_rstat_cpu_lock.
> >> Another way is to use the bisc->lqueued for synchronization.
> > I'd avoid the external cgroup lock here.
> >
> >> In that case, you need to move WRITE_ONCE(bisc->lqueued, false) in
> >> blkcg_rstat_flush() to the end, after all the blkcg_iostat_update()
> >> calls, with smp_store_release(), and replace the READ_ONCE(bis->lqueued)
> >> check in blk_cgroup_bio_start() with smp_load_acquire().
> > This way looks doable, but I guess it still can't avoid a concurrent
> > update on the parent stat, such as when the __blkcg_rstat_flush() from
> > blkg_release() is in progress, another sibling blkg's bis is added, and
> > meanwhile blkcg_rstat_flush() is called.
> I realized that the use of cgroup_rstat_cpu_lock or the alternative was
> not safe enough for preventing concurrent parent blkg rstat updates.
>
> > Another way is to add blkcg->stat_lock for covering __blkcg_rstat_flush();
> > what do you think of this way?
>
> I am thinking of adding a raw spinlock into blkg and taking it when doing
> blkcg_iostat_update(). This can guarantee no concurrent update to the rstat
> data. It has to be a raw spinlock as it will be taken under the
> cgroup_rstat_cpu_lock raw spinlock.

Hi Waiman,

I don't have context about blkcg, but isn't this exactly what
cgroup_rstat_lock does? Is it too expensive to just call
cgroup_rstat_flush() here?

> The use of u64_stats_fetch_begin/u64_stats_fetch_retry in retrieving the
> rstat data from bisc can happen concurrently with an update without further
> synchronization. However, there can be at most one update at any time.
>
> Cheers,
> Longman
>
> >
> > Thanks,
> > Ming
> >
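
For reference, a rough sketch of the ->lqueued ordering change Waiman
describes above (clear the flag only after the updates, with a
release/acquire pair). The surrounding code is paraphrased from
blk-cgroup.c and is only meant to show where the barriers would sit; it
is not a tested patch:

    /* flush side, per-bisc work in __blkcg_rstat_flush() */
    llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
        struct blkcg_gq *blkg = bisc->blkg;
        struct blkcg_gq *parent = blkg->parent;

        /*
         * ... snapshot bisc->cur and run blkcg_iostat_update() on
         * blkg and, if it has a non-root parent, on the parent ...
         */

        /*
         * Clear lqueued only after the updates above are done; pairs
         * with the smp_load_acquire() in the enqueue path below.
         */
        smp_store_release(&bisc->lqueued, false);
    }

    /* enqueue side, in blk_cgroup_bio_start() */
    if (!smp_load_acquire(&bis->lqueued)) {
        llist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));
        WRITE_ONCE(bis->lqueued, true);
    }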
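
And a minimal sketch of the per-blkg raw spinlock Waiman mentions, taken
around the stat propagation in blkcg_iostat_update(). The iostat_lock
field is hypothetical (it would have to be added to struct blkcg_gq and
initialized when the blkg is allocated); the body otherwise mirrors what
blkcg_iostat_update() roughly does today:

    static void blkcg_iostat_update(struct blkcg_gq *blkg,
                                    struct blkg_iostat *cur,
                                    struct blkg_iostat *last)
    {
        struct blkg_iostat delta;
        unsigned long flags;

        /*
         * Serialize the two possible flushers (a regular
         * cgroup_rstat_flush() and the flush from __blkg_release())
         * that may update the same blkg or its parent. It has to be a
         * raw spinlock because the rstat flush path already holds the
         * raw cgroup_rstat_cpu_lock around this callback.
         */
        raw_spin_lock_irqsave(&blkg->iostat_lock, flags);  /* hypothetical field */
        u64_stats_update_begin(&blkg->iostat.sync);

        /* propagate the per-cpu delta into the cumulative counters */
        blkg_iostat_set(&delta, cur);
        blkg_iostat_sub(&delta, last);
        blkg_iostat_add(&blkg->iostat.cur, &delta);
        blkg_iostat_add(last, &delta);

        u64_stats_update_end(&blkg->iostat.sync);
        raw_spin_unlock_irqrestore(&blkg->iostat_lock, flags);
    }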