On Wed, May 24, 2023 at 8:04 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 5/24/23 22:55, Yosry Ahmed wrote:
> > On Wed, May 24, 2023 at 7:50 PM Waiman Long <longman@xxxxxxxxxx> wrote:
> >> On 5/24/23 22:04, Ming Lei wrote:
> >>> On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
> >>>> On 5/24/23 11:43, Waiman Long wrote:
> >>>>> On 5/24/23 00:26, Ming Lei wrote:
> >>>>>> On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
> >>>>>>> On 5/23/23 23:51, Ming Lei wrote:
> >>>>>>>> As noted by Michal, the blkg_iostat_set's in the lockless list hold
> >>>>>>>> reference to blkg's to protect against their removal. Those blkg's
> >>>>>>>> hold reference to blkcg. When a cgroup is being destroyed,
> >>>>>>>> cgroup_rstat_flush() is only called at css_release_work_fn() which
> >>>>>>>> is called when the blkcg reference count reaches 0. This circular
> >>>>>>>> dependency will prevent blkcg and some blkgs from being freed after
> >>>>>>>> they are made offline.
> >>>>>>>>
> >>>>>>>> It is less a problem if the cgroup to be destroyed also has other
> >>>>>>>> controllers like memory that will call cgroup_rstat_flush() which will
> >>>>>>>> clean up the reference count. If block is the only controller that uses
> >>>>>>>> rstat, these offline blkcg and blkgs may never be freed leaking more
> >>>>>>>> and more memory over time.
> >>>>>>>>
> >>>>>>>> To prevent this potential memory leak:
> >>>>>>>>
> >>>>>>>> - flush blkcg per-cpu stats list in __blkg_release(), when no new stat
> >>>>>>>>   can be added
> >>>>>>>>
> >>>>>>>> - don't grab bio->bi_blkg reference when adding the stats into blkcg's
> >>>>>>>>   per-cpu stat list since all stats are guaranteed to be consumed before
> >>>>>>>>   releasing blkg instance, and grabbing blkg reference for stats was the
> >>>>>>>>   most fragile part of original patch
> >>>>>>>>
> >>>>>>>> Based on Waiman's patch:
> >>>>>>>>
> >>>>>>>> https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/
> >>>>>>>>
> >>>>>>>> Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> >>>>>>>> Cc: Waiman Long <longman@xxxxxxxxxx>
> >>>>>>>> Cc: Tejun Heo <tj@xxxxxxxxxx>
> >>>>>>>> Cc: mkoutny@xxxxxxxx
> >>>>>>>> Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> >>>>>>>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> >>>>>>>> ---
> >>>>>>>> V2:
> >>>>>>>> 	- remove kernel/cgroup change, and call blkcg_rstat_flush()
> >>>>>>>> 	to flush stat directly
> >>>>>>>>
> >>>>>>>>   block/blk-cgroup.c | 29 +++++++++++++++++++++--------
> >>>>>>>>   1 file changed, 21 insertions(+), 8 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> >>>>>>>> index 0ce64dd73cfe..ed0eb8896972 100644
> >>>>>>>> --- a/block/blk-cgroup.c
> >>>>>>>> +++ b/block/blk-cgroup.c
> >>>>>>>> @@ -34,6 +34,8 @@
> >>>>>>>>   #include "blk-ioprio.h"
> >>>>>>>>   #include "blk-throttle.h"
> >>>>>>>> +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
> >>>>>>>> +
> >>>>>>>>   /*
> >>>>>>>>    * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> >>>>>>>>    * blkcg_pol_register_mutex nests outside of it and synchronizes entire
> >>>>>>>> @@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
> >>>>>>>>   static void __blkg_release(struct rcu_head *rcu)
> >>>>>>>>   {
> >>>>>>>>       struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
> >>>>>>>> +    struct blkcg *blkcg = blkg->blkcg;
> >>>>>>>> +    int cpu;
> >>>>>>>>   #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
> >>>>>>>>       WARN_ON(!bio_list_empty(&blkg->async_bios));
> >>>>>>>>   #endif
> >>>>>>>> +    /*
> >>>>>>>> +     * Flush all the non-empty percpu lockless lists before releasing
> >>>>>>>> +     * us, given these stat belongs to us.
> >>>>>>>> +     *
> >>>>>>>> +     * cgroup locks aren't needed here since __blkcg_rstat_flush just
> >>>>>>>> +     * propagates delta into blkg parent, which is live now.
> >>>>>>>> +     */
> >>>>>>>> +    for_each_possible_cpu(cpu)
> >>>>>>>> +        __blkcg_rstat_flush(blkcg, cpu);
> >>>>>>>>       /* release the blkcg and parent blkg refs this blkg has been holding */
> >>>>>>>>       css_put(&blkg->blkcg->css);
> >>>>>>>> @@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
> >>>>>>>>       u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
> >>>>>>>>   }
> >>>>>>>> -static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>>>> +static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
> >>>>>>>>   {
> >>>>>>>> -    struct blkcg *blkcg = css_to_blkcg(css);
> >>>>>>>>       struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> >>>>>>>>       struct llist_node *lnode;
> >>>>>>>>       struct blkg_iostat_set *bisc, *next_bisc;
> >>>>>>>> -    /* Root-level stats are sourced from system-wide IO stats */
> >>>>>>>> -    if (!cgroup_parent(css->cgroup))
> >>>>>>>> -        return;
> >>>>>>>> -
> >>>>>>>>       rcu_read_lock();
> >>>>>>>>       lnode = llist_del_all(lhead);
> >>>>>>>> @@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>>>>           if (parent && parent->parent)
> >>>>>>>>               blkcg_iostat_update(parent, &blkg->iostat.cur,
> >>>>>>>>                           &blkg->iostat.last);
> >>>>>>>> -        percpu_ref_put(&blkg->refcnt);
> >>>>>>>>       }
> >>>>>>>>   out:
> >>>>>>>>       rcu_read_unlock();
> >>>>>>>>   }
> >>>>>>>> +static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >>>>>>>> +{
> >>>>>>>> +    /* Root-level stats are sourced from system-wide IO stats */
> >>>>>>>> +    if (cgroup_parent(css->cgroup))
> >>>>>>>> +        __blkcg_rstat_flush(css_to_blkcg(css), cpu);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>> I think it may not be safe to call __blkcg_rstat_flush() directly without
> >>>>>>> taking the cgroup_rstat_cpu_lock. That is why I added a helper to
> >>>>>>> kernel/cgroup/rstat.c in my patch to meet the locking requirement.
> >>>>>> All stats are removed from llist_del_all(), and the local list is
> >>>>>> iterated, then each blkg & its parent is touched in __blkcg_rstat_flush(),
> >>>>>> so can you explain a bit why cgroup locks are needed? For protecting
> >>>>>> what?
> >>>>> You are right. The llist_del_all() call in blkcg_rstat_flush() is
> >>>>> atomic, so it is safe for concurrent execution, which is what the
> >>>>> cgroup_rstat_cpu_lock protects against. That may not be the case for
> >>>>> rstat callbacks of other controllers, so I would suggest you add a
> >>>>> comment to clarify that point. Other than that, your patch looks good to
> >>>>> me.
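
To make the pattern being debated concrete, here is a minimal, simplified sketch of the percpu lockless-list scheme (helper names and the body details are illustrative only, not the exact blk-cgroup code; the real code also snapshots counters under a u64_stats seqcount and, in the original patch, took a blkg reference):

/*
 * Simplified sketch: each blkg has a percpu blkg_iostat_set; producers
 * queue it at most once on the percpu llist, and a flusher atomically
 * steals the whole list, so each node is walked by only one flusher.
 */
struct blkg_iostat_set {
        struct llist_node lnode;
        bool lqueued;           /* already queued on the percpu llist? */
        /* ... per-cpu stat counters ... */
};

/* producer side, roughly what blk_cgroup_bio_start() does */
static void stat_queue(struct llist_head *lhead, struct blkg_iostat_set *bis)
{
        if (!READ_ONCE(bis->lqueued)) {
                llist_add(&bis->lnode, lhead);
                WRITE_ONCE(bis->lqueued, true);
        }
}

/* consumer side, roughly what __blkcg_rstat_flush() does */
static void stat_flush(struct llist_head *lhead)
{
        struct llist_node *lnode = llist_del_all(lhead);   /* atomic steal */
        struct blkg_iostat_set *bis, *next;

        llist_for_each_entry_safe(bis, next, lnode, lnode) {
                WRITE_ONCE(bis->lqueued, false);
                /* fold bis into blkg->iostat and propagate to the parent */
        }
}

The atomic llist_del_all() is what lets two flushers run without ever walking the same node twice, which is the property Ming and Waiman are relying on above.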
> >>>>>
> >>>>> Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
> >>>> After some more thought, I need to retract my reviewed-by tag for now. There
> >>>> is a slight possibility that blkcg_iostat_update() in blkcg_rstat_flush()
> >>>> can happen concurrently, which will corrupt the sequence count.
> >>> llist_del_all() moves all 'bis' into one local list, and bis is one percpu
> >>> variable of blkg, so in theory the same bis won't be flushed at the same
> >>> time. And one bis should be touched in just one of the stat flush code paths
> >>> because of llist_del_all().
> >>>
> >>> So 'bis' can still be thought of as being flushed in a serialized way.
> >>>
> >>> However, blk_cgroup_bio_start() runs concurrently with blkcg_rstat_flush(),
> >>> so once bis->lqueued is cleared in blkcg_rstat_flush(), this same bis
> >>> could be added to the percpu llist again, and then the __blkcg_rstat_flush()
> >>> from blkg_release() follows. This should be the only chance for a concurrent
> >>> stats update.
> >> That is what I have in mind: one __blkcg_rstat_flush() can come from
> >> blkg_release() and another one from the regular cgroup_rstat_flush*().
> >>
> >>> But blkg_release() is run in an RCU callback, so the previous flush has
> >>> been done; a new flush can still come, though, and another blkg's stat could
> >>> be added with the same story as above.
> >>>
> >>>> One way to avoid that is to synchronize it by cgroup_rstat_cpu_lock. Another
> >>>> way is to use the bisc->lqueued for synchronization.
> >>> I'd avoid the external cgroup lock here.
> >>>
> >>>> In that case, you need to move the WRITE_ONCE(bisc->lqueued, false) in
> >>>> blkcg_rstat_flush() to the end, after all the blkcg_iostat_update() calls,
> >>>> using smp_store_release(), and replace the READ_ONCE(bis->lqueued) check
> >>>> in blk_cgroup_bio_start() with smp_load_acquire().
> >>> This way looks doable, but I guess it still can't avoid a concurrent update
> >>> on the parent stat, such as when the __blkcg_rstat_flush() from blkg_release()
> >>> is in progress, another sibling blkg's bis is added, and meanwhile
> >>> blkcg_rstat_flush() is called.
> >> I realized that the use of cgroup_rstat_cpu_lock or the alternative was
> >> not safe enough for preventing concurrent parent blkg rstat updates.
> >>
> >>> Another way is to add blkcg->stat_lock for covering __blkcg_rstat_flush().
> >>> What do you think of this way?
> >> I am thinking of adding a raw spinlock into blkg and taking it when doing
> >> blkcg_iostat_update(). This can guarantee no concurrent update to the rstat
> >> data. It has to be a raw spinlock as it will be taken under the
> >> cgroup_rstat_cpu_lock raw spinlock.
> > Hi Waiman,
> >
> > I don't have context about blkcg, but isn't this exactly what
> > cgroup_rstat_lock does? Is it too expensive to just call
> > cgroup_rstat_flush() here?
>
> I have thought about that too. However, in my test, just calling
> cgroup_rstat_flush() in blkcg_destroy_blkgs() did not prevent the number of
> dying blkcgs from increasing, meaning that there are still some extra
> references blocking their removal. I haven't figured out exactly why that
> is the case; there may still be some races that we have not fully
> understood yet. On the other hand, Ming's patch is verified to not do
> that since it does not take extra blkg references, so I am leaning toward
> his patch now. I just have to make sure that there is no concurrent
> rstat update.
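
For concreteness, the release/acquire variant Waiman describes above would look roughly like the following (an illustrative sketch only, not code from the posted patch; the per-cpu snapshot taken under bisc->sync is elided for brevity):

/* flush side: publish "not queued" only after the updates are complete */
llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
        struct blkcg_gq *blkg = bisc->blkg;
        struct blkcg_gq *parent = blkg->parent;

        blkcg_iostat_update(blkg, &bisc->cur, &bisc->last);
        if (parent && parent->parent)
                blkcg_iostat_update(parent, &blkg->iostat.cur,
                                    &blkg->iostat.last);
        /* pairs with the smp_load_acquire() in blk_cgroup_bio_start() */
        smp_store_release(&bisc->lqueued, false);
}

/* submission side: only requeue once the previous flush is fully done */
if (!smp_load_acquire(&bis->lqueued)) {
        llist_add(&bis->lnode, lhead);
        WRITE_ONCE(bis->lqueued, true);
}

Whether this alone closes the window Ming points out, where a sibling blkg's stats are folded into the shared parent from two flush paths at once, is what the rest of the thread is weighing.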
Oh, I didn't mean to change Ming's approach of not taking extra blkg
references, just replacing:

	for_each_possible_cpu(cpu)
		__blkcg_rstat_flush(blkcg, cpu);

with:

	cgroup_rstat_flush(blkcg->css.cgroup);

in the current patch, so that we get the protection from cgroup_rstat_lock
against concurrent rstat updates.

>
> Cheers,
> Longman
>
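
In code, the substitution Yosry suggests would amount to something like this in __blkg_release() (sketch only, untested; whether cgroup_rstat_flush() may safely be called from an RCU callback context is a separate question the thread does not settle here):

static void __blkg_release(struct rcu_head *rcu)
{
        struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);

#ifdef CONFIG_BLK_CGROUP_PUNT_BIO
        WARN_ON(!bio_list_empty(&blkg->async_bios));
#endif
        /*
         * Flush through the regular rstat path so that cgroup_rstat_lock
         * serializes this flush against concurrent rstat updates.
         */
        cgroup_rstat_flush(blkg->blkcg->css.cgroup);

        /* release the blkcg and parent blkg refs this blkg has been holding */
        css_put(&blkg->blkcg->css);
        blkg_free(blkg);
}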