On Thu, May 25, 2023 at 12:06:07PM -0400, Waiman Long wrote: > On 5/25/23 12:01, Waiman Long wrote: > > As noted by Michal, the blkg_iostat_set's in the lockless list hold > > reference to blkg's to protect against their removal. Those blkg's > > hold reference to blkcg. When a cgroup is being destroyed, > > cgroup_rstat_flush() is only called at css_release_work_fn() which > > is called when the blkcg reference count reaches 0. This circular > > dependency will prevent blkcg and some blkgs from being freed after > > they are made offline. > > > > It is less a problem if the cgroup to be destroyed also has other > > controllers like memory that will call cgroup_rstat_flush() which will > > clean up the reference count. If block is the only controller that uses > > rstat, these offline blkcg and blkgs may never be freed leaking more > > and more memory over time. > > > > To prevent this potential memory leak: > > > > - flush blkcg per-cpu stats list in __blkg_release(), when no new stat > > can be added to avoid use-after-free of the percpu blkg_iostat_set in > > futue cgroup_rstat_flush*() calls. > > > > - add a cgroup_rstat_flush_acquire() helper and call it to acquire > > cgroup_rstat_lock to block concurrent execution of other > > cgroup_rstat_flush*() calls > > > > - don't grab bio->bi_blkg reference when adding the stats into blkcg's > > per-cpu stat list since all stats are guaranteed to be consumed before > > releasing blkg instance, and grabbing blkg reference for stats was > > the most fragile part of original patch > > > > Based on Waiman's patch: > > > > https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/ > > > > Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()") > > Cc: stable@xxxxxxxxxxxxxxx > > Reported-by: Jay Shin <jaeshin@xxxxxxxxxx> > > Cc: Waiman Long <longman@xxxxxxxxxx> > > Cc: Tejun Heo <tj@xxxxxxxxxx> > > Cc: mkoutny@xxxxxxxx > > Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > > Co-developed-by: Ming Lei <ming.lei@xxxxxxxxxx> > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> > > Signed-off-by: Waiman Long <longman@xxxxxxxxxx> > > --- > > block/blk-cgroup.c | 57 +++++++++++++++++++++++++++++++----------- > > include/linux/cgroup.h | 1 + > > kernel/cgroup/rstat.c | 15 ++++++++++- > > 3 files changed, 57 insertions(+), 16 deletions(-) > > This is my counter-proposal to Ming's v3 patch. The major difference is that > I used the existing cgroup_rstat_lock instead of adding a new internal lock. > This minimizes performance impact to existing cgroup_rstat_flush*() call The added internal lock has ~zero perf impact on rstat flush cause the lock won't be contended basically. > while achieving the same objective. I am fine with Ming current v3 patch if > we decide to go that way. As I mentioned, the main motivation with internal lock is to make the fix as simple as possible since cross-subsystem change isn't involved, and I am fine with any following cleanup or improvement on current blkg rstat flush. Another benefit with this internal lock is that race in blkcg_reset_stats() can be avoided. Thanks, Ming