On Thu, May 25, 2023 at 12:02:01AM -0400, Waiman Long wrote:
> 
> On 5/24/23 23:47, Ming Lei wrote:
> > On Wed, May 24, 2023 at 10:50:29PM -0400, Waiman Long wrote:
> > > On 5/24/23 22:04, Ming Lei wrote:
> > > > On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
> > > > > On 5/24/23 11:43, Waiman Long wrote:
> > > > 
> > > > llist_del_all() moves all 'bis' into one local list, and bis is one
> > > > percpu variable of blkg, so in theory the same bis won't be flushed
> > > > at the same time. And one bis should be touched in just one stat
> > > > flush code path because of llist_del_all().
> > > > 
> > > > So 'bis' can still be thought of as being flushed in a serialized
> > > > way.
> > > > 
> > > > However, blk_cgroup_bio_start() runs concurrently with
> > > > blkcg_rstat_flush(), so once bis->lqueued is cleared in
> > > > blkcg_rstat_flush(), this same bis could be added to the percpu
> > > > llist, and __blkcg_rstat_flush() from blkg_release() follows. This
> > > > should be the only chance for a concurrent stats update.
> > > 
> > > That is what I have in mind. A __blkcg_rstat_flush() can be from
> > > blkg_release() and another one from the regular cgroup_rstat_flush*().
> > 
> > But the two blkg can't be the same because blkg_release() is run from
> > an RCU callback.
> 
> I am not talking about the parent blkg, which can have a concurrent
> update. It is probably safe at the blkg level, but the lockless list may
> contain another bisc from a blkg that is not being destroyed.
> 
> > > > 
> > > > But blkg_release() is run in an RCU callback, so the previous flush
> > > > has been done, but a new flush can come, and another blkg's stat
> > > > could be added per the same story above.
> > > > 
> > > > > One way to avoid that is to synchronize it by
> > > > > cgroup_rstat_cpu_lock. Another way is to use bisc->lqueued for
> > > > > synchronization.
> > > > 
> > > > I'd avoid the external cgroup lock here.
> > > > 
> > > > > In that case, you need to move WRITE_ONCE(bisc->lqueued, false)
> > > > > in blkcg_rstat_flush() to the end, after all the
> > > > > blkcg_iostat_update() calls, with smp_store_release(), and
> > > > > replace the READ_ONCE(bis->lqueued) check in
> > > > > blk_cgroup_bio_start() with smp_load_acquire().
> > > > 
> > > > This way looks doable, but I guess it still can't avoid a
> > > > concurrent update on the parent stat, such as when
> > > > __blkcg_rstat_flush() from blkg_release() is in progress, another
> > > > sibling blkg's bis is added, and meanwhile blkcg_rstat_flush() is
> > > > called.
> > > 
> > > I realized that the use of cgroup_rstat_cpu_lock or the alternative
> > > was not safe enough to prevent a concurrent parent blkg rstat update.
> > 
> > Indeed.
> > 
> > > > Another way is to add blkcg->stat_lock for covering
> > > > __blkcg_rstat_flush(); what do you think of that?
> > > 
> > > I am thinking of adding a raw spinlock into blkg and taking it when
> > > doing blkcg_iostat_update(). This can guarantee no concurrent update
> > > to the rstat data. It has to be a raw spinlock as it will be taken
> > > under the cgroup_rstat_cpu_lock raw spinlock.
> > 
> > We still need to be careful since both child and parent are updated in
> > the following blkcg_iostat_update():
> > 
> > 	if (parent && parent->parent)
> > 		blkcg_iostat_update(parent, &blkg->iostat.cur,
> > 				    &blkg->iostat.last);
> > 
> > But the concurrent update on the parent blkg can be avoided by holding
> > parent->stat_lock only.
> 
> That is true. Perhaps we can use Yosry's suggestion of adding a helper
> to acquire and release cgroup_rstat_lock before calling
> __blkcg_rstat_flush().
> That will also guarantee no concurrent update.

Yeah, probably add the following API:

void cgroup_rstat_css_flush(struct cgroup_subsys_state *css)
{
	int cpu;

	spin_lock_irq(&cgroup_rstat_lock);
	for_each_possible_cpu(cpu) {
		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
						       cpu);

		/* irqs are already disabled by the outer spin_lock_irq() */
		raw_spin_lock(cpu_lock);
		css->ss->css_rstat_flush(css, cpu);
		raw_spin_unlock(cpu_lock);
	}
	spin_unlock_irq(&cgroup_rstat_lock);
}

We should keep the fix as simple as possible given that it needs to be
backported to -stable; another choice is to add a global blkg_stat_lock.
Which approach is better?

Thanks,
Ming
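
P.S. For completeness, the store-release/load-acquire variant discussed
above would look roughly like the following. This is only a sketch
against the current blk-cgroup.c helpers (bisc->blkg, blkg_iostat_set(),
blkcg_iostat_update()), and as noted above it still wouldn't serialize
the parent update:

	/* flush side (blkcg_rstat_flush()): publish updates, then free the slot */
	llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
		struct blkcg_gq *blkg = bisc->blkg;
		struct blkcg_gq *parent = blkg->parent;
		struct blkg_iostat cur;
		unsigned int seq;

		/* fetch the current per-cpu values */
		do {
			seq = u64_stats_fetch_begin(&bisc->sync);
			blkg_iostat_set(&cur, &bisc->cur);
		} while (u64_stats_fetch_retry(&bisc->sync, seq));

		blkcg_iostat_update(blkg, &cur, &bisc->last);
		if (parent && parent->parent)
			blkcg_iostat_update(parent, &blkg->iostat.cur,
					    &blkg->iostat.last);

		/* pairs with smp_load_acquire() in blk_cgroup_bio_start() */
		smp_store_release(&bisc->lqueued, false);
	}

	/* producer side (blk_cgroup_bio_start()) */
	if (!smp_load_acquire(&bis->lqueued)) {
		llist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));
		WRITE_ONCE(bis->lqueued, true);
	}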
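
The global-lock alternative, for comparison, would slot into
__blkcg_rstat_flush() roughly like this (again just a sketch;
'blkg_stat_lock' is only a suggested name). It has to be a raw spinlock
because cgroup_rstat_flush() calls in here under the raw
cgroup_rstat_cpu_lock:

	static DEFINE_RAW_SPINLOCK(blkg_stat_lock);

	static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
	{
		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
		struct llist_node *lnode;
		struct blkg_iostat_set *bisc, *next_bisc;
		unsigned long flags;

		lnode = llist_del_all(lhead);
		if (!lnode)
			return;

		/*
		 * Serialize flushes from cgroup_rstat_flush() and from
		 * blkg_release() so that updates on the same parent blkg
		 * can't race.
		 */
		raw_spin_lock_irqsave(&blkg_stat_lock, flags);
		llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
			/* existing per-bisc update of blkg and its parent */
		}
		raw_spin_unlock_irqrestore(&blkg_stat_lock, flags);
	}

That keeps the whole fix inside blk-cgroup.c, which may be easier to
backport to -stable than adding a new cgroup API.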