On Thu, May 25, 2023 at 12:02:01AM -0400, Waiman Long wrote:
> 
> On 5/24/23 23:47, Ming Lei wrote:
> > On Wed, May 24, 2023 at 10:50:29PM -0400, Waiman Long wrote:
> > > On 5/24/23 22:04, Ming Lei wrote:
> > > > On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
> > > > > On 5/24/23 11:43, Waiman Long wrote:
> > > > 
> > > > llist_del_all() moves all 'bis' into one local list, and bis is one
> > > > percpu variable of blkg, so in theory the same bis won't be flushed
> > > > at the same time. And one bis should be touched in just one stat
> > > > flush code path because of llist_del_all().
> > > > 
> > > > So 'bis' can still be thought of as being flushed in a serialized
> > > > way.
> > > > 
> > > > However, blk_cgroup_bio_start() runs concurrently with
> > > > blkcg_rstat_flush(), so once bis->lqueued is cleared in
> > > > blkcg_rstat_flush(), this same bis could be added to the percpu
> > > > llist, and __blkcg_rstat_flush() from blkg_release() follows. This
> > > > should be the only chance for a concurrent stats update.
> > > 
> > > That is what I have in mind. A __blkcg_rstat_flush() can be from
> > > blkg_release() and another one from the regular cgroup_rstat_flush*().
> > 
> > But the two blkg can't be the same because blkg_release() is run from
> > an RCU callback.
> 
> I am not talking about the parent blkg, which can have a concurrent
> update. It is probably safe at the blkg level, but the lockless list may
> contain another bisc from a blkg that is not being destroyed.
> 
> > > > 
> > > > But blkg_release() is run in an RCU callback, so the previous flush
> > > > has been done, but a new flush can come, and another blkg's stat
> > > > could be added per the same story above.
> > > > 
> > > > > One way to avoid that is to synchronize it by
> > > > > cgroup_rstat_cpu_lock. Another way is to use bisc->lqueued for
> > > > > synchronization.
> > > > 
> > > > I'd avoid the external cgroup lock here.
> > > > 
> > > > > In that case, you need to move WRITE_ONCE(bisc->lqueued, false)
> > > > > in blkcg_rstat_flush() to the end, after all the
> > > > > blkcg_iostat_update() calls, with smp_store_release(), and
> > > > > replace the READ_ONCE(bis->lqueued) check in
> > > > > blk_cgroup_bio_start() with smp_load_acquire().
> > > > 
> > > > This way looks doable, but I guess it still can't avoid a
> > > > concurrent update on the parent stat, such as when
> > > > __blkcg_rstat_flush() from blkg_release() is in progress, another
> > > > sibling blkg's bis is added, and meanwhile blkcg_rstat_flush() is
> > > > called.
> > > 
> > > I realized that the use of cgroup_rstat_cpu_lock or the alternative
> > > was not safe enough to prevent a concurrent parent blkg rstat update.
> > 
> > Indeed.
> > 
> > > > Another way is to add blkcg->stat_lock for covering
> > > > __blkcg_rstat_flush(); what do you think of that?
> > > 
> > > I am thinking of adding a raw spinlock into blkg and taking it when
> > > doing blkcg_iostat_update(). This can guarantee no concurrent update
> > > to the rstat data. It has to be a raw spinlock as it will be taken
> > > under the cgroup_rstat_cpu_lock raw spinlock.
> > 
> > We still need to be careful since both child and parent are updated in
> > the following blkcg_iostat_update():
> > 
> > 	if (parent && parent->parent)
> > 		blkcg_iostat_update(parent, &blkg->iostat.cur,
> > 				    &blkg->iostat.last);
> > 
> > But the concurrent update on the parent blkg can be avoided by holding
> > parent->stat_lock only.
> 
> That is true. Perhaps we can use Yosry's suggestion of adding a helper
> to acquire and release cgroup_rstat_lock before calling
> __blkcg_rstat_flush().
> That will also guarantee no concurrent update.

Yeah, probably add the following API:

void cgroup_rstat_css_flush(struct cgroup_subsys_state *css)
{
	int cpu;

	spin_lock_irq(&cgroup_rstat_lock);
	for_each_possible_cpu(cpu) {
		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
						       cpu);

		/* irqs are already disabled by the outer spin_lock_irq() */
		raw_spin_lock(cpu_lock);
		css->ss->css_rstat_flush(css, cpu);
		raw_spin_unlock(cpu_lock);
	}
	spin_unlock_irq(&cgroup_rstat_lock);
}

We should keep the fix as simple as possible given that it needs to be
backported to -stable; another choice is to add a global blkg_stat_lock.
Which approach is better?

Thanks,
Ming
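
P.S. For completeness, the store-release/load-acquire variant discussed
above would look roughly like the following. This is only a sketch
against the current blk-cgroup.c helpers (bisc->blkg, blkg_iostat_set(),
blkcg_iostat_update()), and as noted above it still wouldn't serialize
the parent update:

	/* flush side (blkcg_rstat_flush()): publish updates, then free the slot */
	llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
		struct blkcg_gq *blkg = bisc->blkg;
		struct blkcg_gq *parent = blkg->parent;
		struct blkg_iostat cur;
		unsigned int seq;

		/* fetch the current per-cpu values */
		do {
			seq = u64_stats_fetch_begin(&bisc->sync);
			blkg_iostat_set(&cur, &bisc->cur);
		} while (u64_stats_fetch_retry(&bisc->sync, seq));

		blkcg_iostat_update(blkg, &cur, &bisc->last);
		if (parent && parent->parent)
			blkcg_iostat_update(parent, &blkg->iostat.cur,
					    &blkg->iostat.last);

		/* pairs with smp_load_acquire() in blk_cgroup_bio_start() */
		smp_store_release(&bisc->lqueued, false);
	}

	/* producer side (blk_cgroup_bio_start()) */
	if (!smp_load_acquire(&bis->lqueued)) {
		llist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));
		WRITE_ONCE(bis->lqueued, true);
	}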
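
The global-lock alternative, for comparison, would slot into
__blkcg_rstat_flush() roughly like this (again just a sketch;
'blkg_stat_lock' is only a suggested name). It has to be a raw spinlock
because cgroup_rstat_flush() calls in here under the raw
cgroup_rstat_cpu_lock:

	static DEFINE_RAW_SPINLOCK(blkg_stat_lock);

	static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
	{
		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
		struct llist_node *lnode;
		struct blkg_iostat_set *bisc, *next_bisc;
		unsigned long flags;

		lnode = llist_del_all(lhead);
		if (!lnode)
			return;

		/*
		 * Serialize flushes from cgroup_rstat_flush() and from
		 * blkg_release() so that updates on the same parent blkg
		 * can't race.
		 */
		raw_spin_lock_irqsave(&blkg_stat_lock, flags);
		llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
			/* existing per-bisc update of blkg and its parent */
		}
		raw_spin_unlock_irqrestore(&blkg_stat_lock, flags);
	}

That keeps the whole fix inside blk-cgroup.c, which may be easier to
backport to -stable than adding a new cgroup API.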