Hello.

On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long <longman@xxxxxxxxxx> wrote:
> As it is likely that not all the percpu blkg_iostat_set's has been
> updated since the last flush, those stale blkg_iostat_set's don't need
> to be flushed in this case.

Yes, there's no point in flushing stats for idle devices if there can be
many of them. Good idea.

> +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
> +{
> +	return xchg(&lhead->first, &llist_last);
> +}
> +
> +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
> +{
> +	struct llist_node *next = READ_ONCE(lnode->next);
> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
> +					    lnode)->blkg;
> +
> +	WRITE_ONCE(lnode->next, NULL);
> +	percpu_ref_put(&blkg->refcnt);
> +	return next;
> +}

Idea/just asking: would it make sense to generalize this into llist.c
(this is basically llist_del_first() + llist_del_all() with a sentinel)?
For the sake of reusability. (A rough sketch of what I mean is at the
bottom, after my signature.)

> +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)		\
> +	for (; (node != &llist_last) &&					\
> +		(pos = llist_entry(node, struct blkg_iostat_set, lnode), \
> +		 nxt = fetch_delete_lnode_next(node), true);		\
> +		node = nxt)
> +

It's good hygiene to parenthesize the args (concrete version below).

> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>  	}
>  	bis->cur.ios[rwd]++;
>  
> +	if (!READ_ONCE(bis->lnode.next)) {
> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> +
> +		llist_add(&bis->lnode, lhead);
> +		percpu_ref_get(&bis->blkg->refcnt);
> +	}
> +

When a blkg's cgroup is rmdir'd, what happens with the lhead list? We
have cgroup_rstat_exit() in css_free_rwork_fn() that ultimately flushes
rstats. init_and_link_css(), however, adds a reference from blkcg->css
to cgroup->css. The blkcg->css would be (transitively) pinned by the
lhead list and hence would prevent the final flush (when refs drop to
zero). Seems like a cyclic dependency.

Luckily, there's also per-subsys flushing in css_release(), which could
be moved to after rmdir (offlining) but before the last ref is gone:

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..d830e6a8fb3b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5165,11 +5165,6 @@ static void css_release_work_fn(struct work_struct *work)
 
 	if (ss) {
 		/* css release path */
-		if (!list_empty(&css->rstat_css_node)) {
-			cgroup_rstat_flush(cgrp);
-			list_del_rcu(&css->rstat_css_node);
-		}
-
 		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
 		if (ss->css_released)
 			ss->css_released(css);
@@ -5279,6 +5274,11 @@ static void offline_css(struct cgroup_subsys_state *css)
 
 	css->flags &= ~CSS_ONLINE;
 	RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
 
+	if (!list_empty(&css->rstat_css_node)) {
+		cgroup_rstat_flush(css->cgroup);
+		list_del_rcu(&css->rstat_css_node);
+	}
+
 	wake_up_all(&css->cgroup->offline_waitq);
 }

(not tested)

>  	u64_stats_update_end_irqrestore(&bis->sync, flags);
>  	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> +		cgroup_rstat_updated(blkcg->css.cgroup, cpu);

Maybe bundle the lhead list maintenance with cgroup_rstat_updated()
under cgroup_subsys_on_dfl()? The stats can't be read on v1 anyway.

Thanks,
Michal
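
P.S. Re the llist.c generalization, here's a rough, untested sketch of
what I have in mind. The sentinel object and both helper names are made
up for illustration; none of this is existing llist API:

/* llist.h (sketch only) */

/* Sentinel left behind in a fetched list; never treated as an entry. */
extern struct llist_node llist_sentinel;

/*
 * Atomically take all entries, leaving the sentinel behind.  Nodes keep
 * a non-NULL ->next while "on list" (the last one points to the
 * sentinel), so adders can use a cheap ->next != NULL membership test.
 */
static inline struct llist_node *llist_fetch_all(struct llist_head *head)
{
	return xchg(&head->first, &llist_sentinel);
}

/*
 * Detach @node from a previously fetched list and return its successor.
 * Clearing ->next makes the node re-addable.
 */
static inline struct llist_node *llist_detach_next(struct llist_node *node)
{
	struct llist_node *next = READ_ONCE(node->next);

	WRITE_ONCE(node->next, NULL);
	return next;
}

The blkcg-specific percpu_ref_put() would then stay on the blk-cgroup
side, in the iteration macro's body.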
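
For completeness, the parenthesized variant of the macro would be (same
semantics, just hygiene):

#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
	for (; ((node) != &llist_last) &&				\
		((pos) = llist_entry((node), struct blkg_iostat_set,	\
				     lnode),				\
		 (nxt) = fetch_delete_lnode_next((node)), true);	\
		(node) = (nxt))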
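
And the cgroup_subsys_on_dfl() bundling could look roughly like the
rearrangement below (untested; note it moves the llist maintenance out
of the u64_stats/irq-disabled section, which may need a second look):

	if (cgroup_subsys_on_dfl(io_cgrp_subsys)) {
		/* Enqueue for flushing only where rstat is consumed (v2). */
		if (!READ_ONCE(bis->lnode.next)) {
			struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);

			llist_add(&bis->lnode, lhead);
			percpu_ref_get(&bis->blkg->refcnt);
		}
		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
	}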