Hello,

On Wed, Jan 18, 2023 at 08:31:52PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@xxxxxxxxxx>
>
> Currently parent pd can be freed before child pd:
>
> t1: remove cgroup C1
> blkcg_destroy_blkgs
>  blkg_destroy
>   list_del_init(&blkg->q_node)
>   // remove blkg from queue list
>   percpu_ref_kill(&blkg->refcnt)
>    blkg_release
>     call_rcu
>
> t2: from t1
> __blkg_release
>  blkg_free
>   schedule_work
>                         t4: deactivate policy
>                         blkcg_deactivate_policy
>                          pd_free_fn
>                          // parent of C1 is freed first
> t3: from t2
> blkg_free_workfn
>  pd_free_fn
>
> If policy(for example, ioc_timer_fn() from iocost) access parent pd from
> child pd after pd_offline_fn(), then UAF can be triggered.
>
> Fix the problem by delaying 'list_del_init(&blkg->q_node)' from
> blkg_destroy() to blkg_free_workfn(), and use a new disk level mutex to
                                            ^
                                            using

> protect blkg_free_workfn() and blkcg_deactivate_policy).
  ^                                                     ^
synchronize?                                            ()

> @@ -118,16 +118,26 @@ static void blkg_free_workfn(struct work_struct *work)
>  {
>         struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
>                                              free_work);
> +       struct request_queue *q = blkg->q;
>         int i;
>

> +       if (q)
> +               mutex_lock(&q->blkcg_mutex);

A comment explaining what the above is synchronizing would be useful.

> +
>         for (i = 0; i < BLKCG_MAX_POLS; i++)
>                 if (blkg->pd[i])
>                         blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>
>         if (blkg->parent)
>                 blkg_put(blkg->parent);
> -       if (blkg->q)
> -               blk_put_queue(blkg->q);
> +
> +       if (q) {
> +               if (!list_empty(&blkg->q_node))

We can drop the above if.

> +                       list_del_init(&blkg->q_node);
> +               mutex_unlock(&q->blkcg_mutex);
> +               blk_put_queue(q);
> +       }
> +
>         free_percpu(blkg->iostat_cpu);
>         percpu_ref_exit(&blkg->refcnt);
>         kfree(blkg);
> @@ -462,9 +472,14 @@ static void blkg_destroy(struct blkcg_gq *blkg)
>         lockdep_assert_held(&blkg->q->queue_lock);
>         lockdep_assert_held(&blkcg->lock);
>
> -       /* Something wrong if we are trying to remove same group twice */
> -       WARN_ON_ONCE(list_empty(&blkg->q_node));
> -       WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
> +       /*
> +        * blkg is removed from queue list in blkg_free_workfn(), hence this
> +        * function can be called from blkcg_destroy_blkgs() first, and then
> +        * before blkg_free_workfn(), this function can be called again in
> +        * blkg_destroy_all().

How about?

 * blkg stays on the queue list until blkg_free_workfn(), hence this
 * function can be called from blkcg_destroy_blkgs() first and again
 * from blkg_destroy_all() before blkg_free_workfn().

> +        */
> +       if (hlist_unhashed(&blkg->blkcg_node))
> +               return;
>
>         for (i = 0; i < BLKCG_MAX_POLS; i++) {
>                 struct blkcg_policy *pol = blkcg_policy[i];
> @@ -478,8 +493,11 @@ static void blkg_destroy(struct blkcg_gq *blkg)
>
>         blkg->online = false;
>
> +       /*
> +        * Delay deleting list blkg->q_node to blkg_free_workfn() to synchronize
> +        * pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy().
> +        */

So, it'd be better to add a more comprehensive comment in blkg_free_workfn()
explaining why we need this synchronization and how it works and then point
to it from here.

Other than comments, it looks great to me. Thanks a lot for your patience
and seeing it through.

--
tejun
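
P.S. To make the comment feedback concrete, the blkg_free_workfn() hunk could
end up looking something like the sketch below. This is only an illustration
assembled from the quoted patch with the three notes above folded in; the
exact comment wording is of course up to you, and it assumes the
q->blkcg_mutex field this series adds:

static void blkg_free_workfn(struct work_struct *work)
{
        struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
                                             free_work);
        struct request_queue *q = blkg->q;
        int i;

        /*
         * pd_free_fn() can also be called from blkcg_deactivate_policy().
         * To guarantee that a parent's pd is never freed before its child's,
         * deleting blkg->q_node from the queue list is delayed from
         * blkg_destroy() to here, and q->blkcg_mutex is used to synchronize
         * blkg_free_workfn() against blkcg_deactivate_policy().
         */
        if (q)
                mutex_lock(&q->blkcg_mutex);

        for (i = 0; i < BLKCG_MAX_POLS; i++)
                if (blkg->pd[i])
                        blkcg_policy[i]->pd_free_fn(blkg->pd[i]);

        if (blkg->parent)
                blkg_put(blkg->parent);

        if (q) {
                /*
                 * list_del_init() is a no-op on a list_head that is already
                 * (re)initialized, so the list_empty() check isn't needed.
                 */
                list_del_init(&blkg->q_node);
                mutex_unlock(&q->blkcg_mutex);
                blk_put_queue(q);
        }

        free_percpu(blkg->iostat_cpu);
        percpu_ref_exit(&blkg->refcnt);
        kfree(blkg);
}

and then the comment in blkg_destroy() can shrink to a pointer, e.g.:

        /*
         * Delay deleting blkg->q_node to blkg_free_workfn(); see the comment
         * there for why pd_free_fn() has to be synchronized.
         */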