Patch "sched/fair: Prevent dead task groups from regaining cfs_rq's" has been added to the 5.15-stable tree

This is a note to let you know that I've just added the patch titled

    sched/fair: Prevent dead task groups from regaining cfs_rq's

to the 5.15-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     sched-fair-prevent-dead-task-groups-from-regaining-c.patch
and it can be found in the queue-5.15 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 88d9f588dc46d057ccb16d8756a9b7dc107a0201
Author: Mathias Krause <minipli@xxxxxxxxxxxxxx>
Date:   Wed Nov 3 20:06:13 2021 +0100

    sched/fair: Prevent dead task groups from regaining cfs_rq's
    
    [ Upstream commit b027789e5e50494c2325cc70c8642e7fd6059479 ]
    
    Kevin is reporting crashes which point to a use-after-free of a cfs_rq
    in update_blocked_averages(). Initial debugging revealed that we have
    live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in
    free_fair_sched_group(). However, it was unclear how that could happen.
    
    His kernel config happened to lead to a layout of struct sched_entity
    that put the 'my_q' member directly in the middle of the object, which
    made it incidentally overlap with SLUB's freelist pointer. That, in
    combination with SLAB_FREELIST_HARDENED's freelist pointer mangling,
    leads to a reliable access violation in the form of a #GP, which made
    the UAF fail fast.
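
    As a rough illustration of why the UAF fails fast (a userspace sketch
    with made-up addresses, not the kernel's actual code; SLUB roughly
    XORs the stored freelist pointer with a per-cache secret and the
    byte-swapped address of the slot it lives in):

        #include <stdint.h>
        #include <stdio.h>

        /* rough approximation of SLUB's hardened freelist pointer mangling */
        static uint64_t mangle(uint64_t ptr, uint64_t secret, uint64_t slot)
        {
                return ptr ^ secret ^ __builtin_bswap64(slot);
        }

        int main(void)
        {
                uint64_t next_free = 0xffff888012345600ULL; /* hypothetical */
                uint64_t secret    = 0x005deadbeefcafe0ULL; /* hypothetical */
                uint64_t slot_addr = 0xffff888012345a80ULL; /* hypothetical */

                /*
                 * After kfree(), the bytes that used to hold se->my_q hold
                 * the mangled freelist pointer instead, so a stale reader
                 * sees a wild, non-canonical value and the dereference traps
                 * with a #GP rather than silently reading freed memory.
                 */
                printf("stale se->my_q read: %#llx\n",
                       (unsigned long long)mangle(next_free, secret, slot_addr));
                return 0;
        }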
    
    Michal seems to have run into the same issue[1]. He already correctly
    diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
    cfs_rq's to list on unthrottle") creates the preconditions for the UAF
    by re-adding cfs_rq's even to task groups that have no running tasks
    left, i.e. to dead ones. His analysis, however, misses the real root
    cause, which cannot be seen from the crash backtrace alone: the real
    offender is tg_unthrottle_up() getting called via
    sched_cfs_period_timer() from the timer interrupt at an inconvenient
    time.
    
    When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
    task group, it doesn't protect itself from getting interrupted. If the
    timer interrupt triggers while we iterate over all CPUs or after
    unregister_fair_sched_group() has finished but prior to unlinking the
    task group, sched_cfs_period_timer() will execute and walk the list of
    task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
    dying task group. These will later -- in free_fair_sched_group() -- be
    kfree()'d while still being linked, leading to the fireworks Kevin
    and Michal are seeing.
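
    At its core this is the classic "freed while still on a list" pattern;
    a tiny userspace sketch of just that pattern (plain C, not kernel
    code):

        #include <stdio.h>
        #include <stdlib.h>

        struct node {
                struct node *next;
                int val;
        };

        static struct node head;        /* stand-in for rq->leaf_cfs_rq_list */

        int main(void)
        {
                struct node *n = malloc(sizeof(*n));

                if (!n)
                        return 1;
                n->val = 42;
                n->next = head.next;
                head.next = n;          /* linked, i.e. "on_list = 1" */

                free(n);                /* freed while still linked: the bug */

                /* a later walk -- update_blocked_averages() in the kernel */
                for (struct node *p = head.next; p; p = p->next)
                        printf("%d\n", p->val); /* use-after-free */

                return 0;
        }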
    
    To fix this race, ensure the dying task group gets unlinked first.
    However, simply switching the order of unregistering and unlinking the
    task group isn't sufficient, as concurrent RCU walkers might still see
    it, as can be seen below:
    
        CPU1:                                      CPU2:
          :                                        timer IRQ:
          :                                          do_sched_cfs_period_timer():
          :                                            :
          :                                            distribute_cfs_runtime():
          :                                              rcu_read_lock();
          :                                              :
          :                                              unthrottle_cfs_rq():
        sched_offline_group():                             :
          :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
          list_del_rcu(&tg->list);                           :
     (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
          :                                                    :
     (2)  list_del_rcu(&tg->siblings);                         :
          :                                                    tg_unthrottle_up():
          unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
            :                                                    :
            list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
            :                                                    :
            :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
     (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
          :                                                      :
          :                                                    :
          :                                                  :
          :                                                :
          :                                              :
     (4)  :                                              rcu_read_unlock();
    
    CPU 2 walks the task group list in parallel to sched_offline_group();
    specifically, it will read the soon-to-be-unlinked task group entry at
    (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
    still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
    all cfs_rq's via list_del_leaf_cfs_rq() in
    unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of
    these at (3), which is the cause of the UAF later on.
    
    To prevent this additional race from happening, we need to wait until
    walk_tg_tree_from() has finished traversing the task groups, i.e.
    until the RCU read-side critical section ends at (4). Afterwards we're
    safe to call unregister_fair_sched_group(), as each new walk won't see
    the dying task group any more.
    
    On top of that, we need to wait for yet another RCU grace period after
    unregister_fair_sched_group() to ensure that print_cfs_stats(), which
    might run concurrently, always sees valid objects, i.e. not already
    freed ones.
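
    Put differently, the teardown ordering the patch establishes (see the
    diff below) is: unlink the task group, wait a grace period, unregister
    its cfs_rq's, wait another grace period, then free. A stubbed userspace
    sketch of just that ordering, with grace_period() standing in for real
    RCU grace periods (the first one is provided by cgroup between
    css_released() and css_free()):

        #include <stdio.h>

        static void grace_period(void)
        {
                puts("... RCU grace period ...");
        }

        int main(void)
        {
                puts("sched_release_group(): list_del_rcu() tg from the tree");
                grace_period();         /* walkers that still saw tg drain  */
                puts("sched_unregister_group(): unregister cfs_rq's etc.");
                grace_period();         /* print_cfs_stats() readers drain  */
                puts("sched_free_group(): kfree() the cfs_rq's, se's and tg");
                return 0;
        }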
    
    This patch survives Michal's reproducer[2] for 8h+ now; the reproducer
    used to trigger within minutes before.
    
      [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@xxxxxxxx/
      [2] https://lore.kernel.org/lkml/20211102160228.GA57072@xxxxxxxxxxxxxxxxx/
    
    Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
    [peterz: shuffle code around a bit]
    Reported-by: Kevin Tanguy <kevin.tanguy@xxxxxxxxxxxx>
    Signed-off-by: Mathias Krause <minipli@xxxxxxxxxxxxxx>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index 2067080bb2358..8629b37d118e7 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -31,7 +31,7 @@ static inline void autogroup_destroy(struct kref *kref)
 	ag->tg->rt_se = NULL;
 	ag->tg->rt_rq = NULL;
 #endif
-	sched_offline_group(ag->tg);
+	sched_release_group(ag->tg);
 	sched_destroy_group(ag->tg);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2c34c7bd559f2..779f27a4b46ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9720,6 +9720,22 @@ static void sched_free_group(struct task_group *tg)
 	kmem_cache_free(task_group_cache, tg);
 }
 
+static void sched_free_group_rcu(struct rcu_head *rcu)
+{
+	sched_free_group(container_of(rcu, struct task_group, rcu));
+}
+
+static void sched_unregister_group(struct task_group *tg)
+{
+	unregister_fair_sched_group(tg);
+	unregister_rt_sched_group(tg);
+	/*
+	 * We have to wait for yet another RCU grace period to expire, as
+	 * print_cfs_stats() might run concurrently.
+	 */
+	call_rcu(&tg->rcu, sched_free_group_rcu);
+}
+
 /* allocate runqueue etc for a new task group */
 struct task_group *sched_create_group(struct task_group *parent)
 {
@@ -9763,25 +9779,35 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
 }
 
 /* rcu callback to free various structures associated with a task group */
-static void sched_free_group_rcu(struct rcu_head *rhp)
+static void sched_unregister_group_rcu(struct rcu_head *rhp)
 {
 	/* Now it should be safe to free those cfs_rqs: */
-	sched_free_group(container_of(rhp, struct task_group, rcu));
+	sched_unregister_group(container_of(rhp, struct task_group, rcu));
 }
 
 void sched_destroy_group(struct task_group *tg)
 {
 	/* Wait for possible concurrent references to cfs_rqs complete: */
-	call_rcu(&tg->rcu, sched_free_group_rcu);
+	call_rcu(&tg->rcu, sched_unregister_group_rcu);
 }
 
-void sched_offline_group(struct task_group *tg)
+void sched_release_group(struct task_group *tg)
 {
 	unsigned long flags;
 
-	/* End participation in shares distribution: */
-	unregister_fair_sched_group(tg);
-
+	/*
+	 * Unlink first, to avoid walk_tg_tree_from() from finding us (via
+	 * sched_cfs_period_timer()).
+	 *
+	 * For this to be effective, we have to wait for all pending users of
+	 * this task group to leave their RCU critical section to ensure no new
+	 * user will see our dying task group any more. Specifically ensure
+	 * that tg_unthrottle_up() won't add decayed cfs_rq's to it.
+	 *
+	 * We therefore defer calling unregister_fair_sched_group() to
+	 * sched_unregister_group() which is guaranteed to get called only after the
+	 * current RCU grace period has expired.
+	 */
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
@@ -9900,7 +9926,7 @@ static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_offline_group(tg);
+	sched_release_group(tg);
 }
 
 static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
@@ -9910,7 +9936,7 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 	/*
 	 * Relies on the RCU grace period between css_released() and this.
 	 */
-	sched_free_group(tg);
+	sched_unregister_group(tg);
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f6a05d9b54436..6f16dfb742462 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11358,8 +11358,6 @@ void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
 
-	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
-
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -11436,6 +11434,8 @@ void unregister_fair_sched_group(struct task_group *tg)
 	struct rq *rq;
 	int cpu;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(cpu) {
 		if (tg->se[cpu])
 			remove_entity_load_avg(tg->se[cpu]);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3daf42a0f4623..bfef3f39b5552 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -137,13 +137,17 @@ static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
 	return rt_rq->rq;
 }
 
-void free_rt_sched_group(struct task_group *tg)
+void unregister_rt_sched_group(struct task_group *tg)
 {
-	int i;
-
 	if (tg->rt_se)
 		destroy_rt_bandwidth(&tg->rt_bandwidth);
 
+}
+
+void free_rt_sched_group(struct task_group *tg)
+{
+	int i;
+
 	for_each_possible_cpu(i) {
 		if (tg->rt_rq)
 			kfree(tg->rt_rq[i]);
@@ -250,6 +254,8 @@ static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
 	return &rq->rt;
 }
 
+void unregister_rt_sched_group(struct task_group *tg) { }
+
 void free_rt_sched_group(struct task_group *tg) { }
 
 int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d3e5793e1172..4f432826933da 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -486,6 +486,7 @@ extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
+extern void unregister_rt_sched_group(struct task_group *tg);
 extern void free_rt_sched_group(struct task_group *tg);
 extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
 extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
@@ -501,7 +502,7 @@ extern struct task_group *sched_create_group(struct task_group *parent);
 extern void sched_online_group(struct task_group *tg,
 			       struct task_group *parent);
 extern void sched_destroy_group(struct task_group *tg);
-extern void sched_offline_group(struct task_group *tg);
+extern void sched_release_group(struct task_group *tg);
 
 extern void sched_move_task(struct task_struct *tsk);
 


