[PATCH review for 4.4] sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



From: Oleg Nesterov <oleg@xxxxxxxxxx>

[ Upstream commit 18f649ef344127ef6de23a5a4272dbe2fdb73dde ]

The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
it, but see the next patch.

However the comment is correct in that autogroup_move_group() must always
change task_group() for every thread so the sysctl_ check is very wrong;
we can race with cgroups and even sys_setsid() is not safe because a task
running with task_group() == ag->tg must participate in refcounting:

	int main(void)
	{
		int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

		assert(sctl > 0);
		if (fork()) {
			wait(NULL); // destroy the child's ag/tg
			pause();
		}

		assert(pwrite(sctl, "1\n", 2, 0) == 2);
		assert(setsid() > 0);
		if (fork())
			pause();

		kill(getppid(), SIGKILL);
		sleep(1);

		// The child has gone, the grandchild runs with kref == 1
		assert(pwrite(sctl, "0\n", 2, 0) == 2);
		assert(setsid() > 0);

		// runs with the freed ag/tg
		for (;;)
			sleep(1);

		return 0;
	}

crashes the kernel. It doesn't really need sleep(1), it doesn't matter if
autogroup_move_group() actually frees the task_group or this happens later.

-----

Without this commit, 4.4.y + hikey kernel crashes while running
'autogroup01' test in ltp (it seems to be written to catch this break
itself):

Unable to handle kernel NULL pointer dereference at virtual address 00000080
pgd = ffffffc050b11000
[00000080] *pgd=0000000070744003, *pud=0000000070744003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Modules linked in: bnep arc4 wl18xx wlcore mac80211 cfg80211 wlcore_sdio spidev btwilink st_drv bluetooth rfkill
CPU: 7 PID: 7965 Comm: autogroup01 Tainted: G        W       4.4.92-rc1+ #1
Hardware name: HiKey Development Board (DT)
task: ffffffc04ccf0000 ti: ffffffc0491ec000 task.ti: ffffffc0491ec000
PC is at pick_next_task_fair+0x620/0x8f0
LR is at pick_next_task_fair+0x620/0x8f0
pc : [<ffffffc0000f3dd0>] lr : [<ffffffc0000f3dd0>] pstate: 600001c5
sp : ffffffc0491efcc0
...
...
...
Call trace:
[<ffffffc0000f3dd0>] pick_next_task_fair+0x620/0x8f0
[<ffffffc00080fb7c>] __schedule+0x124/0x768
[<ffffffc000810204>] schedule+0x44/0xb8
[<ffffffc0008148c4>] do_nanosleep+0x94/0x168
[<ffffffc00012340c>] hrtimer_nanosleep+0x94/0x148
[<ffffffc000123534>] SyS_nanosleep+0x74/0xa0
[<ffffffc000085e30>] el0_svc_naked+0x24/0x28
Code: 54ffdae1 aa1303e0 aa1403e1 97ffdb61 (f9404013)

With this, the test case passes:

hikey:/opt/ltp# ./testcases/bin/autogroup01
tst_test.c:934: INFO: Timeout per run is 0h 05m 00s
autogroup01.c:71: PASS: Bug not reproduced

Summary:
passed   1
failed   0
skipped  0
warnings 0

Reported-by: Vern Lovejoy <vlovejoy@xxxxxxxxxx>
Signed-off-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: hartsjc@xxxxxxxxxx
Cc: vbendel@xxxxxxxxxx
Link: http://lkml.kernel.org/r/20161114184609.GA15965@xxxxxxxxxx
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
Signed-off-by: Sumit Semwal <sumit.semwal@xxxxxxxxxx>
 [sumits: submit to 4.4 LTS, post testing on Hikey]
---
 kernel/sched/auto_group.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/auto_group.c b/kernel/sched/auto_group.c
index 750ed601ddf7..8620fd01b3d0 100644
--- a/kernel/sched/auto_group.c
+++ b/kernel/sched/auto_group.c
@@ -111,14 +111,11 @@ bool task_wants_autogroup(struct task_struct *p, struct task_group *tg)
 {
 	if (tg != &root_task_group)
 		return false;
-
 	/*
-	 * We can only assume the task group can't go away on us if
-	 * autogroup_move_group() can see us on ->thread_group list.
+	 * If we race with autogroup_move_group() the caller can use the old
+	 * value of signal->autogroup but in this case sched_move_task() will
+	 * be called again before autogroup_kref_put().
 	 */
-	if (p->flags & PF_EXITING)
-		return false;
-
 	return true;
 }
 
@@ -138,13 +135,17 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
 	}
 
 	p->signal->autogroup = autogroup_kref_get(ag);
-
-	if (!READ_ONCE(sysctl_sched_autogroup_enabled))
-		goto out;
-
+	/*
+	 * We can't avoid sched_move_task() after we changed signal->autogroup,
+	 * this process can already run with task_group() == prev->tg or we can
+	 * race with cgroup code which can read autogroup = prev under rq->lock.
+	 * In the latter case for_each_thread() can not miss a migrating thread,
+	 * cpu_cgroup_attach() must not be possible after cgroup_exit() and it
+	 * can't be removed from thread list, we hold ->siglock.
+	 */
 	for_each_thread(p, t)
 		sched_move_task(t);
-out:
+
 	unlock_task_sighand(p, &flags);
 	autogroup_kref_put(prev);
 }
-- 
2.7.4




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]