Re: [PATCH -next] cgroup: fix uaf when proc_cpuset_show

Waiman Long <longman@xxxxxxxxxx> · Mon, 24 Jun 2024 19:59:15 -0400

On 6/23/24 22:59, chenridong wrote:

On 2024/6/22 23:05, Waiman Long wrote:

On 6/22/24 07:38, Chen Ridong wrote:
We found a refcount UAF bug as follows:

BUG: KASAN: use-after-free in cgroup_path_ns+0x112/0x150
Read of size 8 at addr ffff8882a4b242b8 by task atop/19903

CPU: 27 PID: 19903 Comm: atop Kdump: loaded Tainted: GF
Call Trace:
  dump_stack+0x7d/0xa7
  print_address_description.constprop.0+0x19/0x170
  ? cgroup_path_ns+0x112/0x150
  __kasan_report.cold+0x6c/0x84
  ? print_unreferenced+0x390/0x3b0
  ? cgroup_path_ns+0x112/0x150
  kasan_report+0x3a/0x50
  cgroup_path_ns+0x112/0x150
  proc_cpuset_show+0x164/0x530
  proc_single_show+0x10f/0x1c0
  seq_read_iter+0x405/0x1020
  ? aa_path_link+0x2e0/0x2e0
  seq_read+0x324/0x500
  ? seq_read_iter+0x1020/0x1020
  ? common_file_perm+0x2a1/0x4a0
  ? fsnotify_unmount_inodes+0x380/0x380
  ? bpf_lsm_file_permission_wrapper+0xa/0x30
  ? security_file_permission+0x53/0x460
  vfs_read+0x122/0x420
  ksys_read+0xed/0x1c0
  ? __ia32_sys_pwrite64+0x1e0/0x1e0
  ? __audit_syscall_exit+0x741/0xa70
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x67/0xcc

This is also reported by: 
https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

This can be reproduced by the following methods:
1. Add an mdelay(1000) before acquiring the cgroup_lock in the
   cgroup_path_ns function.
2. Run $ cat /proc/<pid>/cpuset repeatedly.
3. Run $ mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
   and $ umount /sys/fs/cgroup/cpuset/ repeatedly.
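
Step 2 above can also be driven from a small C helper instead of a shell
loop. The following is only a sketch: the function names read_first_line()
and poll_cpuset() are made up for illustration and are not part of the
report, and the mdelay() instrumentation from step 1 plus the mount/umount
loop from step 3 are still needed to hit the race.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of @path into @buf and strip the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path && buf && len > 0);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Read /proc/<pid>/cpuset @iterations times. Each read goes through
 * proc_cpuset_show() in the kernel. Returns the number of successful reads.
 */
int poll_cpuset(const char *pid, int iterations)
{
	char path[64], line[4096];
	int i, ok = 0;

	snprintf(path, sizeof(path), "/proc/%s/cpuset", pid);
	for (i = 0; i < iterations; i++)
		if (read_first_line(path, line, sizeof(line)) == 0)
			ok++;
	return ok;
}
```

For example, poll_cpuset("self", 100000) reads the calling process's own
cpuset path in a tight loop while another shell runs the mount/umount cycle.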

The race that causes this bug can be shown as below:

(umount)                |  (cat /proc/<pid>/cpuset)
css_release             |  proc_cpuset_show
css_release_work_fn     |  css = task_get_css(tsk, cpuset_cgrp_id);
css_free_rwork_fn       |  cgroup_path_ns(css->cgroup, ...);
cgroup_destroy_root     |  mutex_lock(&cgroup_mutex);
rebind_subsystems       |
cgroup_free_root        |
                        |  // cgrp was freed, UAF
                        |  cgroup_path_ns_locked(cgrp,..);

When the cpuset is initialized, the root node top_cpuset.css.cgrp
will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will
allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated
&cgroup_root.cgrp. When the umount operation is executed,
top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.

The problem is that when rebinding to cgrp_dfl_root, there are cases
where the cgroup_root allocated by setting up the root for cgroup v1
is cached. This could lead to a Use-After-Free (UAF) if it is
subsequently freed. The descendant cgroups of cgroup v1 can only be
freed after the css is released. However, the css of the root will never
be released, yet the cgroup_root should be freed when it is unmounted.
This means that obtaining a reference to the css of the root does
not guarantee that css.cgrp->root will not be freed.

To solve this issue, we have added a cgroup reference count in
the proc_cpuset_show function to ensure that css.cgrp->root will not
be freed prematurely. This is a temporary solution. Let's see if anyone
has a better solution.
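
The idea of pinning the root with an extra reference can be illustrated
with a small userspace model. This is a sketch only: struct toy_root and
the toy_root_*() helpers are invented for illustration and are not kernel
APIs, and the model ignores the window between loading the pointer and
taking the reference, which the patch below covers with rcu_read_lock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a v1 cgroup root; not a kernel type. */
struct toy_root {
	atomic_int refcnt;
	int *freed;		/* set to 1 when the object is released */
	char path[16];
};

struct toy_root *toy_root_alloc(const char *path, int *freed)
{
	struct toy_root *r = calloc(1, sizeof(*r));

	assert(r && freed);
	atomic_init(&r->refcnt, 1);	/* reference owned by the mount */
	r->freed = freed;
	*freed = 0;
	snprintf(r->path, sizeof(r->path), "%s", path);
	return r;
}

void toy_root_get(struct toy_root *r)
{
	atomic_fetch_add(&r->refcnt, 1);
}

void toy_root_put(struct toy_root *r)
{
	/* fetch_sub returns the old value: 1 means this was the last ref */
	if (atomic_fetch_sub(&r->refcnt, 1) == 1) {
		*r->freed = 1;
		free(r);
	}
}
```

In this model a reader calls toy_root_get() before using r->path and
toy_root_put() afterwards. If the "umount" side drops its own reference in
between, the object stays alive until the reader's put.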

Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
---
  kernel/cgroup/cpuset.c | 20 ++++++++++++++++++++
  1 file changed, 20 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c12b9fdb22a4..782eaf807173 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -5045,6 +5045,7 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
      char *buf;
      struct cgroup_subsys_state *css;
      int retval;
+    struct cgroup *root_cgroup = NULL;

      retval = -ENOMEM;
      buf = kmalloc(PATH_MAX, GFP_KERNEL);
@@ -5052,9 +5053,28 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
          goto out;

      css = task_get_css(tsk, cpuset_cgrp_id);
+    rcu_read_lock();
+    /*
+     * When the cpuset subsystem is mounted on the legacy hierarchy,
+     * the top_cpuset.css->cgroup does not hold a reference count of
+     * cgroup_root.cgroup. This makes accessing css->cgroup very
+     * dangerous because when the cpuset subsystem is remounted to the
+     * default hierarchy, the cgroup_root.cgroup that css->cgroup points
+     * to will be released, leading to a UAF issue. To avoid this problem,
+     * get the reference count of top_cpuset.css->cgroup first.
+     *
+     * This is ugly!!
+     */
+    if (css == &top_cpuset.css) {
+        cgroup_get(css->cgroup);
+        root_cgroup = css->cgroup;
+    }
+    rcu_read_unlock();
      retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
                  current->nsproxy->cgroup_ns);
      css_put(css);
+    if (root_cgroup)
+        cgroup_put(root_cgroup);
      if (retval == -E2BIG)
          retval = -ENAMETOOLONG;
      if (retval < 0)

Thanks for reporting this UAF bug. Could you try the attached patch 
to see if it can fix the issue?


+/*
+ * With a cgroup v1 mount, root_css.cgroup can be freed. We need to take a
+ * reference to it to avoid UAF as proc_cpuset_show() may access the content
+ * of this cgroup.
+ */
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
+    static struct cgroup *v1_cgroup_root;
+
     mutex_lock(&cpuset_mutex);
+    if (v1_cgroup_root) {
+        cgroup_put(v1_cgroup_root);
+        v1_cgroup_root = NULL;
+    }
     spin_lock_irq(&callback_lock);

     if (is_in_v2_mode()) {
@@ -4159,6 +4170,10 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
     }

     spin_unlock_irq(&callback_lock);
+    if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+        v1_cgroup_root = root_css->cgroup;
+        cgroup_get(v1_cgroup_root);
+    }
     mutex_unlock(&cpuset_mutex);
 }

Thanks for your suggestion. If we take a reference in the rebind path
(the ->bind() callback), the cgroup_root allocated when setting up the
root for cgroup v1 can never be released, because its reference count
will never drop to zero.

We have already tried similar methods to fix this issue; however, doing
so causes the problem described above.

You are right. Taking the reference in cpuset_bind() will prevent 
cgroup_destroy_root() from being called. I had overlooked that.
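
The circular dependency can be seen in a toy userspace model. It is a
sketch under simplifying assumptions: struct toy_root, toy_bind() and
toy_put() are hypothetical names, plain ints replace the kernel's
refcounting, and "destroy" just sets a flag. The pin taken in the bind
callback is only dropped by the next bind call, but that call only happens
on the destroy path, which needs the count to reach zero first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup root; not a kernel type. */
struct toy_root {
	int refcnt;
	int destroyed;
};

/* Plays the role of the static v1_cgroup_root pointer in the v1 patch. */
static struct toy_root *pinned;

/* Model of a ->bind() callback that pins the v1 root it is bound to. */
void toy_bind(struct toy_root *root, int is_v1)
{
	if (pinned) {
		pinned->refcnt--;	/* drop the pin taken by the last bind */
		pinned = NULL;
	}
	if (is_v1) {
		pinned = root;
		root->refcnt++;		/* pin the new v1 root */
	}
}

/* Drop one reference; the destroy path (and its rebind) runs only at zero. */
void toy_put(struct toy_root *root, struct toy_root *dfl)
{
	assert(root->refcnt > 0);
	if (--root->refcnt == 0) {
		toy_bind(dfl, 0);	/* rebind to the default root */
		root->destroyed = 1;
	}
}
```

After toy_bind(&v1, 1) the v1 root holds two references. The unmount drops
only one of them, so the count stays at one, the destroy path never runs,
and the pin is never released.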

Now I have an even simpler fix. Could you try the attached v2 patch to 
verify if that can fix the problem?

Thanks,
Longman
From 2996235545433ce25e917af11f4985d7b6880764 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@xxxxxxxxxx>
Date: Mon, 24 Jun 2024 19:53:32 -0400
Subject: [PATCH v2] cgroup/cpuset: Prevent UAF in proc_cpuset_show()

The unmounting of a cpuset cgroup filesystem will lead to a call to
cpuset_bind() to rebind it back to &cgrp_dfl_root.cgrp via the following
call sequence.

  cgroup_destroy_root()
  --> rebind_subsystems()
  --> cpuset_bind()

The call to cpuset_bind() is done after setting top_cpuset.css.cgroup
to the &cgrp_dfl_root.cgrp. The allocated v1 cgroup root will be freed
after the completion of the cpuset_bind() call and other miscellaneous
cleanups.

Fix this potential UAF problem by putting the access and parsing
of top_cpuset.css.cgroup under cpuset_mutex to synchronize with
cpuset_bind() of the unmount operation. If the cpuset_mutex is acquired
after cpuset_bind(), top_cpuset.css.cgroup is guaranteed to point to
cgrp_dfl_root.cgrp. If it is acquired before cpuset_bind(), the allocated
v1 cgroup root cannot be freed until after the cpuset_mutex is released.
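
The ordering argument above can be modelled in userspace as follows. This
is a sketch, not kernel code: struct toy_root, top_root, bind_mutex and the
toy_*() functions are hypothetical stand-ins for the v1 cgroup root,
top_cpuset.css.cgroup, cpuset_mutex, and the mount, unmount and
proc_cpuset_show() paths.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cgroup root; not a kernel type. */
struct toy_root {
	char name[16];
};

static struct toy_root dfl_root = { "dfl" };
static _Atomic(struct toy_root *) top_root = &dfl_root;
static pthread_mutex_t bind_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Mount: allocate a v1 root and point top_root at it. */
struct toy_root *toy_mount_v1(const char *name)
{
	struct toy_root *r = calloc(1, sizeof(*r));

	assert(r);
	snprintf(r->name, sizeof(r->name), "%s", name);
	atomic_store(&top_root, r);
	return r;
}

/* Model of the bind callback: it only needs to take and drop the mutex. */
static void toy_bind(void)
{
	pthread_mutex_lock(&bind_mutex);
	pthread_mutex_unlock(&bind_mutex);
}

/* Unmount: rebind first, then bind (waits for readers), then free. */
void toy_umount_v1(struct toy_root *r)
{
	atomic_store(&top_root, &dfl_root);
	toy_bind();
	free(r);
}

/* Reader: load and dereference the root pointer only under the mutex. */
void toy_show(char *buf, size_t len)
{
	struct toy_root *r;

	pthread_mutex_lock(&bind_mutex);
	r = atomic_load(&top_root);
	snprintf(buf, len, "%s", r->name);
	pthread_mutex_unlock(&bind_mutex);
}
```

A reader that takes bind_mutex before toy_bind() blocks the unmount from
reaching free() until it unlocks. A reader that takes it after toy_bind()
is ordered after the store of &dfl_root and therefore sees the default root.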

A similar UAF problem in proc_cpuset_show() had been reported before in
[1].

[1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd

Reported-by: Chen Ridong <chenridong@xxxxxxxxxx>
Closes: https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
 kernel/cgroup/cpuset.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c12b9fdb22a4..953150a06d81 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -5051,10 +5051,17 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
 	if (!buf)
 		goto out;
 
+	/*
+	 * Access to css->cgroup is guarded by cpuset_mutex to synchronize
+	 * with the cpuset_bind() call of a racing v1 cgroup root unmount
+	 * operation to prevent UAF.
+	 */
+	mutex_lock(&cpuset_mutex);
 	css = task_get_css(tsk, cpuset_cgrp_id);
 	retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
 				current->nsproxy->cgroup_ns);
 	css_put(css);
+	mutex_unlock(&cpuset_mutex);
 	if (retval == -E2BIG)
 		retval = -ENAMETOOLONG;
 	if (retval < 0)
-- 
2.39.3