On Wed, Jun 14, 2023 at 10:40 AM Eric Biggers <ebiggers@xxxxxxxxxx> wrote: > > On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: > > We found a UAF bug in remove_wait_queue as follows: > > > > ================================================================== > > BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > > Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > > Call Trace: > > dump_stack+0x9c/0xd3 > > print_address_description.constprop.0+0x19/0x170 > > __kasan_report.cold+0x6c/0x84 > > kasan_report+0x3a/0x50 > > check_memory_region+0xfd/0x1f0 > > _raw_spin_lock_irqsave+0x71/0xe0 > > remove_wait_queue+0x26/0xc0 > > poll_freewait+0x6b/0x120 > > do_sys_poll+0x305/0x400 > > do_syscall_64+0x33/0x40 > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > > Allocated by task 15306: > > kasan_save_stack+0x1b/0x40 > > __kasan_kmalloc.constprop.0+0xb5/0xe0 > > psi_trigger_create.part.0+0xfc/0x450 > > cgroup_pressure_write+0xfc/0x3b0 > > cgroup_file_write+0x1b3/0x390 > > kernfs_fop_write_iter+0x224/0x2e0 > > new_sync_write+0x2ac/0x3a0 > > vfs_write+0x365/0x430 > > ksys_write+0xd5/0x1b0 > > do_syscall_64+0x33/0x40 > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > > > Freed by task 15850: > > kasan_save_stack+0x1b/0x40 > > kasan_set_track+0x1c/0x30 > > kasan_set_free_info+0x20/0x40 > > __kasan_slab_free+0x151/0x180 > > kfree+0xba/0x680 > > cgroup_file_release+0x5c/0xe0 > > kernfs_drain_open_files+0x122/0x1e0 > > kernfs_drain+0xff/0x1e0 > > __kernfs_remove.part.0+0x1d1/0x3b0 > > kernfs_remove_by_name_ns+0x89/0xf0 > > cgroup_addrm_files+0x393/0x3d0 > > css_clear_dir+0x8f/0x120 > > kill_css+0x41/0xd0 > > cgroup_destroy_locked+0x166/0x300 > > cgroup_rmdir+0x37/0x140 > > kernfs_iop_rmdir+0xbb/0xf0 > > vfs_rmdir.part.0+0xa5/0x230 > > do_rmdir+0x2e0/0x320 > > __x64_sys_unlinkat+0x99/0xc0 > > do_syscall_64+0x33/0x40 > > entry_SYSCALL_64_after_hwframe+0x61/0xc6 > > ================================================================== > > > > If using epoll(), wake_up_pollfree will empty waitqueue and set > > wait_queue_head is NULL before free waitqueue of psi trigger. But is > > doesn't work when using poll(), which will lead a UAF problem in > > poll_freewait coms as following: > > > > (cgroup_rmdir) | > > psi_trigger_destroy | > > wake_up_pollfree(&t->event_wait) | > > synchronize_rcu(); | > > kfree(t) | > > | (poll_freewait) > > | free_poll_entry(pwq->inline_entries + i) > > | remove_wait_queue(entry->wait_address) > > | spin_lock_irqsave(&wq_head->lock) > > > > entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). > > t->event_wait is free in psi_trigger_destroy before call poll_freewait(), > > therefore wq_head in poll_freewait() has been already freed, which would > > lead to a UAF. > > > > similar problem for epoll() has been fixed commit c2dbe32d5db5 > > ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). > > epoll wakeup function ep_poll_callback() will empty waitqueue and set > > wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead > > is NULL or not before remove_wait_queue in ep_remove_wait_queue(), > > which will fix the UAF bug in ep_remove_wait_queue. > > > > But poll wakeup function pollwake() doesn't do that. To fix the > > problem, we empty waitqueue and set wait_address is NULL in pollwake() when > > key is POLLFREE. otherwise in remove_wait_queue, which is similar to > > epoll(). > > > > Fixes: 0e94682b73bf ("psi: introduce psi monitor") > > Suggested-by: Suren Baghdasaryan <surenb@xxxxxxxxxx> > > Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@xxxxxxxxxxxxxx/#t > > Signed-off-by: Lu Jialin <lujialin4@xxxxxxxxxx> > > --- > > v2: correct commit msg and title suggested by Suren Baghdasaryan > > --- > > fs/select.c | 20 +++++++++++++++++++- > > 1 file changed, 19 insertions(+), 1 deletion(-) > > > > diff --git a/fs/select.c b/fs/select.c > > index 0ee55af1a55c..e64c7b4e9959 100644 > > --- a/fs/select.c > > +++ b/fs/select.c > > @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); > > > > static void free_poll_entry(struct poll_table_entry *entry) > > { > > - remove_wait_queue(entry->wait_address, &entry->wait); > > + wait_queue_head_t *whead; > > + > > + rcu_read_lock(); > > + /* If it is cleared by POLLFREE, it should be rcu-safe. > > + * If we read NULL we need a barrier paired with smp_store_release() > > + * in pollwake(). > > + */ > > + whead = smp_load_acquire(&entry->wait_address); > > + if (whead) > > + remove_wait_queue(whead, &entry->wait); > > + rcu_read_unlock(); > > fput(entry->filp); > > } > > > > @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key > > entry = container_of(wait, struct poll_table_entry, wait); > > if (key && !(key_to_poll(key) & entry->key)) > > return 0; > > + if (key_to_poll(key) & POLLFREE) { > > + list_del_init(&wait->entry); > > + /* wait_address !=NULL protects us from the race with > > + * poll_freewait(). > > + */ > > + smp_store_release(&entry->wait_address, NULL); > > + return 0; > > + } > > return __pollwake(wait, mode, sync, key); > > I don't understand why this patch is needed. > > The last time I looked at POLLFREE, it is only needed because of asynchronous > polls. See my explanation in the commit message of commit 50252e4b5e989ce6. Ah, I missed that. Thanks for the correction. > > In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is > tied to the current task rather than to the file being polled. Also refer to > the comment above wake_up_pollfree(), which mentions this. > > fs/select.c is synchronous polling, not asynchronous. Therefore, it should not > need to handle POLLFREE. > > If there's actually a bug here, most likely it's a bug in psi_trigger_poll() > where it is using a waitqueue whose lifetime is tied to neither the current task > nor the file being polled. That needs to be fixed. Yeah. We discussed this issue in https://lore.kernel.org/all/CAJuCfpFb0J5ZwO6kncjRG0_4jQLXUy-_dicpH5uGiWP8aKYEJQ@xxxxxxxxxxxxxx and the root cause is that cgroup_file_release() where psi_trigger_destroy() is called is not tied to the cgroup file's real lifetime (see my analysis here: https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@xxxxxxxxxxxxxx/#t). I guess it's time to do a deeper surgery and figure out a way to call psi_trigger_destroy() when the polled cgroup file is actually being destroyed. I'll take a closer look into this later today. A fix will likely require some cgroup or kernfs code changes, so CC'ing Tejun for visibility. Thanks, Suren. > > - Eric