On Sun, Jun 18, 2023 at 6:28 AM lujialin (A) <lujialin4@xxxxxxxxxx> wrote: > > Hi Suren: > > kernel config: > x86_64_defconfig > CONFIG_PSI=y > CONFIG_SLUB_DEBUG=y > CONFIG_SLUB_DEBUG_ON=y > CONFIG_KASAN=y > CONFIG_KASAN_INLINE=y > > I make some change in code, in order to increase the recurrence probability. > diff --git a/fs/select.c b/fs/select.c > index 5edffee1162c..5ee5b74a8386 100644 > --- a/fs/select.c > +++ b/fs/select.c > @@ -139,6 +139,7 @@ void poll_freewait(struct poll_wqueues *pwq) > { > struct poll_table_page * p = pwq->table; > int i; > + mdelay(50); > for (i = 0; i < pwq->inline_index; i++) > free_poll_entry(pwq->inline_entries + i); > while (p) { > > Here is the simple repo test.sh: > #!/bin/bash > > RESOURCE_TYPES=("cpu" "memory" "io" "irq") > #RESOURCE_TYPES=("cpu") > cgroup_num=50 > test_dir=/sys/fs/cgroup/test > > function restart_cgroup() { > num=$(expr $RANDOM % $cgroup_num + 1) > rmdir $test_dir/test_$num > mkdir $test_dir/test_$num > } > > function create_triggers() { > num=$(expr $RANDOM % $cgroup_num + 1) > random=$(expr $RANDOM % "${#RESOURCE_TYPES[@]}") > psi_type="${RESOURCE_TYPES[${random}]}" > ./psi_monitor $test_dir/test_$num $psi_type & > } > > mkdir $test_dir > for i in $(seq 1 $cgroup_num) > do > mkdir $test_dir/test_$i > done > for j in $(seq 1 100) > do > restart_cgroup & > create_triggers & > done > > psi_monitor.c: > #include <errno.h> > #include <fcntl.h> > #include <stdio.h> > #include <poll.h> > #include <string.h> > #include <unistd.h> > > int main(int argc, char *argv[]) { > const char trig[] = "full 1000000 1000000"; > struct pollfd fds; > char filename[100]; > > sprintf(filename, "%s/%s.pressure", argv[1], argv[2]); > > fds.fd = open(filename, O_RDWR | O_NONBLOCK); > if (fds.fd < 0) { > printf("%s open error: %s\n", filename,strerror(errno)); > return 1; > } > fds.events = POLLPRI; > if (write(fds.fd, trig, strlen(trig) + 1) < 0) { > printf("%s write error: %s\n",filename,strerror(errno)); > return 1; > } > while (1) { > poll(&fds, 1, -1); > } > close(fds.fd); > return 0; > } Thanks a lot! I'll try to get this reproduced and fixed by the end of this week. Suren. > Thanks, > Lu > 在 2023/6/16 7:13, Suren Baghdasaryan 写道: > > On Wed, Jun 14, 2023 at 11:19 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > >> > >> On Wed, Jun 14, 2023 at 10:40 AM Eric Biggers <ebiggers@xxxxxxxxxx> wrote: > >>> > >>> On Wed, Jun 14, 2023 at 03:07:33PM +0800, Lu Jialin wrote: > >>>> We found a UAF bug in remove_wait_queue as follows: > >>>> > >>>> ================================================================== > >>>> BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x71/0xe0 > >>>> Write of size 4 at addr ffff8881150d7b28 by task psi_trigger/15306 > >>>> Call Trace: > >>>> dump_stack+0x9c/0xd3 > >>>> print_address_description.constprop.0+0x19/0x170 > >>>> __kasan_report.cold+0x6c/0x84 > >>>> kasan_report+0x3a/0x50 > >>>> check_memory_region+0xfd/0x1f0 > >>>> _raw_spin_lock_irqsave+0x71/0xe0 > >>>> remove_wait_queue+0x26/0xc0 > >>>> poll_freewait+0x6b/0x120 > >>>> do_sys_poll+0x305/0x400 > >>>> do_syscall_64+0x33/0x40 > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > >>>> > >>>> Allocated by task 15306: > >>>> kasan_save_stack+0x1b/0x40 > >>>> __kasan_kmalloc.constprop.0+0xb5/0xe0 > >>>> psi_trigger_create.part.0+0xfc/0x450 > >>>> cgroup_pressure_write+0xfc/0x3b0 > >>>> cgroup_file_write+0x1b3/0x390 > >>>> kernfs_fop_write_iter+0x224/0x2e0 > >>>> new_sync_write+0x2ac/0x3a0 > >>>> vfs_write+0x365/0x430 > >>>> ksys_write+0xd5/0x1b0 > >>>> do_syscall_64+0x33/0x40 > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > >>>> > >>>> Freed by task 15850: > >>>> kasan_save_stack+0x1b/0x40 > >>>> kasan_set_track+0x1c/0x30 > >>>> kasan_set_free_info+0x20/0x40 > >>>> __kasan_slab_free+0x151/0x180 > >>>> kfree+0xba/0x680 > >>>> cgroup_file_release+0x5c/0xe0 > >>>> kernfs_drain_open_files+0x122/0x1e0 > >>>> kernfs_drain+0xff/0x1e0 > >>>> __kernfs_remove.part.0+0x1d1/0x3b0 > >>>> kernfs_remove_by_name_ns+0x89/0xf0 > >>>> cgroup_addrm_files+0x393/0x3d0 > >>>> css_clear_dir+0x8f/0x120 > >>>> kill_css+0x41/0xd0 > >>>> cgroup_destroy_locked+0x166/0x300 > >>>> cgroup_rmdir+0x37/0x140 > >>>> kernfs_iop_rmdir+0xbb/0xf0 > >>>> vfs_rmdir.part.0+0xa5/0x230 > >>>> do_rmdir+0x2e0/0x320 > >>>> __x64_sys_unlinkat+0x99/0xc0 > >>>> do_syscall_64+0x33/0x40 > >>>> entry_SYSCALL_64_after_hwframe+0x61/0xc6 > >>>> ================================================================== > >>>> > >>>> If using epoll(), wake_up_pollfree will empty waitqueue and set > >>>> wait_queue_head is NULL before free waitqueue of psi trigger. But is > >>>> doesn't work when using poll(), which will lead a UAF problem in > >>>> poll_freewait coms as following: > >>>> > >>>> (cgroup_rmdir) | > >>>> psi_trigger_destroy | > >>>> wake_up_pollfree(&t->event_wait) | > >>>> synchronize_rcu(); | > >>>> kfree(t) | > >>>> | (poll_freewait) > >>>> | free_poll_entry(pwq->inline_entries + i) > >>>> | remove_wait_queue(entry->wait_address) > >>>> | spin_lock_irqsave(&wq_head->lock) > >>>> > >>>> entry->wait_address in poll_freewait() is t->event_wait in cgroup_rmdir(). > >>>> t->event_wait is free in psi_trigger_destroy before call poll_freewait(), > >>>> therefore wq_head in poll_freewait() has been already freed, which would > >>>> lead to a UAF. > > > > Hi Lu, > > Could you please share your reproducer along with the kernel config > > you used? I'm trying to reproduce this UAF but every time I delete the > > cgroup being polled, poll() simply returns POLLERR. > > Thanks, > > Suren. > > > >>>> > >>>> similar problem for epoll() has been fixed commit c2dbe32d5db5 > >>>> ("sched/psi: Fix use-after-free in ep_remove_wait_queue()"). > >>>> epoll wakeup function ep_poll_callback() will empty waitqueue and set > >>>> wait_queue_head is NULL when pollflags is POLLFREE and judge pwq->whead > >>>> is NULL or not before remove_wait_queue in ep_remove_wait_queue(), > >>>> which will fix the UAF bug in ep_remove_wait_queue. > >>>> > >>>> But poll wakeup function pollwake() doesn't do that. To fix the > >>>> problem, we empty waitqueue and set wait_address is NULL in pollwake() when > >>>> key is POLLFREE. otherwise in remove_wait_queue, which is similar to > >>>> epoll(). > >>>> > >>>> Fixes: 0e94682b73bf ("psi: introduce psi monitor") > >>>> Suggested-by: Suren Baghdasaryan <surenb@xxxxxxxxxx> > >>>> Link: https://lore.kernel.org/all/CAJuCfpEoCRHkJF-=1Go9E94wchB4BzwQ1E3vHGWxNe+tEmSJoA@xxxxxxxxxxxxxx/#t > >>>> Signed-off-by: Lu Jialin <lujialin4@xxxxxxxxxx> > >>>> --- > >>>> v2: correct commit msg and title suggested by Suren Baghdasaryan > >>>> --- > >>>> fs/select.c | 20 +++++++++++++++++++- > >>>> 1 file changed, 19 insertions(+), 1 deletion(-) > >>>> > >>>> diff --git a/fs/select.c b/fs/select.c > >>>> index 0ee55af1a55c..e64c7b4e9959 100644 > >>>> --- a/fs/select.c > >>>> +++ b/fs/select.c > >>>> @@ -132,7 +132,17 @@ EXPORT_SYMBOL(poll_initwait); > >>>> > >>>> static void free_poll_entry(struct poll_table_entry *entry) > >>>> { > >>>> - remove_wait_queue(entry->wait_address, &entry->wait); > >>>> + wait_queue_head_t *whead; > >>>> + > >>>> + rcu_read_lock(); > >>>> + /* If it is cleared by POLLFREE, it should be rcu-safe. > >>>> + * If we read NULL we need a barrier paired with smp_store_release() > >>>> + * in pollwake(). > >>>> + */ > >>>> + whead = smp_load_acquire(&entry->wait_address); > >>>> + if (whead) > >>>> + remove_wait_queue(whead, &entry->wait); > >>>> + rcu_read_unlock(); > >>>> fput(entry->filp); > >>>> } > >>>> > >>>> @@ -215,6 +225,14 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key > >>>> entry = container_of(wait, struct poll_table_entry, wait); > >>>> if (key && !(key_to_poll(key) & entry->key)) > >>>> return 0; > >>>> + if (key_to_poll(key) & POLLFREE) { > >>>> + list_del_init(&wait->entry); > >>>> + /* wait_address !=NULL protects us from the race with > >>>> + * poll_freewait(). > >>>> + */ > >>>> + smp_store_release(&entry->wait_address, NULL); > >>>> + return 0; > >>>> + } > >>>> return __pollwake(wait, mode, sync, key); > >>> > >>> I don't understand why this patch is needed. > >>> > >>> The last time I looked at POLLFREE, it is only needed because of asynchronous > >>> polls. See my explanation in the commit message of commit 50252e4b5e989ce6. > >> > >> Ah, I missed that. Thanks for the correction. > >> > >>> > >>> In summary, POLLFREE solves the problem of polled waitqueues whose lifetime is > >>> tied to the current task rather than to the file being polled. Also refer to > >>> the comment above wake_up_pollfree(), which mentions this. > >>> > >>> fs/select.c is synchronous polling, not asynchronous. Therefore, it should not > >>> need to handle POLLFREE. > >>> > >>> If there's actually a bug here, most likely it's a bug in psi_trigger_poll() > >>> where it is using a waitqueue whose lifetime is tied to neither the current task > >>> nor the file being polled. That needs to be fixed. > >> > >> Yeah. We discussed this issue in > >> https://lore.kernel.org/all/CAJuCfpFb0J5ZwO6kncjRG0_4jQLXUy-_dicpH5uGiWP8aKYEJQ@xxxxxxxxxxxxxx > >> and the root cause is that cgroup_file_release() where > >> psi_trigger_destroy() is called is not tied to the cgroup file's real > >> lifetime (see my analysis here: > >> https://lore.kernel.org/all/CAJuCfpFZ3B4530TgsSHqp5F_gwfrDujwRYewKReJru==MdEHQg@xxxxxxxxxxxxxx/#t). > >> I guess it's time to do a deeper surgery and figure out a way to call > >> psi_trigger_destroy() when the polled cgroup file is actually being > >> destroyed. I'll take a closer look into this later today. > >> A fix will likely require some cgroup or kernfs code changes, so > >> CC'ing Tejun for visibility. > >> Thanks, > >> Suren. > >> > >>> > >>> - Eric > > . > >