On Fri, Jan 20, 2023 at 1:00 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
> On Thu, 19 Jan 2023 17:37:11 -0800 Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > On Thu, Jan 19, 2023 at 5:31 PM Hillf Danton <hdanton@xxxxxxxx> wrote:
> > > On Thu, 19 Jan 2023 13:01:42 -0800 Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > >
> > > > Hi Folks,
> > > > I spent some more time digging into the details and this is what's
> > > > happening. When we call rmdir to delete the cgroup with the pressure
> > > > file being epoll'ed, roughly the following call chain happens in the
> > > > context of the shell process:
> > > >
> > > > do_rmdir
> > > >   cgroup_rmdir
> > > >     kernfs_drain_open_files
> > > >       cgroup_file_release
> > > >         cgroup_pressure_release
> > > >           psi_trigger_destroy
> > > >
> > > > Later on in the context of our reproducer, the last fput() is called
> > > > causing wait queue removal:
> > > >
> > > > fput
> > > >   ep_eventpoll_release
> > > >     ep_free
> > > >       ep_remove_wait_queue
> > > >         remove_wait_queue
> > > >
> > > > By this time psi_trigger_destroy() already destroyed the trigger's
> > > > waitqueue head and we hit UAF.
> > > > I think the conceptual problem here (or maybe that's by design?) is
> > > > that cgroup_file_release() is not really tied to the file's real
> > > > lifetime (when the last fput() is issued). Otherwise fput() would call
> > > > eventpoll_release() before f_op->release() and the order would be fine
> > > > (we would remove the wait queue first in eventpoll_release() and then
> > > > f_op->release() would cause trigger's destruction).
> > >
> > > eventpoll_release
> > >   eventpoll_release_file
> > >     ep_remove
> > >       ep_unregister_pollwait
> > >         ep_remove_wait_queue
> > >
> >
> > Yes but fput() calls eventpoll_release() *before* f_op->release(), so
> > waitqueue_head would be removed before trigger destruction.
>
> Then check if file is polled before destroying trigger.
>
> +++ b/kernel/sched/psi.c
> @@ -1529,6 +1529,7 @@ static int psi_fop_release(struct inode
>  {
>  	struct seq_file *seq = file->private_data;
>
> +	eventpoll_release_file(file);

Be careful here and see the comment in
https://elixir.bootlin.com/linux/latest/source/fs/eventpoll.c#L912.
eventpoll_release_file() assumes that the last fput() was called and
that nobody other than ep_free() will race with us, so this will not
be that simple.
Besides, if we really need to fix the order here, the fix should be
somewhere at the level of cgroup_file_release() or even kernfs, so that
it also works for other similar situations.

>  	psi_trigger_destroy(seq->private);
>  	return single_release(inode, file);
>  }
>
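To illustrate why the order matters, here is a minimal userspace model of the two orderings discussed in this thread. It is a sketch for illustration only, not kernel code: every name prefixed with model_ is hypothetical, and "freeing" the trigger's waitqueue head is modeled by clearing a flag instead of calling kfree(), so the model reports where the real code would hit the use-after-free rather than actually triggering it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a waitqueue head owned by a psi trigger.
 * "alive == false" stands in for the head having been freed. */
struct model_wq_head {
	bool alive;
	int nr_entries; /* wait entries (e.g. epoll's) still linked here */
};

struct model_trigger {
	struct model_wq_head head; /* the trigger owns the waitqueue head */
};

/* Models psi_trigger_destroy(): after this the head must not be touched. */
static void model_trigger_destroy(struct model_trigger *t)
{
	t->head.alive = false;
}

/* Models epoll linking its wait entry into the head when the file is polled. */
static void model_add_wait_queue(struct model_wq_head *h)
{
	h->nr_entries++;
}

/* Models remove_wait_queue(). Returns false where the real kernel
 * would dereference an already-freed head (the UAF). */
static bool model_remove_wait_queue(struct model_wq_head *h)
{
	if (!h->alive)
		return false;
	assert(h->nr_entries > 0);
	h->nr_entries--;
	return true;
}

/* Ordering from the report: the rmdir path destroys the trigger first,
 * and closing the epoll instance later removes the wait entry. */
static bool model_rmdir_then_fput(void)
{
	struct model_trigger t = { .head = { .alive = true, .nr_entries = 0 } };

	model_add_wait_queue(&t.head);           /* file is epoll'ed */
	model_trigger_destroy(&t);               /* cgroup_rmdir -> ... -> psi_trigger_destroy */
	return model_remove_wait_queue(&t.head); /* fput -> ... -> remove_wait_queue */
}

/* Ordering when release is tied to the last fput(): eventpoll_release()
 * runs before f_op->release(), so the entry is gone before the head. */
static bool model_fput_order(void)
{
	struct model_trigger t = { .head = { .alive = true, .nr_entries = 0 } };

	model_add_wait_queue(&t.head);
	bool ok = model_remove_wait_queue(&t.head); /* eventpoll_release() */
	model_trigger_destroy(&t);                  /* f_op->release() */
	return ok && t.head.nr_entries == 0;
}
```

In this model model_rmdir_then_fput() returns false (the point where the real code would touch freed memory) and model_fput_order() returns true, which is the difference between cgroup_file_release() running from kernfs_drain_open_files() and running from the last fput().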