On Thu, Jan 19, 2023 at 01:01:42PM -0800, Suren Baghdasaryan wrote: > I spent some more time digging into the details and this is what's > happening. When we call rmdir to delete the cgroup with the pressure > file being epoll'ed, roughly the following call chain happens in the > context of the shell process: > > do_rmdir > cgroup_rmdir > kernfs_drain_open_files > cgroup_file_release > cgroup_pressure_release > psi_trigger_destroy > > Later on in the context of our reproducer, the last fput() is called > causing wait queue removal: > > fput > ep_eventpoll_release > ep_free > ep_remove_wait_queue > remove_wait_queue > > By this time psi_trigger_destroy() already destroyed the trigger's > waitqueue head and we hit UAF. > I think the conceptual problem here (or maybe that's by design?) is > that cgroup_file_release() is not really tied to the file's real > lifetime (when the last fput() is issued). Otherwise fput() would call > eventpoll_release() before f_op->release() and the order would be fine > (we would remove the wait queue first in eventpoll_release() and then > f_op->release() would cause trigger's destruction). > Considering these findings, I think we can use the wake_up_pollfree() > without contradicting the comment at > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L253 > because indeed, cgroup_file_release() and therefore > psi_trigger_destroy() are not tied to the file's lifetime. > > I'm CC'ing Tejun to check if this makes sense to him and > cgroup_file_release() is working as expected in this case. > > Munehisha, if Tejun confirms this is all valid, could you please post > a patch replacing wake_up_interruptible() with wake_up_pollfree()? We > don't need to worry about wake_up_all() because we have a limitation > of one trigger per file descriptor: Solid analysis! Indeed, wake_up_pollfree() should fix it.