On Fri, Feb 14, 2025 at 03:31:54PM -0700, Jens Axboe wrote:
> I'll get it queued up. I do think for a better fix, we could rely on
> task_work on the actual task in question. Because that will be run once
> it exits to userspace, which will deliver any pending signals as well.
> That should be a better gating mechanism for the retry. But that will
> most likely become more involved, so I think doing something like this
> first is fine.

How would that work? task_work_run is called from various places,
including get_signal where we're fairly likely to have a signal pending.
I don't think there is a way to get a task_work item to run only when
we're guaranteed that no signal is pending. There is the "resume user
mode work" stuff but that looks like it is only about the notification
mechanism - the work item itself is not marked in any way and may be
executed "sooner" e.g. if the task gets a signal.

This also doesn't work for retries past the first - in that case, when
we fail create_io_thread, we're already in task_work context, and
immediately queueing a task_work for the retry there won't work, as the
very same invocation of task_work_run that we're currently in will pick
up the new work as well. I assume that was the whole reason why we
bounced queueing the retry to a kworker, only to come back to the
original task via task_work in the first place.

I also thought it might be worth studying what fork() and friends do,
since they have to deal with a similar problem. These syscalls seem to
do their retry by editing the syscalling task's registers before
returning to userspace in such a way that the syscall instruction is
executed again. If there's a signal that needs to be delivered, the
signal handler in userspace is called before the retry executes. This
solution seems very specific to a syscall and I don't think we can take
inspiration from it given that we are calling copy_process from
task_work...
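To make the re-queueing concern concrete, here is a small userspace toy
model, not kernel code. It assumes task_work_run keeps draining the
pending list until it is empty, so work queued from inside a callback is
run by the same invocation. All names (toy_work, toy_work_add,
toy_work_run, retry_create, simulate_retries) are hypothetical, and the
"failure" of thread creation is just a counter standing in for a signal
that stays pending.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a queued work item (hypothetical, not the kernel type). */
struct toy_work {
    struct toy_work *next;
    void (*func)(struct toy_work *);
};

/* Toy stand-in for the per-task pending work list. */
static struct toy_work *pending;

/* Push a work item onto the pending list. */
static void toy_work_add(struct toy_work *w, void (*func)(struct toy_work *))
{
    assert(w != NULL && func != NULL);
    w->func = func;
    w->next = pending;
    pending = w;
}

/*
 * Drain the pending list until it is empty. Items queued by a callback
 * are picked up by the next pass of the outer loop, i.e. within this
 * same invocation. Returns the number of callbacks executed.
 */
static int toy_work_run(void)
{
    int ran = 0;

    for (;;) {
        struct toy_work *w = pending;   /* grab the whole list */

        pending = NULL;
        if (!w)
            break;
        while (w) {
            struct toy_work *next = w->next;  /* read before func may requeue w */

            w->func(w);
            ran++;
            w = next;
        }
    }
    return ran;
}

static int failures_left;  /* models "signal still pending" */
static int attempts;       /* creation attempts made so far */

/* Models the retry callback: on failure it immediately requeues itself. */
static void retry_create(struct toy_work *w)
{
    attempts++;
    if (failures_left > 0) {
        failures_left--;
        toy_work_add(w, retry_create);
    }
}

/* Returns how many attempts happen inside a single toy_work_run() call. */
int simulate_retries(int failures)
{
    static struct toy_work w;

    failures_left = failures;
    attempts = 0;
    pending = NULL;
    toy_work_add(&w, retry_create);
    toy_work_run();
    return attempts;
}
```

In this model, simulate_retries(3) makes all four attempts inside one
toy_work_run() call. Nothing returns to userspace in between, so if the
failure is caused by a pending signal, requeueing from the callback
gives that signal no chance to be delivered before the next attempt.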