On Thu, Feb 13, 2020 at 10:03:54PM -0700, Jens Axboe wrote:

> CC'ing peterz for some cluebat knowledge. Peter, is there a nice way to
> currently do something like this? Only thing I'm currently aware of is
> the preempt in/out notifiers, but they don't quite provide what I need,
> since I need to pass some data (a request) as well.

Whee, nothing quite like this around, I think.

> The full detail on what I'm trying here is:
>
> io_uring can have linked requests. One obvious use case for that is to
> queue a POLLIN on a socket, and then link a read/recv to that. When the
> poll completes, we want to run the read/recv. io_uring hooks into the
> waitqueue wakeup handler to finish the poll request, and since we're
> deep in waitqueue wakeup code, it queues the linked read/recv for
> execution via an async thread. This is not optimal, obviously, as it
> relies on a switch to a new thread to perform this read. This hack
> queues a backlog to the task itself, and runs it when it's scheduled in.
> Probably want to do the same for sched out as well; currently I just
> hack that in the io_uring wait part...

I'll definitely need to think more about this, but a few comments on the
below.

> +static void __io_uring_task_handler(struct list_head *list)
> +{
> +	struct io_kiocb *req;
> +
> +	while (!list_empty(list)) {
> +		req = list_first_entry(list, struct io_kiocb, list);
> +		list_del(&req->list);
> +
> +		__io_queue_sqe(req, NULL);
> +	}
> +}
> +
> +void io_uring_task_handler(struct task_struct *tsk)
> +{
> +	LIST_HEAD(list);
> +
> +	raw_spin_lock_irq(&tsk->uring_lock);
> +	if (!list_empty(&tsk->uring_work))
> +		list_splice_init(&tsk->uring_work, &list);
> +	raw_spin_unlock_irq(&tsk->uring_lock);
> +
> +	__io_uring_task_handler(&list);
> +}

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fc1dfc007604..b60f081cac17 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>  #endif
>  
> +#ifdef CONFIG_IO_URING
> +	INIT_LIST_HEAD(&p->uring_work);
> +	raw_spin_lock_init(&p->uring_lock);
> +#endif
> +
>  #ifdef CONFIG_COMPACTION
>  	p->capture_control = NULL;
>  #endif
> @@ -3069,6 +3074,20 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
>  
>  #endif /* CONFIG_PREEMPT_NOTIFIERS */
>  
> +#ifdef CONFIG_IO_URING
> +extern void io_uring_task_handler(struct task_struct *tsk);
> +
> +static inline void io_uring_handler(struct task_struct *tsk)
> +{
> +	if (!list_empty(&tsk->uring_work))
> +		io_uring_task_handler(tsk);
> +}
> +#else /* !CONFIG_IO_URING */
> +static inline void io_uring_handler(struct task_struct *tsk)
> +{
> +}
> +#endif
> +
>  static inline void prepare_task(struct task_struct *next)
>  {
>  #ifdef CONFIG_SMP
> @@ -3322,6 +3341,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>  	balance_callback(rq);
>  	preempt_enable();
>  
> +	io_uring_handler(current);
> +
>  	if (current->set_child_tid)
>  		put_user(task_pid_vnr(current), current->set_child_tid);
>  

I suspect you meant to put that in finish_task_switch(), which is the
tail end of every schedule(); schedule_tail() is the tail end of clone().

Or maybe you meant to put it in (and rename) sched_update_worker(), which
runs after every schedule() but in a preemptible context -- much saner,
since you don't want to add an unbounded amount of work in a
non-preemptible context.

At which point you already have your callback: io_wq_worker_running().
Or is this for any random task?
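
For concreteness, the linked-request pattern described above looks
roughly like this from userspace. This is a minimal sketch using
liburing: a POLLIN poll SQE with IOSQE_IO_LINK set, followed by a recv
that only starts once the poll completes. Ring setup, error handling,
and completion reaping are elided, and queue_poll_linked_recv() is a
made-up helper name, not part of any API:

#include <errno.h>
#include <poll.h>
#include <liburing.h>

/* Hypothetical helper: queue a POLLIN on sockfd with a recv linked
 * behind it, so the recv is only started after the poll fires. */
static int queue_poll_linked_recv(struct io_uring *ring, int sockfd,
				  void *buf, unsigned int len)
{
	struct io_uring_sqe *sqe;

	/* First link member: wait for the socket to become readable. */
	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -EAGAIN;
	io_uring_prep_poll_add(sqe, sockfd, POLLIN);
	sqe->flags |= IOSQE_IO_LINK;	/* chain the next SQE behind this one */

	/* Second link member: not started until the poll completes. */
	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -EAGAIN;
	io_uring_prep_recv(sqe, sockfd, buf, len, 0);

	return io_uring_submit(ring);
}

Without the task-backlog hack, the recv half of this chain is what gets
punted to an async thread when the poll wakeup fires.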
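
The sched_update_worker() suggestion would look roughly like the sketch
below, against the scheduler code of that era (~v5.6). The exact
placement, and the reuse of io_uring_handler() from the patch above, are
assumptions for illustration, not a tested change:

/* Sketch: run the io_uring backlog from the tail of schedule(), where
 * preemption is already enabled, instead of from schedule_tail() or
 * finish_task_switch(). */
static void sched_update_worker(struct task_struct *tsk)
{
	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
		if (tsk->flags & PF_WQ_WORKER)
			wq_worker_running(tsk);
		else
			io_wq_worker_running(tsk);
	}
	/* Preemptible context: safe to run an unbounded backlog here. */
	io_uring_handler(tsk);
}

asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		__schedule(false);
		sched_preempt_enable_no_resched();
	} while (need_resched());
	sched_update_worker(tsk);	/* after __schedule(), preemption enabled */
}

This keeps the backlog off the non-preemptible finish_task_switch() path
while still running it on every return from schedule().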