Hi Peter,

On Thu, Jan 20, 2022 at 04:55:22PM +0100, Peter Zijlstra wrote:

[...]

> +/* pre-schedule() */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +	struct umcg_task __user *self = READ_ONCE(tsk->umcg_task);
> +	int ret;
> +
> +	if (!tsk->umcg_server) {
> +		/*
> +		 * Already blocked before, the pages are unpinned.
> +		 */
> +		return;
> +	}
> +
> +	/* Must not fault, mmap_sem might be held. */
> +	pagefault_disable();
> +
> +	ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED);
> +	if (ret == -EAGAIN) {
> +		/*
> +		 * Consider:
> +		 *
> +		 *   self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
> +		 *   ...
> +		 *   sys_umcg_wait();
> +		 *
> +		 * and the '...' code doing a blocking syscall/fault. This
> +		 * ensures that returns with UMCG_TASK_RUNNING, which will make

s/UMCG_TASK_RUNNING/UMCG_TASK_RUNNABLE/

> +		 * sys_umcg_wait() return with -EAGAIN.
> +		 */
> +		ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_BLOCKED);
> +	}
> +	if (ret)
> +		UMCG_DIE_PF("state");
> +
> +	if (umcg_wake_server(tsk))
> +		UMCG_DIE_PF("wake");
> +
> +	pagefault_enable();
> +
> +	/*
> +	 * We're going to sleep, make sure to unpin the pages, this ensures
> +	 * the pins are temporary. Also see umcg_sys_exit().
> +	 */
> +	umcg_unpin_pages();
> +}

[...]

> +/* Called from syscall exit path and exceptions that can schedule */
> +void umcg_sys_exit(struct pt_regs *regs)
> +{
> +	struct task_struct *tsk = current;
> +	long syscall = syscall_get_nr(tsk, regs);
> +
> +	if (syscall == __NR_umcg_wait ||
> +	    syscall == __NR_umcg_ctl)
> +		return;
> +
> +	if (tsk->umcg_server) {
> +		/*
> +		 * Didn't block, we done.
> +		 */
> +		umcg_unpin_pages();
> +		return;
> +	}
> +
> +	umcg_unblock_and_wait();

umcg_unblock_and_wait() -> umcg_enqueue_and_wake() -> umcg_wake_server()
-> umcg_wake_task(tsk->umcg_server, ...)
tsk->umcg_server is NULL here, and umcg_wake_task() uses it to update
the state via umcg_update_state(NULL, ...). That means the
tsk->umcg_clock access there will do something I cannot predict.

There are two callers of umcg_unblock_and_wait(): umcg_register(),
where the server is set, and umcg_sys_exit(), where it is not.
Maybe use a bool to indicate whether the server is set?

> +}

[...]

> +/**
> + * sys_umcg_wait: transfer running context
> + *
> + * Called like:
> + *
> + *   self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
> + *   ...
> + *   sys_umcg_wait(0, time);
> + *
> + * The syscall will clear TF_COND_WAIT and wait until state becomes RUNNING.
> + * The code '...' must not contain syscalls
> + *
> + * If self->next_tid is set and indicates a valid UMCG task with RUNNABLE state
> + * that task will be made RUNNING and woken -- transfering the running context
> + * to that task. In this case self->next_tid is modified with TID_RUNNING to
> + * indicate self->next_tid is consumed.
> + *
> + * If self->next has TID_RUNNING set, it is validated the related task has

s/self->next/self->next_tid/

Things that were not clear to me are clear now. Nice.

Thanks,
Tao