Ankur Arora <ankur.a.arora@xxxxxxxxxx> writes:

> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
>
>> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>>
>>> +	 * We might race with the target CPU while checking its ct_state:
>>> +	 *
>>> +	 * 1. The task might have just entered the kernel, but has not yet
>>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>>> +	 * send an unnecessary resched-IPI.
>>> +	 *
>>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>>> +	 * not yet called user_enter().
>>> +	 *
>>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>>> +	 * schedule it lazily. There's obviously nothing that will handle
>>> +	 * this need-resched bit until the thread enters the kernel next.
>>> +	 *
>>> +	 * The scheduler will still do tick accounting, but a potentially
>>> +	 * higher priority task waited to be scheduled for a user tick,
>>> +	 * instead of execution time in the kernel.
>>> +	 */
>>> +	context = ct_state_cpu(cpu_of(rq));
>>> +	if ((context == CONTEXT_USER) ||
>>> +	    (context == CONTEXT_GUEST)) {
>>> +
>>> +		rs = RESCHED_eager;
>>> +		goto resched;
>>> +	}
>>
>> Like said, this simply cannot be. You must not rely on the remote CPU
>> being in some state or not. Also, it's racy, you could observe USER and
>> then it enters KERNEL.
>
> Or worse. We might observe KERNEL and it enters USER.
>
> I think we would be fine if we observe USER: it would be upgrade
> to RESCHED_eager and send an unnecessary IPI.
>
> But if we observe KERNEL and it enters USER, then we will have
> set the need-resched-lazy bit which the thread might not see
> (it might have left exit_to_user_mode_loop()) until the next
> entry to the kernel.
>
> But, yes I would like to avoid the ct_state as well. But
> need-resched-lazy only makes sense when the task on the runqueue
> is executing in the kernel...

So, I discussed this with Thomas offlist, and he pointed out that I'm
overengineering this.

If we decide to wake up a remote rq lazily with (!sched_feat(TTWU_QUEUE)),
and if the target is running in user space, then the resched would
happen when the process enters kernel mode.

That's somewhat similar to how, in this preemption model, we let a task
run for up to one extra tick while in kernel mode.

So I'll drop this and allow the same behaviour in userspace instead of
solving it in unnecessarily complicated ways.

--
ankur