On Mon, Apr 10, 2023 at 10:27 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote: > > On Sun, Apr 9, 2023 at 9:56 PM Andrei Vagin <avagin@xxxxxxxxx> wrote: > > > > On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <yu.c.chen@xxxxxxxxx> wrote: > > > > > > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote: > > > > From: Peter Oskolkov <posk@xxxxxxxxxx> > > > > > > > > Add WF_CURRENT_CPU wake flag that advices the scheduler to > > > > move the wakee to the current CPU. This is useful for fast on-CPU > > > > context switching use cases. > > > > > > > > In addition, make ttwu external rather than static so that > > > > the flag could be passed to it from outside of sched/core.c. > > > > > > > > Signed-off-by: Peter Oskolkov <posk@xxxxxxxxxx> > > > > Signed-off-by: Andrei Vagin <avagin@xxxxxxxxxx> > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) > > > > if (wake_flags & WF_TTWU) { > > > > record_wakee(p); > > > > > > > > + if ((wake_flags & WF_CURRENT_CPU) && > > > > + cpumask_test_cpu(cpu, p->cpus_ptr)) > > > > + return cpu; > > > > + > > > I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there > > > are regressions when running some workloads, and these workloads want to be > > > spreaded on idle CPUs whenever possible. > > > The reason for the regression is that, above change chooses current CPU no matter > > > what the load/utilization of this CPU is. So task are stacked on 1 CPU and hurts > > > throughput/latency. And I believe this issue would be more severe on system with > > > smaller number of CPU within 1 LLC(when compared to Intel platforms), such as AMD, > > > Arm64. > > > > WF_CURRENT_CPU works only in certain conditions. Maybe you saw my > > attempt to change how WF_SYNC works: > > > > https://www.spinics.net/lists/kernel/msg4567650.html > > > > Then we've found that this idea doesn't work well, and it is a reason > > why we have the separate WF_CURRENT_CPU flag. > > > > > > > > I know WF_CURRENT_CPU benefits seccomp, and can we make this change more genefic > > > to benefit other workloads, by making the condition to trigger WF_CURRENT_CPU stricter? > > > Say, only current CPU has 1 runnable task, and treat current CPU as the last resort by > > > checking if the wakee's previous CPU is not idle. In this way, we can enable WF_CURRENT_CPU flag > > > dynamically when some condition is met(a short task for example). > > > > We discussed all of these here and here: > > > > https://www.spinics.net/lists/kernel/msg4657545.html > > > > https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@xxxxxxxxxxxxxx/ > > > > I like your idea about short-duration tasks, but I think it is a > > separate task and it has to be done in a separate patch set. Here, I > > solve the problem of optimizing synchronous switches when one task wakes > > up another one and falls asleep immediately after that. Waking up the > > target task on the current CPU looks reasonable for a few reasons in > > this case. First, waking up a task on the current CPU is cheaper than on > > another one and it is much cheaper than waking on an idle cpu. Second, > > when tasks want to do synchronous switches, they often exchange some > > data, so memory caches can play on us. > > I've contemplated this on occasion for quite a few years, and I think > that part of our issue is that the userspace ABI part doesn't exist > widely. In particular, most of the common ways that user tasks talk > to each other don't have a single system call that can do the > send-a-message-and-start-waiting part all at once. For example, if > task A is running and it wants to wake task B and then sleep: > > UNIX sockets (or other sockets): A calls send() or write() or > sendmsg() then recv() or read() or recvmsg() or epoll_wait() or poll() > or select(). > > Pipes: Same as sockets except no send/recv. > > Shared memory: no wakeup or sleep mechanism at all. UMONITOR doesn't count :) futex-es? Here was an attempt to add FUTEX_SWAP a few years ago: https://www.spinics.net/lists/kernel/msg3871065.html It hasn't been merged to the upstream repo in favor of umcg: https://lore.kernel.org/linux-mm/20211122211327.5931-1-posk@xxxxxxxxxx/ Both these features solve similar problems, where FUTEX_SWAP is simple and straightforward but umcg is wider and more complicated. > > I think io_uring can kind of do a write-and-wait operation, but I > doubt it's wired up for this purpose. I think it may be a good candidate where this logic can be placed. > > > seccomp seems like it should be able to do this straightforwardly on > the transition from the seccomp-sandboxed task to the monitor, but the > reverse direction is tricky. > > > > Anyway, short of a massive API project, I don't see a totally > brilliant solution. But maybe we could take a baby step toward a > general solution by deferring all the hard work of a wakeup a bit so > that, as we grow syscalls and other mechanisms that do wake-and-wait, > we can optimize them automatically. For example, we could perhaps add > a pending wakeup to task_struct, kind of like: > > struct task_struct *task_to_wake; > > and then, the next time we sleep or return to usermode, we handle the > wakeup. And if we're going to sleep, we can do it as an optimized > synchronous wakeup. And if we try to wake a task while task_to_wake > is set, we just wake everything normally. I am not sure that I understand when it has to be set and when it will be in effect. For example, we want to do the pair write&read syscall. It means write sets task_to_wake, then the current task is resumed without waking the target task and only after that task_to_wake will be in effect. In other words, it has to be in effect after the next but one returns to user-mode. Thanks, Andrei > > (There are refcounting issues here, and maybe this wants to be percpu, > not per task.) > > I think it would be really nifty if Linux could somewhat reliably do > this style of synchronous con > > PeterZ, is this at all sensible or am I nuts? > > --Andy