Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

Andrei Vagin <avagin@xxxxxxxxxx> · Wed, 12 Apr 2023 12:38:37 -0700

On Mon, Apr 10, 2023 at 10:27 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Sun, Apr 9, 2023 at 9:56 PM Andrei Vagin <avagin@xxxxxxxxx> wrote:
> >
> > On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> > >
> > > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > > From: Peter Oskolkov <posk@xxxxxxxxxx>
> > > >
> > > > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > > context switching use cases.
> > > >
> > > > In addition, make ttwu external rather than static so that
> > > > the flag could be passed to it from outside of sched/core.c.
> > > >
> > > > Signed-off-by: Peter Oskolkov <posk@xxxxxxxxxx>
> > > > Signed-off-by: Andrei Vagin <avagin@xxxxxxxxxx>
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > >       if (wake_flags & WF_TTWU) {
> > > >               record_wakee(p);
> > > >
> > > > +             if ((wake_flags & WF_CURRENT_CPU) &&
> > > > +                 cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > +                     return cpu;
> > > > +
> > > I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there
> > > are regressions when running some workloads, and these workloads want to be
> > > spreaded on idle CPUs whenever possible.
> > > The reason for the regression is that, above change chooses current CPU no matter
> > > what the load/utilization of this CPU is. So task are stacked on 1 CPU and hurts
> > > throughput/latency. And I believe this issue would be more severe on system with
> > > smaller number of CPU within 1 LLC(when compared to Intel platforms), such as AMD,
> > > Arm64.
> >
> > WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
> > attempt to change how WF_SYNC works:
> >
> > https://www.spinics.net/lists/kernel/msg4567650.html
> >
> > Then we've found that this idea doesn't work well, and it is a reason
> > why we have the separate WF_CURRENT_CPU flag.
> >
> > >
> > > I know WF_CURRENT_CPU benefits seccomp, and can we make this change more genefic
> > > to benefit other workloads, by making the condition to trigger WF_CURRENT_CPU stricter?
> > > Say, only current CPU has 1 runnable task, and treat current CPU as the last resort by
> > > checking if the wakee's previous CPU is not idle. In this way, we can enable WF_CURRENT_CPU flag
> > > dynamically when some condition is met(a short task for example).
> >
> > We discussed all of these here and here:
> >
> > https://www.spinics.net/lists/kernel/msg4657545.html
> >
> > https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@xxxxxxxxxxxxxx/
> >
> > I like your idea about short-duration tasks, but I think it is a
> > separate task and it has to be done in a separate patch set. Here, I
> > solve the problem of optimizing synchronous switches when one task wakes
> > up another one and falls asleep immediately after that. Waking up the
> > target task on the current CPU looks reasonable for a few reasons in
> > this case. First, waking up a task on the current CPU is cheaper than on
> > another one and it is much cheaper than waking on an idle cpu.  Second,
> > when tasks want to do synchronous switches, they often exchange some
> > data, so memory caches can play on us.
>
> I've contemplated this on occasion for quite a few years, and I think
> that part of our issue is that the userspace ABI part doesn't exist
> widely.  In particular, most of the common ways that user tasks talk
> to each other don't have a single system call that can do the
> send-a-message-and-start-waiting part all at once.  For example, if
> task A is running and it wants to wake task B and then sleep:
>
> UNIX sockets (or other sockets): A calls send() or write() or
> sendmsg() then recv() or read() or recvmsg() or epoll_wait() or poll()
> or select().
>
> Pipes: Same as sockets except no send/recv.
>
> Shared memory: no wakeup or sleep mechanism at all. UMONITOR doesn't count :)

futex-es? Here was an attempt to add FUTEX_SWAP a few years ago:
https://www.spinics.net/lists/kernel/msg3871065.html

It hasn't been merged to the upstream repo in favor of umcg:
https://lore.kernel.org/linux-mm/20211122211327.5931-1-posk@xxxxxxxxxx/

Both these features solve similar problems, where FUTEX_SWAP is simple
and straightforward
but umcg is wider and more complicated.

>
> I think io_uring can kind of do a write-and-wait operation, but I
> doubt it's wired up for this purpose.

I think it may be a good candidate where this logic can be placed.

>
>
> seccomp seems like it should be able to do this straightforwardly on
> the transition from the seccomp-sandboxed task to the monitor, but the
> reverse direction is tricky.
>
>
>
> Anyway, short of a massive API project, I don't see a totally
> brilliant solution.  But maybe we could take a baby step toward a
> general solution by deferring all the hard work of a wakeup a bit so
> that, as we grow syscalls and other mechanisms that do wake-and-wait,
> we can optimize them automatically.  For example, we could perhaps add
> a pending wakeup to task_struct, kind of like:
>
> struct task_struct *task_to_wake;
>
> and then, the next time we sleep or return to usermode, we handle the
> wakeup.  And if we're going to sleep, we can do it as an optimized
> synchronous wakeup.  And if we try to wake a task while task_to_wake
> is set, we just wake everything normally.

I am not sure that I understand when it has to be set and when it will
be in effect. For example, we want to do the pair write&read syscall. It
means write sets task_to_wake, then the current task is resumed without waking
the target task and only after that task_to_wake will be in effect.
In other words,
it has to be in effect after the next but one returns to user-mode.

Thanks,
Andrei

>
> (There are refcounting issues here, and maybe this wants to be percpu,
> not per task.)
>
> I think it would be really nifty if Linux could somewhat reliably do
> this style of synchronous con
>
> PeterZ, is this at all sensible or am I nuts?
>
> --Andy