Re: [GIT PULL] fsverity fixes for v6.3-rc4

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 23 Mar 2023 11:04:25 -0700

On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Thanks for the pointers. They all seem plausible symptoms of work items
> getting bounced across slow cache boundaries. I'm off for a few weeks so
> can't really dig in right now but will get to it afterwards.

So just as a gut feeling, I suspect that one solution would be to
always *start* the work on the local CPU (where "local" might be the
same, or at least a sibling).

The only reason to migrate to another CPU would be if the work is
CPU-intensive, and I do suspect that is commonly not really the case.

And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage,
and should just be gotten rid of, because what could be considered
"CPU intensive" in under one situation might not be CPU intensive in
another one, so trying to use some static knowledge about it is just
pure guess-work.

The different situations might be purely contextual things ("heavy
network traffic when NAPI polling kicks in"), but it might also be
purely hardware-related (ie "this is heavy if we don't have CPU hw
acceleration for crypto, but cheap if we do").

So I really don't think it should be some static decision, either
through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on
first available CPU".

Wouldn't it be much nicer if we just noticed it dynamically, and
WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another
CPU if it ends up being advantageous?

And we actually kind of have that dynamic flag already, in the form of
the scheduler. It might even be explicit in the context of the
workqueue (with "need_resched()" being true and the workqueue code
itself might notice it and explicitly then try to spread it out), but
with preemption it's more implicit and maybe it needs a bit of
tweaking help.

So that's what I mean by "start the work as local CPU work" - use that
as the baseline decision (since it's going to be the case that has
cache locality), and actively try to avoid spreading things out unless
we have an explicit reason to, and that reason we could just get from
the scheduler.

The worker code already has that "wq_worker_sleeping()" callback from
the scheduler, but that only triggers when a worker is going to sleep.
I'm saying that the "scheduler decided to schedule out a worker" case
might be used as a "Oh, this is CPU intensive, let's try to spread it
out".

See what I'm trying to say?

And yes, the WQ_UNBOUND case does have a weak "prefer local CPU" in
how it basically tends to try to pick the current CPU unless there is
some active reason not to (ie the whole "wq_select_unbound_cpu()"
code), but I suspect that is then counter-acted by the fact that it
will always pick the workqueue pool by node - so the "current CPU"
ends up probably being affected by what random CPU that pool was
running on.

An alternative to any scheduler interaction thing might be to just
tweak "first_idle_worker()". I get the feeling that that choice is
just horrid, and that is another area that could really try to take
locality into account. insert_work() realyl seems to pick a random
worker from the pool - which is fine when the pool is per-cpu, but
when it's the unbound "random node" pool, I really suspect that it
might be much better to try to pick a worker that is on the right cpu.

But hey, I may be missing something. You know this code much better than I do.

                  Linus