Re: [PATCH 09/39] sched: Add @reason to sched_class->rq_{on|off}line()

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Wed, 26 Jun 2024 10:23:42 +0200

On Tue, Jun 25, 2024 at 01:41:01PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jun 25, 2024 at 10:29:26AM +0200, Peter Zijlstra wrote:
> ...
> > > Taking a step back to the sched domains. They don't translate well to
> > > sched_ext schedulers where task to CPU associations are often more dynamic
> > > (e.g. multiple CPUs sharing a task queue) and load balancing operations can
> > > be implemented pretty differently from CFS. The benefit of exposing sched
> > > domains directly to the BPF schedulers is unclear as most of the relevant
> > > information can be obtained from userspace already.
> > 
> > Either which way around you want to turn it, you must not violate
> > partitions. If a bpf thing isn't capable of handling partitions, you
> > must refuse loading it when a partition exists and equally disallow
> > creation of partitions when it does load.
> > 
> > For partitions specifically, you only need the root_domain, not the full
> > sched_domain trees.
> > 
> > I'm aware you have these shared runqueues, but you don't *have* to do
> > that. Esp. so if the user explicitly requested partitions.
> 
> As a quick workaround, I can just disallow / eject the BPF scheduler when
> partitioning is configured. However, I think I'm still missing something and
> would appreciate it if you can fill me in.
> 
> Abiding by core scheduling configuration is critical because it has direct
> user visible and security implications and this can be tested from userspace
> - are two threads which shouldn't be on the same core on the same core or
> not? So, the violation condition is pretty clear.
> 
> However, I'm not sure how partitioning is similar.

I'm not sure what you mean. It's like violating the cpumask, probably
not a big deal, but against the express wishes of the user.

> My understanding is that it
> works as a barrier for the load balancer. LB on this side can't look there
> and LB on that side can't look here. However, isn't the impact purely
> performance / isolation difference? 

Yes. But this isolation is very important to some people.

> IOW, let's say you load a BPF scheduler
> which consumes the partition information but doesn't do anything differently
> based on it. cpumasks are still enforced the same and I can't think of
> anything which userspace would be able to test to tell whether partitioning
> is working or not.

So barring a few caveats it really boils down to a task staying in the
partition it's part of. If you ever see it leave, you know you got a
problem.

Now, there's a bunch of ways to actually create partitions:

 - cpuset
 - cpuset-v2
 - isolcpus boot crap

And they're all subtly different, but IIRC the cpuset ones are
simplest since the task is part of a cgroup and the cgroup cpumask is
imposed on them and things should be fairly straightforward.

The isolcpus thing creates a pile of single CPU partitions and people
have to manually set cpu-affinity, and here we have some hysterical
behaviour that I would love to change but have not yet dared do --
because I know there's people doing dodgy things because they've been
sending 'bug' reports.

Specifically it is possible to set a cpumask that spans multiple
partitions :-( Traditionally the behaviour was that it would place the
task on the lowest cpu number; the current behaviour is that the task is
placed randomly on any CPU in the given mask.

It is my opinion that both behaviours are correct, since after all, we
don't violate the given constraint, the user provided mask. If that's
not what you wanted, you should be setting something else etc..

I've proposed rejecting a cpumask that spans partitions -- I've not yet
done this, because clearly people are doing this, however misguided. But
perhaps we should just bite the bullet and cause pain -- dunno.
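
To make the "mask spans partitions" case concrete, here is a rough
userspace sketch, not kernel code. It models each partition and the
affinity mask as a 64-bit bitmap, one bit per CPU. The type and function
names (cpumask_model_t, partitions_touched(), affinity_spans_partitions())
are made up for illustration and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* Count how many of the given partitions the affinity mask intersects. */
static size_t partitions_touched(cpumask_model_t affinity,
				 const cpumask_model_t *parts,
				 size_t nr_parts)
{
	size_t i, n = 0;

	for (i = 0; i < nr_parts; i++) {
		if (affinity & parts[i])
			n++;
	}
	return n;
}

/*
 * True when the affinity mask reaches into more than one partition,
 * which is the case a stricter policy could reject.
 */
static bool affinity_spans_partitions(cpumask_model_t affinity,
				      const cpumask_model_t *parts,
				      size_t nr_parts)
{
	return partitions_touched(affinity, parts, nr_parts) > 1;
}
```

With isolcpus-style single CPU partitions { 0x1, 0x2, 0x4, 0x8 }, a mask
of 0x6 touches two partitions and would be flagged, while 0x4 stays
within one.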

Anyway, tl;dr, you can have a cpumask wider than a partition and people
still not wanting migrations to happen.

> If the only difference partitions make is on performance. 

People explicitly did not want migrations there -- otherwise they would
not have gone to the trouble of setting up the partitions in the first
place.

> While it would
> make sense to communicate partitions to the BPF scheduler, would it make
> sense to reject BPF scheduler based on it? ie. Assuming that the feature is
> implemented, what would distinguish between one BPF scheduler which handles
> partitions specially and the other which doesn't care?

Correctness? Anyway, can't you handle this in the kernel part, simply
never allow a shared runqueue to cross a root_domain's mask and put some
WARNs on to ensure constraints are respected etc.? It should be fairly
simple to check that prev_cpu and new_cpu have the same root_domain, for
instance.
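
Something along these lines, as a userspace model only. The per-CPU
table, NR_CPUS_MODEL, nr_violations and both function names are
assumptions for illustration; in the kernel the comparison would
presumably be between the root_domain pointers of the two CPUs'
runqueues, and the counter stands in for a WARN.

```c
#include <stdbool.h>

#define NR_CPUS_MODEL 8

/*
 * Hypothetical mapping of CPU number to root domain id:
 * CPUs 0-3 form one partition, CPUs 4-7 another.
 */
static const int cpu_root_domain[NR_CPUS_MODEL] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Stand-in for a WARN: counts migrations that crossed a partition. */
static unsigned int nr_violations;

static bool cpu_valid(int cpu)
{
	return cpu >= 0 && cpu < NR_CPUS_MODEL;
}

/* True when both CPUs are valid and belong to the same root domain. */
static bool same_root_domain(int prev_cpu, int new_cpu)
{
	if (!cpu_valid(prev_cpu) || !cpu_valid(new_cpu))
		return false;
	return cpu_root_domain[prev_cpu] == cpu_root_domain[new_cpu];
}

/*
 * Check a proposed move from prev_cpu to new_cpu. Returns true when the
 * move stays inside the partition; otherwise records a violation.
 */
static bool check_migration(int prev_cpu, int new_cpu)
{
	if (same_root_domain(prev_cpu, new_cpu))
		return true;
	nr_violations++;
	return false;
}
```

Here a move from CPU 1 to CPU 3 passes, while a move from CPU 3 to CPU 4
is counted as a violation.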