Re: [PATCH 09/39] sched: Add @reason to sched_class->rq_{on|off}line()

Tejun Heo <tj@xxxxxxxxxx> · Mon, 24 Jun 2024 11:18:06 -1000

Hello, Peter.

On Mon, Jun 24, 2024 at 01:32:12PM +0200, Peter Zijlstra wrote:
> On Wed, May 01, 2024 at 05:09:44AM -1000, Tejun Heo wrote:
> > ->rq_{on|off}line are called either during CPU hotplug or cpuset partition
> > updates. A planned BPF extensible sched_class wants to tell the BPF
> > scheduler progs about CPU hotplug events in a way that's synchronized with
> > rq state changes.
> > 
> > As the BPF scheduler progs aren't necessarily affected by cpuset partition
> > updates, we need a way to distinguish the two types of events. Let's add an
> > argument to tell them apart.
> 
> That would be a bug. Must not be able to ignore partitions.

So, first of all, this implementation was brittle in assuming that CPU
hotplug events would be delivered first, and it broke after recent cpuset
changes. In v7, it's replaced by hooks in sched_cpu_[de]activate(), which has
the extra benefit of allowing the BPF hotplug methods to be sleepable.
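
To illustrate the difference, here is a rough userspace model, not the
actual v7 code. Every name in it (hotplug_ops, model_cpu_activate() and so
on) is made up for the sketch. It assumes only what's described above: the
activate / deactivate paths notify the scheduler directly, while a cpuset
partition update cycles the rq offline and online without producing hotplug
notifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Hypothetical stand-in for a BPF scheduler's hotplug methods. */
struct hotplug_ops {
	void (*cpu_online)(int cpu);
	void (*cpu_offline)(int cpu);
};

bool cpu_active[NR_CPUS];	/* modelled per-CPU active state */
int nr_online_calls;		/* hotplug online notifications seen */
int nr_offline_calls;		/* hotplug offline notifications seen */
int nr_rq_cycles;		/* rq offline/online cycles from partition updates */

void demo_cpu_online(int cpu)
{
	nr_online_calls++;
	printf("scheduler notified: cpu %d online\n", cpu);
}

void demo_cpu_offline(int cpu)
{
	nr_offline_calls++;
	printf("scheduler notified: cpu %d offline\n", cpu);
}

const struct hotplug_ops demo_ops = {
	.cpu_online	= demo_cpu_online,
	.cpu_offline	= demo_cpu_offline,
};

/* Models the activate path: mark the CPU active, then notify directly. */
void model_cpu_activate(const struct hotplug_ops *ops, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_active[cpu] = true;
	if (ops->cpu_online)
		ops->cpu_online(cpu);
}

/* Models the deactivate path: mark the CPU inactive, then notify directly. */
void model_cpu_deactivate(const struct hotplug_ops *ops, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_active[cpu] = false;
	if (ops->cpu_offline)
		ops->cpu_offline(cpu);
}

/*
 * Models a cpuset partition update: the rq goes offline and back online,
 * but no hotplug method is invoked, so nothing needs to tell the two
 * kinds of events apart.
 */
void model_partition_update(void)
{
	nr_rq_cycles++;
}
```

Calling model_cpu_activate(&demo_ops, 1), then model_partition_update(), then
model_cpu_deactivate(&demo_ops, 1) yields one online and one offline
notification, and the partition update in between yields none.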

Taking a step back to the sched domains. They don't translate well to
sched_ext schedulers where task to CPU associations are often more dynamic
(e.g. multiple CPUs sharing a task queue) and load balancing operations can
be implemented pretty differently from CFS. The benefit of exposing sched
domains directly to the BPF schedulers is unclear as most of the relevant
information can be obtained from userspace already.

The cgroup support side isn't fully developed yet (e.g. cpu.weight is
available but I haven't added cpu.max yet) and plans can always change but I
was thinking of taking an approach similar to cpu.weight for cpuset's
isolation features - i.e. give the BPF scheduler a way to access the user's
configuration and let it implement whatever it wants to implement.

Thanks.

-- 
tejun