On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> > > no-internal-tasks constraints.  Do exclusive cgroups still exist in
> > > cgroup2?  Could we perhaps just remove that capability entirely?  I've
> > > never understood what problem exclusive cpusets and such solve that
> > > can't be more comprehensibly solved by just assigning the cpusets the
> > > normal inclusive way.
> >
> > Without exclusive sets we cannot split the sched_domain structure.
> > Which leads to not being able to actually partition things. That would
> > break DL for one.
>
> Can you sketch out a toy example?

[ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]

  mkdir /cpuset
  mount -t cgroup -o cpuset none /cpuset

  mkdir /cpuset/A
  mkdir /cpuset/B

  cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
  echo 0 > /cpuset/A/cpuset.mems

  cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
  echo 1 > /cpuset/B/cpuset.mems

  # move all movable tasks into A
  cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks; done

  # kill machine-wide load-balancing
  echo 0 > /cpuset/cpuset.sched_load_balance

  # now place 'special' tasks in B

This partitions the scheduler in two, one for each node. Hereafter no
task will be moved from one node to the other. The load-balancer is
split in two as well: one balances in A, one balances in B, nothing
crosses. (It is important that A.cpus and B.cpus do not intersect.)

Ideally no task would remain in the root group; back in the day we
could actually do this (with the exception of the CPU-bound kernel
threads), but this has significantly regressed :-(
(I still hate the workqueue affinity interface.)

As is, tasks that are left in the root group get balanced within
whatever domain they ended up in.

> And what's DL?
SCHED_DEADLINE. It's a 'global'-EDF-like scheduler that doesn't
support CPU affinities (because that doesn't make sense). The only way
to restrict it is to partition. 'Global' because you can partition it:
if you reduce your system to single-CPU partitions you reduce it to
P-EDF.

(The same is true of SCHED_FIFO: that's a 'global'-FIFO on the same
partition scheme. It does, however, support sched_affinity, but using
it gives 'interesting' schedulability results -- call it a historic
accident.)

Note that, related but different, we have the isolcpus boot parameter,
which creates single-CPU partitions for all listed CPUs and gives the
rest to the root cpuset. Ideally we'd kill this option, given that it
is a boot-time setting for something which is trivial to do at
runtime. But this cannot be done, because it would mean we'd have to
start with a !0 cpuset layout:

                 '/'
            load_balance=0
           /              \
      'system'        'isolated'
   cpus=~isolcpus    cpus=isolcpus
                     load_balance=0

And start with _everything_ in the /system group (including default
IRQ affinities). Of course, that would break everything cgroup :-(

--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html