On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
>> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz@xxxxxxxxxxxxx> wrote:
>> >
>> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
>> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
>> > > no-internal-tasks constraints. Do exclusive cgroups still exist in
>> > > cgroup2? Could we perhaps just remove that capability entirely? I've
>> > > never understood what problem exclusive cpusets and such solve that
>> > > can't be solved more comprehensibly by just assigning the cpusets the
>> > > normal inclusive way.
>> >
>> > Without exclusive sets we cannot split the sched_domain structure,
>> > which leads to not being able to actually partition things. That
>> > would break DL, for one.
>>
>> Can you sketch out a toy example?
>
> [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]
>
>   mkdir /cpuset
>
>   mount -t cgroup -o cpuset none /cpuset
>
>   mkdir /cpuset/A
>   mkdir /cpuset/B
>
>   cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
>   echo 0 > /cpuset/A/cpuset.mems
>
>   cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
>   echo 1 > /cpuset/B/cpuset.mems
>
>   # move all movable tasks into A
>   cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done
>
>   # kill machine-wide load-balancing
>   echo 0 > /cpuset/cpuset.sched_load_balance
>
>   # now place 'special' tasks in B
>
> This partitions the scheduler into two, one for each node.
>
> Hereafter no task will be moved from one node to another. The
> load-balancer is split in two: one balances in A, one balances in B,
> nothing crosses. (It is important that A.cpus and B.cpus do not
> intersect.)
>
> Ideally no task would remain in the root group; back in the day we
> could actually do this (with the exception of the CPU-bound kernel
> threads), but this has significantly regressed :-(
> (still hate the workqueue affinity interface)

I wonder if we could address this by creating (automatically at boot,
or when the cpuset controller is enabled, or whatever) a
/cpuset/random_kernel_shit cgroup and have all of the unmovable tasks
land there?

> As is, tasks that are left in the root group get balanced within
> whatever domain they ended up in.
>
>> And what's DL?
>
> SCHED_DEADLINE, it's a 'Global'-EDF-like scheduler that doesn't support
> CPU affinities (because that doesn't make sense). The only way to
> restrict it is to partition.
>
> 'Global' because you can partition it. If you reduce your system to
> single-CPU partitions you'll reduce to P-EDF.
>
> (The same is true of SCHED_FIFO; that's a 'Global'-FIFO on the same
> partition scheme. It does support sched_affinity, but using it gives
> 'interesting' schedulability results -- call it a historic accident.)

Hmm, I didn't realize that the deadline scheduler was global. But ISTM
requiring the use of "exclusive" to get this working is unfortunate.
What if a user wants two separate partitions, one using CPUs 1 and 2
and the other using CPUs 3 and 4 (with 5 reserved for non-RT stuff)?
Shouldn't we be able to have a cgroup for each of the DL partitions and
do something to tell the deadline scheduler "here is your domain"?
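Concretely, something like this is what I have in mind, modeled on your
example above and assuming the cpuset hierarchy is already mounted at
/cpuset. The cpuset names, CPU numbers, and mem node are all made up,
and I'm not at all sure this is the blessed way to carve out the
domains -- that's partly my question:

  # hypothetical layout: CPUs 1-2 and 3-4 for the two DL partitions,
  # everything else on CPUs 0 and 5; single memory node assumed
  mkdir /cpuset/dl-a /cpuset/dl-b /cpuset/other

  echo 1-2 > /cpuset/dl-a/cpuset.cpus
  echo 3-4 > /cpuset/dl-b/cpuset.cpus
  echo 0,5 > /cpuset/other/cpuset.cpus
  for g in dl-a dl-b other; do echo 0 > /cpuset/$g/cpuset.mems; done

  # this is where the "exclusive" requirement bites, if I understand it;
  # maybe disjoint cpus plus the root load_balance=0 below is enough
  echo 1 > /cpuset/dl-a/cpuset.cpu_exclusive
  echo 1 > /cpuset/dl-b/cpuset.cpu_exclusive

  # kill machine-wide balancing so the disjoint children become partitions
  echo 0 > /cpuset/cpuset.sched_load_balance

  # move everything movable into 'other', then put DL tasks in dl-a / dl-b
  cat /cpuset/tasks | while read t; do echo $t > /cpuset/other/tasks; done

What I'd like is to get the same effect without the user having to know
about "exclusive" at all.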
> Note that, relatedly but differently, we have the isolcpus boot
> parameter, which creates single-CPU partitions for all listed CPUs and
> gives the rest to the root cpuset. Ideally we'd kill this option,
> given it's a boot-time setting (for something which is trivial to do
> at runtime).
>
> But this cannot be done, because that would mean we'd have to start
> with a !0 cpuset layout:
>
>                  '/'
>            load_balance=0
>             /          \
>      'system'        'isolated'
>   cpus=~isolcpus    cpus=isolcpus
>                     load_balance=0
>
> And start with _everything_ in the /system group (including default
> IRQ affinities).
>
> Of course, that will break everything cgroup :-(

I would actually *much* prefer this over the status quo. I'm tired of
my crappy, partially-working script that sits there and creates exactly
this configuration (minus the isolcpus part, because I actually want
migration to work) on boot. A sketch of what it does is below.

(Actually, it could have two automatic cgroups: /kernel and /init --
init and UMH would go in /init, and kernel threads and such would go in
/kernel. Userspace would be able to request that a different cgroup be
used for newly-created kernel threads.)

Heck, even systemd would probably prefer this. Then it could cleanly
expose a "slice" or whatever it's called for random kernel shit, and at
least you could configure it meaningfully.
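For reference, the script I mentioned amounts to roughly the following.
The CPU numbers are made up, error handling is omitted, and kernel
threads that refuse to move just stay in the root:

  mkdir /cpuset
  mount -t cgroup -o cpuset none /cpuset

  mkdir /cpuset/system /cpuset/isolated

  echo 0-3 > /cpuset/system/cpuset.cpus      # housekeeping CPUs (made up)
  echo 4-7 > /cpuset/isolated/cpuset.cpus    # CPUs to keep quiet (made up)
  echo 0 > /cpuset/system/cpuset.mems
  echo 0 > /cpuset/isolated/cpuset.mems

  # no machine-wide balancing; each child still balances internally,
  # which is the "I actually want migration to work" part
  echo 0 > /cpuset/cpuset.sched_load_balance

  # herd everything that will move into 'system'; bound kernel threads
  # fail the write and stay put in the root
  cat /cpuset/tasks | while read t; do
      echo $t > /cpuset/system/tasks 2>/dev/null
  done

Having the kernel set up an equivalent layout itself would let me delete
all of this.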