On 05/28/24 13:46, Tejun Heo wrote:
> Hello,
>
> BTW, David is off for the week and might be a bit slow to respond. I just
> want to comment on one part.
>
> On Mon, May 27, 2024 at 10:25:40PM +0100, Qais Yousef wrote:
> ...
> > And I can only share my experience, I don't think the algorithm itself is the
> > bottleneck here. The devil is in the corner cases. And these are hard to deal
> > with without explicit hints.
>
> Our perceptions of the scope of the problem space seem very different. To
> me, it seems pretty unexplored. Here's just one area: Constantly increasing
> number of cores and popularization of more complex cache hierarchies.
>
> Over a hundred CPUs in a system is fairly normal now with a couple layers of
> cache hierarchy. Once we have so many, things can look a bit different from
> the days when we had a few. Flipping the approach so that we can dynamically
> assign close-by CPUs to related groups of threads becomes attractive.

I actually had this use case in mind for the sched-qos [1] idea I am trying
to develop. There are workloads that can benefit if 2 or 3 tasks are kept
within the closest cache, and I think we can describe that with a hint. I was
thinking of borrowing the cookie concept from core scheduling to tag a group
of tasks via the hint, and then trying to find a reasonable higher-level
behavior that we can translate correctly onto different systems.

>
> e.g. If you have a bunch of services which aren't latency critical but are
> needed to maintain system integrity (updates, monitoring, security and so
> on), soft-affining them to a number of CPUs while allowing some CPU headroom
> can give you noticeable gain both in performance (partly from cleaner
> caches) and power consumption while not adding that much to latency. This is
> something the scheduler can and, I believe, should do transparently.

This looks similar to what I am trying to do with uclamp_max and extending
the load balancer to balance workloads based on power - while keeping in mind
freeing resources for tasks that need performance too. I don't think we can
fix this problem at wake-up balance only. The system is in constant flux, and
we need the load balancer to make corrections when other things wake up, and
to make better decisions in general.

Generally, if we had an EAS type of behavior available for SMP systems -
where we don't distribute by default but try to pack based on compute demand,
plus a hint to tell us that some tasks really want to be spread because
packing hurts them - I think we'd be in a much better place to distribute
resources like you describe.
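To make the uclamp_max part a bit more concrete: the userspace side of such a
hint already exists via sched_setattr(). A rough, untested sketch of capping
a background task's performance request (glibc has no wrapper, so the uapi
struct is redefined here; please double check it against
include/uapi/linux/sched/types.h on your tree):

/*
 * Rough sketch, untested: cap the calling task's requested performance
 * with uclamp_max via sched_setattr(), leaving policy and priority alone.
 */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields, unused here */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* Utilization clamps, in [0, 1024] */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		/* Only change the clamp, keep policy/params as they are */
		.sched_flags	= SCHED_FLAG_KEEP_POLICY |
				  SCHED_FLAG_KEEP_PARAMS |
				  SCHED_FLAG_UTIL_CLAMP_MAX,
		/* Cap requested performance at ~25% of capacity (1024) */
		.sched_util_max	= 256,
	};

	if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}
	return 0;
}

The kernel side - teaching the load balancer to treat a low uclamp_max as
permission to pack for power rather than spread - is the part that still
needs work, and is what I was referring to above.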
>
> It's not obvious how to do it though. It doesn't quite fit the current LB
> model. cgroup hierarchy seems to provide some hints on how threads can be
> grouped but the boundaries might not match that well. Even if we figure out

cgroups are too aggressive IMHO; we really need per-task hints. It's coarse-
vs fine-grained hinting: there's only so much classification you can give to
a large group of tasks, especially if you can't control the codebase of that
group. Some people do get invested in tuning specific apps, but that is
fragile and doesn't scale.

> how to define these groups, figuring out group-vs-group competition isn't
> trivial (naive load-sums don't work when comparing across groups spanning
> multiple CPUs).

I think the implementation is trickier than the definition. There's a lot of
pressure to keep the fast path as fast as possible, and making smarter
decisions will get expensive.

Personally I think today we have an abundance of compute power, and the
challenge is how to distribute resources smartly, which justifies slowing
things down in favour of making better choices. But I don't know how much of
that we can afford, to be honest. Generally, as I was telling David, the
people who tend to come forward to support or complain are those who have
pure throughput in mind. Maybe I am wrong, but my perception is that a lot of
decisions were biased this way. We need to be more vocal about our needs to
make sure that things move in the right direction. It's hard to help a use
case or fix a problem when you don't know about it.

>
> Also, what about the threads with oddball cpumasks? Should we begin to treat
> CPUs more like other resources, e.g., memory? We don't generally allow
> applications to specify which specific physical pages they get because that
> doesn't buy anything while adding a lot of constraints. If we have dozens
> and hundreds of CPUs, are there fundamental reason to view them differently
> from other resources which are treated fungible?

I'd be more than happy to see affinity and cpuset disappear :) But I fear it
might be a little too late... Can't some SELinux rule or syscall filter be
used to block userspace from playing with affinity? I'm assuming you're not
referring to in-kernel usage of affinity, which might be worth scrutinizing
too; but we have more control over that in general and can improve it when a
problem arises.
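For the syscall filter idea, I was thinking of something as dumb as the
sketch below (assuming libseccomp, and completely untested), run by whatever
launches the application so that sched_setaffinity() simply fails with
-EPERM:

/*
 * Rough sketch, untested: deny sched_setaffinity() for a program and its
 * children via seccomp. Build with -lseccomp. Note this also breaks
 * pthread_setaffinity_np(), which sits on top of the same syscall.
 * libseccomp enables no_new_privs by default, so no privileges needed.
 */
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	scmp_filter_ctx ctx;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
		return 1;
	}

	/* Allow everything by default, then carve out affinity changes. */
	ctx = seccomp_init(SCMP_ACT_ALLOW);
	if (!ctx)
		return 1;

	/* Return -EPERM instead of killing, so apps keep limping along. */
	if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
			     SCMP_SYS(sched_setaffinity), 0) ||
	    seccomp_load(ctx)) {
		seccomp_release(ctx);
		return 1;
	}
	seccomp_release(ctx);

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

Whether anyone would actually deploy something like this is another matter,
of course - hence my fear that it might be a little too late.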
>
> The claim that the current scheduler has the fundamentals all figured out
> and it's mostly about handling edge cases and educating users seems wildly
> off mark to me.

I don't think anyone claimed that. But EEVDF (and CFS before it) is about how
tasks enqueued on a CPU are ordered and run; it's not about selecting which
CPU to run the task on. EAS modifies the selection algorithm (which is not
what David was talking about, IIUC). It seems your problems are more with CPU
selection then?

>
> Maybe we can develop all that in the current framework in a gradual fashion,
> but when the problem space is so wide open, that is not a good approach to
> take. The cost of constricting is likely significantly higher than the
> benefits of having a single code base. Imagine having to develop all the
> features of btrfs in the ext2 code base. It's probably doable, at least
> theoretically, but that would have been massively stifling, maybe to the
> point of most of it not happening.
>
> To the above particular problem of soft-affinity, scx_layered has something

What does "layered" refer to here? Is it akin to different sched classes?

> really simple and dumb implemented and we're testing and deploying it in the
> fleet with noticeable perf gains, and there are early efforts to see whether
> we can automatically figure out grouping based on the cgroup hierarchy and
> possibly minimal xattr hints on them.
>
> I don't yet know what generic form soft-affinity should take eventually,
> but, with sched_ext, we have a way to try out different ideas in production
> and iterate on them learning each step of the way. Given how generic both
> the problem and benefits from solving it are, we'll have to reach some
> generic solution at one point. Maybe it will come from sched_ext or maybe it
> will come from people working on fair like yourself. Either way, sched_ext
> is already showing us what can be achieved and prodding people towards
> solving it.

To be honest, this doesn't look any different from all the hacks out there
that do the same thing. The path I see this going down is the same one I
mentioned above, where some people manually tune for a specific usage. I
really struggle to see how this will be applicable more widely later; all I
see is divergence and parallel universes - which ultimately hurts the user,
as Linux behavior is just not predictable. This Linus rant [2] is relevant
here: people who write applications will just find that Linux is not
reliable, because every system behaves differently.

[1] https://lore.kernel.org/lkml/20230916213316.p36nhgnibsidoggt@airbuntu/
[2] https://lore.kernel.org/lkml/CAHk-=wgtb7y-bEh7tPDvDWru7ZKQ8-KMjZ53Tsk37zsPPdwXbA@xxxxxxxxxxxxxx/

Thanks!

--
Qais Yousef