Andy wrote:
Your patches more or less implement "don't run me unless I'm isolated". A scheduler class would be more like "isolate me (and maybe make me super high priority so it actually happens)".
Steven wrote:
Since it only makes sense to run one isolated task per cpu (not more than one on the same CPU), I wonder if we should add a new interface for this, that would force everything else off the CPU that it requests. That is, you bind a task to a CPU, and then change it to SCHED_ISOLATED (or what not), and the kernel will force all other tasks off that CPU.
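
For reference, the "bind a task to a CPU" half of this proposal can already be done from userspace with sched_setaffinity(). The following is only a minimal sketch: the helper names pin_to_first_allowed_cpu() and allowed_cpu_count() are made up for illustration, and the second step (switching to SCHED_ISOLATED) is left as a comment because no such policy exists.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/*
 * Pin the calling thread to a single CPU taken from its current
 * affinity mask. Returns the CPU number, or -1 on failure.
 * (Hypothetical helper, not part of any proposed patch.)
 */
static int pin_to_first_allowed_cpu(void)
{
	cpu_set_t allowed, one;
	int cpu;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;

	/* Pick the lowest-numbered CPU we are currently allowed to use. */
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &allowed))
			break;
	if (cpu == CPU_SETSIZE)
		return -1;

	CPU_ZERO(&one);
	CPU_SET(cpu, &one);
	if (sched_setaffinity(0, sizeof(one), &one) != 0)
		return -1;

	/*
	 * Hypothetical second step from the proposal: switch this task
	 * to a SCHED_ISOLATED policy so the kernel pushes everything
	 * else off the CPU. No such policy exists, so the sketch stops
	 * at the affinity change.
	 */
	return cpu;
}

/* Number of CPUs in the calling thread's affinity mask, or -1. */
static int allowed_cpu_count(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	return CPU_COUNT(&set);
}
```

Note that pinning alone does nothing to keep other tasks off the chosen CPU; that is exactly the part the proposed interface would add.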
Frederic wrote:
I think you'll have to make sure the task can not be concurrently reaffined to more CPUs. This may involve setting task_isolation_flags under the runqueue lock and thus move that tiny part to the scheduler code. And then we must forbid changing the affinity while the task has the isolation flag, or deactivate the flag.
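
The two alternatives in that last sentence (forbid the affinity change, or allow it and drop the flag) can be illustrated with a small userspace toy model. Everything here is an assumption made for illustration: struct toy_task, the policy enum and toy_set_affinity() are hypothetical names, not kernel code, and the runqueue locking that the comment is really about is not modeled at all.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant task_struct fields. */
struct toy_task {
	unsigned long cpus_allowed;	/* one bit per CPU */
	bool isolated;			/* models the task-isolation flag */
};

enum reaffine_policy {
	REAFFINE_FORBID,	/* reject the change while isolated */
	REAFFINE_DEACTIVATE,	/* allow the change, but drop isolation */
};

/*
 * Apply a new affinity mask to the toy task. In the kernel this check
 * would have to run under the runqueue lock; this model is
 * single-threaded and takes no lock.
 */
static int toy_set_affinity(struct toy_task *t, unsigned long new_mask,
			    enum reaffine_policy policy)
{
	if (new_mask == 0)
		return -EINVAL;

	if (t->isolated && new_mask != t->cpus_allowed) {
		if (policy == REAFFINE_FORBID)
			return -EBUSY;
		t->isolated = false;
	}

	t->cpus_allowed = new_mask;
	return 0;
}
```

With REAFFINE_FORBID an isolated task keeps both its mask and its flag and the caller sees -EBUSY; with REAFFINE_DEACTIVATE the new mask is applied and the task silently stops being isolated.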

These comments are all about the same high-level question, so I want to address it in this reply.

The question is, should TASK_ISOLATION be "polite" or "aggressive"? The original design was "polite": it worked as long as no other thing on the system tried to mess with it. The suggestions above are for an "aggressive" design.

The "polite" design basically tags a task as being interested in having the kernel help it out by staying away from it. It relies on running on a nohz_full cpu to keep scheduler ticks away from it. It relies on running on an isolcpus cpu to keep other processes from getting dynamically load-balanced onto it and messing it up. And, of course, it relies on the other applications and users running on the machine not to affinitize themselves onto its core and mess it up that way. But, as long as all those things are true, the kernel will try to help it out by never interrupting it. (And, it allows for the kernel to report when those expectations are violated.)

The "aggressive" design would have an API that said "This is my core!". The kernel would enforce keeping other processes off the core. It would require nohz_full semantics on that core. It would lock the task to that core in some way that would override attempts to reset its sched_affinity. It would do whatever else was necessary to make that core unavailable to the rest of the system.

Advantages of the "polite" design:

- No special privileges required
- As a result, no security issues to sort through (capabilities, etc.)
- Therefore easy to use when running as an unprivileged user
- Won't screw up the occasional kernel task that needs to run

Advantages of the "aggressive" design:

- Clearer that the application will get the task isolation it wants
- More reasonable that it is enforcing kernel performance tweaks on the local core (e.g.
    flushing the per-cpu LRU cache)

The "aggressive" design is certainly tempting, but it may have other negative consequences. For example, if we need to run a usermode helper process as a result of some system call, we do want to ensure that it can run, and we need to allow it to be scheduled, even if it's just a regular scheduler class thing. The "polite" design allows the usermode helper to run and just waits until it's safe for the isolated task to return to userspace. Possibly we could arrange for a SCHED_ISOLATED class to allow that kind of behavior, though I'm not familiar enough with the scheduler code to say for sure.

I think it's important that we're explicit about which of these two approaches feels like the more appropriate one. Possibly my Tilera background is part of what pushes me towards the "polite" design; we have a lot of cores, so they're a fairly cheap resource that we don't need to aggressively defend, and it's a more conservative design to enable task isolation only when all the relevant criteria have been met, rather than enforcing those criteria up front.

I think if we adopt the "aggressive" model, it would likely make sense to express it as a scheduling policy, since it would include core scheduler changes such as denying other tasks the right to call sched_setaffinity() with an affinity that includes cores currently in use by SCHED_ISOLATED tasks. This would be something pretty deeply hooked into the scheduler and therefore might require some more substantial changes. In addition, of course, there's the cost of documenting yet another scheduler policy.

In the "polite" model, we certainly could use a SCHED_ISOLATED scheduling policy (with static priority zero) to indicate task-isolation mode, rather than using prctl() to set a task_struct bit. I'm not sure how much it gains, though.
It could allow the scheduler to detect that the only "runnable" task actually didn't want to be run, and switch briefly to the idle task, but since this would likely only be for a scheduler tick or two, the power advantages are pretty minimal, and they come at the cost of a fair amount of additional complexity both in the API (documenting a new scheduler class) and in the implementation (putting new requirements into the scheduler implementations). So I'm somewhat dubious, although willing to be pushed in that direction if that's the consensus.

On balance, I still think the originally proposed direction (a "polite" task isolation mode with a prctl bit) is better than the scheduler-based alternatives that have been proposed.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com