On Wed, Dec 02 2020 at 17:43, Christoph Lameter wrote:
> On Wed, 2 Dec 2020, Thomas Gleixner wrote:
>
>> prctl() is the right thing to do.
>
> Ok great consensus on that one.

That's the easy part :)

>> The current CPU isolation is a best effort approach and I agree that for
>> more strict isolation modes we need to be able to enforce that and hunt
>> down offenders and think about them one by one.
>
> There are two approaches actually to make the OS quiet. One is the best
> effort approach which is more like the current NOHZ one with additional
> actions to flush things. The other is the strict approach where one wants a
> guarantee that the OS does not do anything at all.

And here the consensus stops again :)

The point is that there is a huge difference between the relaxed best
effort / heuristics based scenario and the 'user space task asks for
absolute silence' scenario.

Is this really a black and white decision? Definitely not. That would be
again an imposed policy decision which is wrong to begin with. We burnt
ourselves with that over and over, so can we please, even if it's just for
this particular problem, learn from history?

The kernel provides mechanisms but does not impose policies unless there
is no other choice. And as we know that there are quite some shades of
grey, there is lots of choice and we need to come up with solutions for
delegating the policy decision to the user/admin and not just provide an
off/on knob.

This 'isolate either perhaps or everything' approach is just wrong. The
partisan thinking is obviously popular in the US, but it has no business
in making technically sensible decisions.

>> So you say some code can tolerate a few interrupts, then comes Alex and
>> says 'no disturbance' at all.
>
> Yes that is the current NOHZ approach. You switch it on and after the OS
> detects a polling loop it will quiet things down. Instead of detecting
> it we are actively telling the OS to quiet down now.

Kinda.
We want to provide mechanisms to quiet certain aspects of the OS and to
enable enforcement of that, but again, that's not on/off; it has to be
configurable / selectable.

Again: I fundamentally disagree with the proposed task isolation patches
approach as they leave no choice at all.

There is a reasonable middle ground where an application is willing to
pay the price (delay) until the requested quiescing has taken place in
order to run undisturbed (hint: cache ...) and also is willing to take
the additional overhead of an occasional syscall in the slow path without
tripping some OS imposed isolation safe guard.

Aside from that, such a granular approach does not necessarily require the
application to be aware of it. If the admin knows the computational
pattern of the application, e.g.

   1 read_data_set()    <- involving syscalls/OS obviously
   2 compute_set()      <- let me alone
   3 save_data_set()    <- involving syscalls/OS obviously

   repeat the above...

then it's at his discretion to decide to inflict a particular isolation
set on the task which is obviously ineffective while doing #1 and #3 but
might provide the so desired 0.9% boost for compute_set() which
dominates the judgement.

That's what we need to think about and once we have figured out how to do
that it gives Marcelo the mechanism to solve his 'run virt undisturbed
by vmstat or whatever' problem and it allows Alex to build his stuff on
it.

Summary: The problem to be solved cannot be restricted to

    self_defined_important_task(OWN_WORLD);

Policy is not a binary on/off problem. It's manifold across all levels
of the stack and only a kernel problem when it comes down to the last
line of defence.

Up to the point where the kernel puts the line of last defence, policy
is defined by the user/admin via mechanisms provided by the kernel.

Emphasis on "mechanisms provided by the kernel", aka. user API.

Just in case, I hope that I don't have to explain what level of scrutiny
and thought this requires.

Thanks,

        tglx
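
For illustration only: the read/compute/save pattern above can be modelled
as a per-task feature set chosen by the admin rather than a single on/off
switch. The sketch below is a toy user space simulation in Python, not a
kernel interface. The names Quiesce, Task and run_iteration, and the
individual feature flags, are made up for this example and do not
correspond to any existing prctl() option.

```python
from enum import Flag


class Quiesce(Flag):
    """Hypothetical quiescing features; illustrative names, not a kernel API."""
    TICK = 1       # stop the periodic tick
    VMSTAT = 2     # flush/defer vmstat updates
    ENFORCE = 4    # count any kernel entry during the quiet phase as a violation


class Task:
    """Toy model of a task whose isolation set is selected by the admin."""

    def __init__(self, policy):
        self.policy = policy        # decided by user/admin, not by the kernel
        self.active = Quiesce(0)    # features quiesced right now (none)
        self.violations = 0

    def syscall(self, name):
        # Entering the kernel ends the quiet phase. It only counts as a
        # violation if enforcement was part of the selected set.
        if Quiesce.ENFORCE in self.active:
            self.violations += 1
        self.active = Quiesce(0)
        return name

    def compute(self):
        # The requested quiescing takes effect before the compute phase;
        # the task pays that delay once per iteration.
        self.active = self.policy
        return self.active


def run_iteration(task):
    task.syscall("read_data_set")    # 1: involves syscalls/OS
    quiet = task.compute()           # 2: let me alone
    task.syscall("save_data_set")    # 3: involves syscalls/OS
    return quiet
```

With TICK | VMSTAT selected, the occasional syscall in the slow path is
simply tolerated. Adding ENFORCE turns the same syscall into a counted
violation. Which of the two is wanted is exactly the policy choice that
stays with the user/admin.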