Steven Rostedt <rostedt@xxxxxxxxxxx> writes:

> On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
> Christoph Lameter <cl@xxxxxxxxx> wrote:
>
>> On Tue, 7 Nov 2023, Ankur Arora wrote:
>>
>> > This came up in an earlier discussion (See
>> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
>> > that preempt_enable/_disable() overhead was relatively minimal.
>> >
>> > Is your point that always-on preempt_count is far too expensive?
>>
>> Yes, over the years distros have traditionally delivered their kernels
>> by default without preemption because of these issues. If the overhead
>> has been minimized then that may have changed. Even if so, there is
>> still a lot of code being generated that has questionable benefit and
>> just bloats the kernel.
>>
>> >> These are needed to avoid adding preempt_enable/disable to a lot of
>> >> primitives that are used for synchronization. You cannot remove
>> >> those without changing a lot of synchronization primitives to always
>> >> have to consider being preempted while operating.
>> >
>> > I'm afraid I don't understand why you would need to change any
>> > synchronization primitives. The code that does
>> > preempt_enable/_disable() is compiled out because
>> > CONFIG_PREEMPT_NONE/_VOLUNTARY don't define CONFIG_PREEMPT_COUNT.
>>
>> In the trivial cases it is simple like that. But look f.e. in the slub
>> allocator at the #ifdef CONFIG_PREEMPTION section. There is overhead
>> added to be able to allow the cpu to change under us. There are likely
>> other examples in the source.
>
> preempt_disable() and preempt_enable() are much lower overhead today
> than they used to be.
>
> If you are worried about changing CPUs, there's also migrate_disable()
> too.
>
>> And the whole business of local data access via per cpu areas suffers
>> if we cannot rely on two accesses in a section being able to see
>> consistent values.
>>
>> > The intent here is to always have CONFIG_PREEMPT_COUNT=y.
>>
>> Just for fun? Code is most efficient if it does not have to consider
>> too many side conditions like suddenly running on a different
>> processor. This introduces needless complexity into the code. It would
>> be better to remove PREEMPT_COUNT for good and just rely on voluntary
>> preemption. We could probably reduce the complexity of the kernel
>> source significantly.
>
> That is what caused this thread in the first place. Randomly scattered
> "preemption points" do not scale!
>
> And I'm sorry, we have latency sensitive use cases that require full
> preemption.
>
>> I have never noticed a need for preemption at every instruction in the
>> kernel (if that would be possible at all... Locks etc prevent that
>> ideal scenario frequently). Preemption like that is more like a pipe
>> dream.

The intent isn't to preempt at every other instruction in the kernel. As
Thomas describes, the idea is that in voluntary preemption kernels
rescheduling happens at cond_resched() points that have been distributed
heuristically. As a consequence you might get both too little preemption
and too much preemption. The intent is to bring preemption under the
control of the scheduler, which can do a better job than randomly placed
cond_resched() points.

>> High performance kernel solutions usually disable overhead like that.

You are also missing all the ways in which voluntary preemption points
are responsible for poor performance. For instance, if you look at
clear_huge_page(), it does a page-by-page clear with a cond_resched()
call after clearing each page.
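Roughly, the shape of that loop is something like the sketch below
(simplified, with a hypothetical function name; the real code in
mm/memory.c is more involved):

	#include <linux/mm.h>		/* clear_page(), PAGE_SIZE */
	#include <linux/sched.h>	/* cond_resched() */

	/*
	 * Clear a huge page one PAGE_SIZE chunk at a time, offering a
	 * voluntary preemption point after each chunk.
	 */
	static void clear_huge_page_sketch(void *kaddr, unsigned int npages)
	{
		unsigned int i;

		for (i = 0; i < npages; i++) {
			clear_page(kaddr + i * PAGE_SIZE);
			cond_resched();	/* heuristic preemption point */
		}
	}

Each cond_resched() bounds latency, but it also means the CPU only ever
sees one page's worth of work at a time.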
But if you can expose the full extent to the CPU, it can optimize
differently (for the 1GB page it can now elide cacheline allocation):

  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
                     (GB/s)               (GB/s)
  pg-sz=2MB          14.55                19.29            +32.5%
  pg-sz=1GB          19.34                49.60           +156.4%

(See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@xxxxxxxxxx/)

> Please read the email from Thomas:
>
>   https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
>
> This is not technically getting rid of PREEMPT_NONE. It is adding a new
> NEED_RESCHED_LAZY flag, that will have the kernel preempt only when
> entering or in user space. It will behave the same as PREEMPT_NONE, but
> without the need for all the cond_resched() scattered randomly
> throughout the kernel.

And a corollary of that is that with a scheduler-controlled PREEMPT_NONE
a task might end up running to completion where earlier it could have
been preempted early because it crossed a cond_resched().

> If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms
> at 250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you
> will preempt at the next available location (preempt_count == 0).
>
> But yes, all locations that do not explicitly disable preemption will
> now possibly preempt (due to long running kernel threads).

--
ankur