Steven Rostedt <rostedt@xxxxxxxxxxx> writes:

> On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
> Christoph Lameter <cl@xxxxxxxxx> wrote:
>
>> On Tue, 7 Nov 2023, Ankur Arora wrote:
>>
>> > This came up in an earlier discussion (See
>> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
>> > that preempt_enable/_disable() overhead was relatively minimal.
>> >
>> > Is your point that always-on preempt_count is far too expensive?
>>
>> Yes, over the years distros have traditionally delivered their kernels
>> by default without preemption because of these issues. If the overhead
>> has been minimized then that may have changed. Even if so, there is
>> still a lot of code being generated that has questionable benefit and
>> just bloats the kernel.
>>
>> >> These are needed to avoid adding preempt_enable/disable to a lot of
>> >> primitives that are used for synchronization. You cannot remove
>> >> those without changing a lot of synchronization primitives to always
>> >> have to consider being preempted while operating.
>> >
>> > I'm afraid I don't understand why you would need to change any
>> > synchronization primitives. The code that does
>> > preempt_enable/_disable() is compiled out because
>> > CONFIG_PREEMPT_NONE/_VOLUNTARY don't define CONFIG_PREEMPT_COUNT.
>>
>> In the trivial cases it is simple like that. But look f.e. in the slub
>> allocator at the #ifdef CONFIG_PREEMPTION section. There is overhead
>> added to be able to allow the cpu to change under us. There are likely
>> other examples in the source.
>
> preempt_disable() and preempt_enable() are much lower overhead today
> than they used to be.
>
> If you are worried about changing CPUs, there's also migrate_disable()
> too.
>
>> And the whole business of local data access via per cpu areas suffers
>> if we cannot rely on two accesses in a section being able to see
>> consistent values.
>>
>> > The intent here is to always have CONFIG_PREEMPT_COUNT=y.
>>
>> Just for fun? Code is most efficient if it does not have to consider
>> too many side conditions like suddenly running on a different
>> processor. This introduces needless complexity into the code. It would
>> be better to remove PREEMPT_COUNT for good and just rely on voluntary
>> preemption. We could probably reduce the complexity of the kernel
>> source significantly.
>
> That is what caused this thread in the first place. Randomly scattered
> "preemption points" do not scale!
>
> And I'm sorry, we have latency sensitive use cases that require full
> preemption.
>
>> I have never noticed a need for preemption at every instruction in the
>> kernel (if that would be possible at all... Locks etc prevent that
>> ideal scenario frequently). Preemption like that is more like a pipe
>> dream.

The intent isn't to preempt at every other instruction in the kernel. As
Thomas describes, the idea is that in voluntary preemption kernels
rescheduling happens at cond_resched() points that have been distributed
heuristically. As a consequence you might get both too little preemption
and too much preemption. The intent is to bring preemption under the
control of the scheduler, which can do a better job than randomly placed
cond_resched() points.

>> High performance kernel solutions usually disable overhead like that.

You are also missing all the ways in which voluntary preemption points
are responsible for poor performance. For instance, if you look at
clear_huge_page(), it does a page-by-page clear with a cond_resched()
call after clearing each page.
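Roughly, the shape of that loop is something like the sketch below
(simplified, with a hypothetical function name; the real code in
mm/memory.c is more involved):

	#include <linux/mm.h>		/* clear_page(), PAGE_SIZE */
	#include <linux/sched.h>	/* cond_resched() */

	/*
	 * Clear a huge page one PAGE_SIZE chunk at a time, offering a
	 * voluntary preemption point after each chunk.
	 */
	static void clear_huge_page_sketch(void *kaddr, unsigned int npages)
	{
		unsigned int i;

		for (i = 0; i < npages; i++) {
			clear_page(kaddr + i * PAGE_SIZE);
			cond_resched();	/* heuristic preemption point */
		}
	}

Each cond_resched() bounds latency, but it also means the CPU only ever
sees one page's worth of work at a time.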
But if you can expose the full extent to the CPU, it can optimize
differently (for the 1GB page it can now elide cacheline allocation):

  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
                     (GB/s)               (GB/s)
  pg-sz=2MB          14.55                19.29            +32.5%
  pg-sz=1GB          19.34                49.60           +156.4%

(See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@xxxxxxxxxx/)

> Please read the email from Thomas:
>
>   https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
>
> This is not technically getting rid of PREEMPT_NONE. It is adding a new
> NEED_RESCHED_LAZY flag, that will have the kernel preempt only when
> entering or in user space. It will behave the same as PREEMPT_NONE, but
> without the need for all the cond_resched() scattered randomly
> throughout the kernel.

And a corollary of that is that with a scheduler-controlled PREEMPT_NONE
a task might end up running to completion where earlier it could have
been preempted early because it crossed a cond_resched().

> If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms
> at 250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you
> will preempt at the next available location (preempt_count == 0).
>
> But yes, all locations that do not explicitly disable preemption will
> now possibly preempt (due to long running kernel threads).

--
ankur