Re: [RFC PATCH 00/86] Make the kernel preemptible

On 07.11.23 22:56, Ankur Arora wrote:
Hi,

We have two models of preemption: voluntary and full (and RT, which is
a fuller form of full preemption.) In this series -- which is based
on Thomas' PoC (see [1]) -- we try to unify the two by letting the
scheduler enforce policy for the voluntary preemption models as well.

(Note that this is about preemption when executing in the kernel.
Userspace is always preemptible.)

Background
==

Why? Both of these preemption mechanisms are almost entirely disjoint.
There are four main sets of preemption points in the kernel:

  1. return to user
  2. explicit preemption points (cond_resched() and its ilk)
  3. return to kernel (tick/IPI/irq at irqexit)
  4. end of non-preemptible sections at (preempt_count() == preempt_offset)

Voluntary preemption uses mechanisms 1 and 2. Full preemption
uses 1, 3 and 4. In addition, both use cond_resched_{rcu,lock,rwlock*}(),
which can be all things to all people because they internally
combine 2 and 4.
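
For illustration, the usual cond_resched_lock() pattern looks roughly
like the sketch below (the lock and the loop state are hypothetical;
the loop body is a stand-in for real work):

    #include <linux/sched.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(scan_lock);          /* hypothetical lock */
    static unsigned long scan_pos;              /* hypothetical state */

    static void scan_under_lock(unsigned long nr_entries)
    {
            spin_lock(&scan_lock);
            while (scan_pos < nr_entries) {
                    scan_pos++;                 /* stand-in for real work */

                    /*
                     * If a reschedule is pending, drop the lock, schedule
                     * out (2 and 4 combined), then retake the lock.
                     */
                    cond_resched_lock(&scan_lock);
            }
            spin_unlock(&scan_lock);
    }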

Now, since there's no ideal placement of explicit preemption points,
they tend to be randomly spread over the code and accumulate over time,
as they are added whenever latency problems are seen. Plus, fear of
regressions makes them difficult to remove.
(Presumably, asymptotically they would spread out evenly across the
instruction stream!)

In voluntary models, the scheduler's job is to match the demand
side of preemption points (a task that needs to be scheduled) with
the supply side (a task which calls cond_resched()).
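
A minimal sketch of the supply side (the loop body is a hypothetical
stand-in for real work):

    #include <linux/sched.h>

    static u64 checksum;                        /* hypothetical state */

    static void crunch_items(unsigned long nr)
    {
            unsigned long i;

            for (i = 0; i < nr; i++) {
                    checksum += i;              /* stand-in for real work */

                    /*
                     * Supply meets demand: this schedules out only if the
                     * scheduler has already set TIF_NEED_RESCHED on the
                     * current task.
                     */
                    cond_resched();
            }
    }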

Full preemption models track the preemption count, so the scheduler
always knows whether it is safe to preempt and can drive preemption
itself (e.g. via the preemption points in 3.)
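
For contrast, a minimal sketch of the non-preemptible section that full
preemption keys off (the per-CPU counter is a hypothetical example):

    #include <linux/percpu.h>
    #include <linux/preempt.h>

    static DEFINE_PER_CPU(unsigned long, stat_count);  /* hypothetical */

    static void bump_stat(void)
    {
            preempt_disable();          /* preempt_count()++ */
            __this_cpu_inc(stat_count);
            preempt_enable();           /* preempt_count()--; if it drops to
                                         * zero with a reschedule pending,
                                         * that is preemption point 4 above. */
    }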

Design
==

As Thomas outlines in [1], to unify the preemption models we want to
always have preempt_count enabled and let the scheduler drive preemption
policy based on the model in effect.

Policies:

- preemption=none: run to completion
- preemption=voluntary: run to completion, unless a task of higher
   sched-class awaits
- preemption=full: optimized for low-latency. Preempt whenever a higher
   priority task awaits.

To do this, add a new flag, TIF_NEED_RESCHED_LAZY, which allows the
scheduler to mark that a reschedule is needed but deferred until the
task finishes executing in the kernel -- voluntary preemption
as it were.

The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.

          ret-to-user    ret-to-kernel    preempt_count()
none           Y              N                N
voluntary      Y              Y                Y
full           Y              Y                Y
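
The policy choice would roughly live in resched_curr() (see the patch
"sched: handle resched policy in resched_curr()"). A purely illustrative
sketch, not the code in the series, leaving out the real policy details
(sched classes, RT, etc.); TIF_NEED_RESCHED_LAZY is the flag introduced
here:

    #include <linux/sched.h>

    static void resched_curr_sketch(struct task_struct *curr)
    {
            /*
             * Sketch only: under preempt=full, set TIF_NEED_RESCHED so the
             * task is preempted at ret-to-kernel or when preempt_count()
             * drops to zero; under none/voluntary, mark it lazily so it is
             * only acted on at the next return to user.
             */
            if (preempt_model_full())
                    set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
            else
                    set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
    }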


There's just one remaining issue: now that explicit preemption points are
gone, processes that spend a long time in the kernel have no way to give
up the CPU.

For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.

For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
if a task marked TIF_NEED_RESCHED_LAZY hasn't scheduled out by the next tick.
(This causes preemption either at ret-to-kernel or, if the task is in
a non-preemptible section, when it exits that section.)
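
A rough sketch of that upgrade (illustrative only, not the code from the
patch "sched: force preemption on tick expiration"):

    #include <linux/sched.h>

    /*
     * Sketch: if the task was marked lazily and is still on the CPU at
     * the next tick, escalate so it is preempted at the next safe point.
     */
    static void tick_escalate_lazy(struct task_struct *curr)
    {
            if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY)) {
                    clear_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
                    set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
            }
    }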

Arguably this provides a much more consistent maximum latency (~2 tick
lengths + the length of the non-preemptible section), compared to the old
model where the maximum latency depended on the dynamic distribution of
cond_resched() points.

(As a bonus it handles, completely trivially, code that is preemptible but
  cannot call cond_resched(): e.g. long-running Xen hypercalls, or the series
  which started this discussion:
  https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@xxxxxxxxxx/)


Status
==

What works:
  - The system seems to keep ticking over with the normal scheduling policies
    (SCHED_OTHER). (Support for the realtime policies is somewhat more
    half-baked.)
  - The basic performance numbers seem pretty close to the 6.6-rc7 baseline

What's broken:
  - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
    lazily")
  - Non-x86 architectures. It's trivial to support other archs (we only need
    to add TIF_NEED_RESCHED_LAZY) but I wanted to hold off until I got some
    comments on the series.
    (From some testing on arm64, I didn't find any surprises.)
  - livepatch: livepatch depends on using _cond_resched() to provide
    low-latency patching. That is obviously difficult with cond_resched()
    gone. We could get a similar effect by using a static_key in
    preempt_enable(), but at least with inline locks that might end
    up bloating the kernel quite a bit. (A rough sketch of the static_key
    idea follows this list.)
  - Documentation/ and comments mention cond_resched()
  - ftrace support for need-resched-lazy is incomplete
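
A very rough sketch of the static_key idea for livepatch mentioned above
(everything here is hypothetical -- the key name, the placement, and the
transition hook):

    #include <linux/jump_label.h>
    #include <linux/preempt.h>

    DEFINE_STATIC_KEY_FALSE(klp_resched_key);   /* hypothetical key */

    static __always_inline void preempt_enable_klp(void)
    {
            preempt_enable();
            /* A patched-out branch until livepatch flips the key. */
            if (static_branch_unlikely(&klp_resched_key))
                    klp_help_transition();      /* hypothetical hook */
    }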

What needs more discussion:
  - Should cond_resched_lock() etc. schedule out only for TIF_NEED_RESCHED,
    or for TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
    change to tif_need_resched(resched_t)")
  - Tracking whether a task is in userspace or in the kernel (See patch 40
    "context_tracking: add ct_state_cpu()")
  - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
    preemption")


Performance
==

Expectation:

* perf bench sched pipe

preemption               full           none

6.6-rc7              6.68 +- 0.10   6.69 +- 0.07
+series              6.69 +- 0.12   6.67 +- 0.10

This is rescheduling out of idle which should and does perform identically.

* schbench, preempt=none

   * 1 group, 16 threads each

                  6.6-rc7      +series
                  (usecs)      (usecs)
      50.0th:         6            6
      90.0th:         8            7
      99.0th:        11           11
      99.9th:        15           14

   * 8 groups, 16 threads each

                 6.6-rc7       +series
                  (usecs)      (usecs)
      50.0th:         6            6
      90.0th:         8            8
      99.0th:        12           11
      99.9th:        20           21


* schbench, preempt=full

   * 1 group, 16 threads each

                 6.6-rc7       +series
                 (usecs)       (usecs)
      50.0th:         6            6
      90.0th:         8            7
      99.0th:        11           11
      99.9th:        14           14


   * 8 groups, 16 threads each

                 6.6-rc7       +series
                 (usecs)       (usecs)
      50.0th:         7            7
      90.0th:         9            9
      99.0th:        12           12
      99.9th:        21           22


   Not much in it either way.

* kernbench, preempt=full

   * half-load (-j 128)

            6.6-rc7                                    +series

   wall        149.2  +-     27.2             wall        132.8  +-     0.4
   utime      8097.1  +-     57.4             utime      8088.5  +-    14.1
   stime      1165.5  +-      9.4             stime      1159.2  +-     1.9
   %cpu       6337.6  +-   1072.8             %cpu       6959.6  +-    22.8
   csw      237618    +-   2190.6             csw      240343    +-  1386.8


   * optimal-load (-j 1024)

            6.6-rc7                                    +series

   wall        137.8 +-       0.0             wall       137.7  +-       0.8
   utime     11115.0 +-    3306.1             utime    11041.7  +-    3235.0
   stime      1340.0 +-     191.3             stime     1323.1  +-     179.5
   %cpu       8846.3 +-    2830.6             %cpu      9101.3  +-    2346.7
   csw     2099910   +- 2040080.0             csw    2068210    +- 2002450.0


   The preempt=full path should effectively see no change in
   behaviour. The optimal-load numbers are pretty much identical.
   For the half-load, however, the +series version does much better, but that
   seems to be because of much higher run-to-run variability in the 6.6-rc7 runs.

* kernbench, preempt=none

   * half-load (-j 128)

            6.6-rc7                                    +series

   wall        134.5  +-      4.2             wall        133.6  +-     2.7
   utime      8093.3  +-     39.3             utime      8099.0  +-    38.9
   stime      1175.7  +-     10.6             stime      1169.1  +-     8.4
   %cpu       6893.3  +-    233.2             %cpu       6936.3  +-   142.8
   csw      240723    +-    423.0             csw      173152    +-  1126.8

   * optimal-load (-j 1024)

            6.6-rc7                                    +series

   wall        139.2 +-       0.3             wall       138.8  +-       0.2
   utime     11161.0 +-    3360.4             utime    11061.2  +-    3244.9
   stime      1357.6 +-     199.3             stime     1366.6  +-     216.3
   %cpu       9108.8 +-    2431.4             %cpu      9081.0  +-    2351.1
   csw     2078599   +- 2013320.0             csw    1970610    +- 1969030.0


   For both of these, the wallclock, utime, stime etc. are pretty much
   identical. The one interesting difference is that the number of
   context switches is lower. This intuitively makes sense given that
   we now reschedule threads lazily rather than rescheduling whenever we
   encounter a cond_resched() and there's a thread wanting to be scheduled.

   The max-load numbers (not posted here) also behave similarly.


Series
==

With that, this is how the series is laid out:

  - Patches 01-30: revert the PREEMPT_DYNAMIC code. Most of the infrastructure
    used by that is built around static_call(), and this is a simpler approach
    which doesn't need any of that (and does away with cond_resched().)

    Some of the commits will be resurrected.
        089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout")
        cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors")
        5693fa74f98a ("kcsan: Use preemption model accessors")

  - Patches 31-45: contain the scheduler changes to do this. Of these
    the critical ones are:
      patch 35 "thread_info: change to tif_need_resched(resched_t)"
      patch 41 "sched: handle resched policy in resched_curr()"
      patch 43 "sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models"
      patch 44 "sched: voluntary preemption"
       (this needs more work to decide when a higher sched-policy task
        should preempt a lower sched-policy task)
      patch 45 "preempt: ARCH_NO_PREEMPT only preempts lazily"

  - Patches 47-50: contain RCU related changes. RCU now works in both
    PREEMPT_RCU=y and PREEMPT_RCU=n modes with CONFIG_PREEMPTION.
    (Until now PREEMPTION=y => PREEMPT_RCU)

  - Patches 51-56,86: contain cond_resched() related cleanups.
      patch 54 "sched: add cond_resched_stall()" adds a new cond_resched()
      interface. Pitchforks?

  - Patches 57-86: remove cond_resched() from the tree.


Also at: github.com/terminus/linux preemption-rfc


Please review.

Thanks
Ankur

[1] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/


Ankur Arora (86):
   Revert "riscv: support PREEMPT_DYNAMIC with static keys"
   Revert "sched/core: Make sched_dynamic_mutex static"
   Revert "ftrace: Use preemption model accessors for trace header
     printout"
   Revert "preempt/dynamic: Introduce preemption model accessors"
   Revert "kcsan: Use preemption model accessors"
   Revert "entry: Fix compile error in
     dynamic_irqentry_exit_cond_resched()"
   Revert "livepatch,sched: Add livepatch task switching to
     cond_resched()"
   Revert "arm64: Support PREEMPT_DYNAMIC"
   Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
   Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
     GENERIC_ENTRY"
   Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
   Revert "sched/preempt: Refactor sched_dynamic_update()"
   Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
   Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
   Revert "preempt: Restore preemption model selection configs"
   Revert "sched: Provide Kconfig support for default dynamic preempt
     mode"
   sched/preempt: remove PREEMPT_DYNAMIC from the build version
   Revert "preempt/dynamic: Fix typo in macro conditional statement"
   Revert "sched,preempt: Move preempt_dynamic to debug.c"
   Revert "static_call: Relax static_call_update() function argument
     type"
   Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
   Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
   Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
   Revert "sched: Harden PREEMPT_DYNAMIC"
   Revert "sched: Add /debug/sched_preempt"
   Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
     option"
   Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
     call"
   Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
     calls"
   Revert "preempt/dynamic: Provide cond_resched() and might_resched()
     static calls"
   Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
   x86/thread_info: add TIF_NEED_RESCHED_LAZY
   entry: handle TIF_NEED_RESCHED_LAZY
   entry/kvm: handle TIF_NEED_RESCHED_LAZY
   thread_info: accessors for TIF_NEED_RESCHED*
   thread_info: change to tif_need_resched(resched_t)
   entry: irqentry_exit only preempts TIF_NEED_RESCHED
   sched: make test_*_tsk_thread_flag() return bool
   sched: *_tsk_need_resched() now takes resched_t
   sched: handle lazy resched in set_nr_*_polling()
   context_tracking: add ct_state_cpu()
   sched: handle resched policy in resched_curr()
   sched: force preemption on tick expiration
   sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
   sched: voluntary preemption
   preempt: ARCH_NO_PREEMPT only preempts lazily
   tracing: handle lazy resched
   rcu: select PREEMPT_RCU if PREEMPT
   rcu: handle quiescent states for PREEMPT_RCU=n
   osnoise: handle quiescent states directly
   rcu: TASKS_RCU does not need to depend on PREEMPTION
   preempt: disallow !PREEMPT_COUNT or !PREEMPTION
   sched: remove CONFIG_PREEMPTION from *_needbreak()
   sched: fixup __cond_resched_*()
   sched: add cond_resched_stall()
   xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq()
   xarray: use cond_resched_xas*()
   coccinelle: script to remove cond_resched()
   treewide: x86: remove cond_resched()
   treewide: rcu: remove cond_resched()
   treewide: torture: remove cond_resched()
   treewide: bpf: remove cond_resched()
   treewide: trace: remove cond_resched()
   treewide: futex: remove cond_resched()
   treewide: printk: remove cond_resched()
   treewide: task_work: remove cond_resched()
   treewide: kernel: remove cond_resched()
   treewide: kernel: remove cond_reshed()
   treewide: mm: remove cond_resched()
   treewide: io_uring: remove cond_resched()
   treewide: ipc: remove cond_resched()
   treewide: lib: remove cond_resched()
   treewide: crypto: remove cond_resched()
   treewide: security: remove cond_resched()
   treewide: fs: remove cond_resched()
   treewide: virt: remove cond_resched()
   treewide: block: remove cond_resched()
   treewide: netfilter: remove cond_resched()
   treewide: net: remove cond_resched()
   treewide: net: remove cond_resched()
   treewide: sound: remove cond_resched()
   treewide: md: remove cond_resched()
   treewide: mtd: remove cond_resched()
   treewide: drm: remove cond_resched()
   treewide: net: remove cond_resched()
   treewide: drivers: remove cond_resched()
   sched: remove cond_resched()

I'm missing the removal of the Xen parts, which were one of the reasons to start
this whole work (xen_in_preemptible_hcall etc.).


Juergen
