[PATCH RFC 0/4] static key support for error injection functions

Vlastimil Babka <vbabka@xxxxxxx> · Fri, 31 May 2024 11:33:31 +0200

Incomplete, help needed from ftrace/kprobe and bpf folks.

As previously mentioned by myself [1] and others [2] the functions
designed for error injection can bring visible overhead in fastpaths
such as slab or page allocation, because even if nothing hooks into them
at a given moment, they are noninline function calls regardless of
CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
always available for fault injection") and af3b854492f3
("mm/page_alloc.c: allow error injection").

Live patching their callsites has been also suggested in both [1] and
[2] threads, and this is an attempt to do that with static keys that
guard the call sites. When disabled, the error injection functions still
exist and are noinline, but are not being called. Any of the existing
mechanisms that can inject errors should make sure to enable the
respective static key. I have added that support to some of them but
need help with the others.

- the legacy fault injection, i.e. CONFIG_FAILSLAB and
  CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the
  address of the static key if it exists. The key will be activated if the
  fault injection probability becomes non-zero, and deactivated in the
  opposite transition. This also removes the overhead of the evaluation
  (on top of the noninline function call) when these mechanisms are
  configured in the kernel but unused at the moment.

- the generic error injection using kretprobes with
  override_function_with_return is handled in patch 2. The
  ALLOW_ERROR_INJECTION() annotation is extended so that static key
  address can be passed, and the framework controls it when error
  injection is enabled or disabled in debugfs for the function.

There are two more users I know of but am not familiar enough to fix up
myself. I hope people that are more familiar can help me here.

- ftrace seems to be using override_function_with_return from
  #define ftrace_override_function_with_return but I found no place
  where the latter is used. I assume it might be hidden behind more
  macro magic? But the point is if ftrace can be instructed to act like
  an error injection, it would also have to use some form of metadata
  (from patch 2 presumably?) to get to the static key and control it.

  If ftrace can only observe the function being called, maybe it
  wouldn't be wrong to just observe nothing if the static key isn't
  enabled because nobody is doing the fault injection?

- bpftrace, as can be seen from the example in commit 4f6923fbb352
  description. I suppose bpf is already aware what functions the
  currently loaded bpf programs hook into, so that it could look up the
  static key and control it. Maybe using again the metadata from patch 2,
  or extending its own, as I've noticed there's e.g. BTF_ID(func,
  should_failslab)

Now I realize maybe handling this at the k(ret)probe level would be
sufficient for all cases except the legacy fault injection from Patch 1?
Also wanted to note that by AFAIU by using the static_key_slow_dec/inc
API (as done in patches 1/2) should allow all mechanisms to coexist
naturally without fighting each other on the static key state, and also
handle the reference count for e.g. active probes or bpf programs if
there's no similar internal mechanism.

Patches 3 and 4 implement the static keys for the two mm fault injection
sites in slab and page allocators. For a quick demonstration I've run a
VM and the simple test from [1] that stresses the slab allocator and got
this time before the series:

real    0m8.349s
user    0m0.694s
sys     0m7.648s

with perf showing

   0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
   0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page                                                                                                                                                                                        ▒

And after the series

real    0m7.924s
user    0m0.727s
sys     0m7.191s

and the functions gone from perf report.

There might be other such fault injection callsites in hotpaths of other
subsystems but I didn't search for them at this point.

[1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccbd7f@xxxxxxx/
[2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/

Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
---
Vlastimil Babka (4):
      fault-inject: add support for static keys around fault injection sites
      error-injection: support static keys around injectable functions
      mm, slab: add static key for should_failslab()
      mm, page_alloc: add static key for should_fail_alloc_page()

 include/asm-generic/error-injection.h | 13 ++++++++++-
 include/asm-generic/vmlinux.lds.h     |  2 +-
 include/linux/error-injection.h       |  9 +++++---
 include/linux/fault-inject.h          |  7 +++++-
 kernel/fail_function.c                | 22 +++++++++++++++---
 lib/error-inject.c                    |  6 ++++-
 lib/fault-inject.c                    | 43 ++++++++++++++++++++++++++++++++++-
 mm/fail_page_alloc.c                  |  3 ++-
 mm/failslab.c                         |  2 +-
 mm/internal.h                         |  2 ++
 mm/page_alloc.c                       | 11 ++++++---
 mm/slab.h                             |  3 +++
 mm/slub.c                             | 10 +++++---
 13 files changed, 114 insertions(+), 19 deletions(-)
---
base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
change-id: 20240530-fault-injection-statickeys-66b7222e91b7

Best regards,
-- 
Vlastimil Babka <vbabka@xxxxxxx>