Incomplete, help needed from ftrace/kprobe and bpf folks. As previously mentioned by myself [1] and others [2] the functions designed for error injection can bring visible overhead in fastpaths such as slab or page allocation, because even if nothing hooks into them at a given moment, they are noninline function calls regardless of CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab always available for fault injection") and af3b854492f3 ("mm/page_alloc.c: allow error injection"). Live patching their callsites has been also suggested in both [1] and [2] threads, and this is an attempt to do that with static keys that guard the call sites. When disabled, the error injection functions still exist and are noinline, but are not being called. Any of the existing mechanisms that can inject errors should make sure to enable the respective static key. I have added that support to some of them but need help with the others. - the legacy fault injection, i.e. CONFIG_FAILSLAB and CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the address of the static key if it exists. The key will be activated if the fault injection probability becomes non-zero, and deactivated in the opposite transition. This also removes the overhead of the evaluation (on top of the noninline function call) when these mechanisms are configured in the kernel but unused at the moment. - the generic error injection using kretprobes with override_function_with_return is handled in patch 2. The ALLOW_ERROR_INJECTION() annotation is extended so that static key address can be passed, and the framework controls it when error injection is enabled or disabled in debugfs for the function. There are two more users I know of but am not familiar enough to fix up myself. I hope people that are more familiar can help me here. - ftrace seems to be using override_function_with_return from #define ftrace_override_function_with_return but I found no place where the latter is used. I assume it might be hidden behind more macro magic? But the point is if ftrace can be instructed to act like an error injection, it would also have to use some form of metadata (from patch 2 presumably?) to get to the static key and control it. If ftrace can only observe the function being called, maybe it wouldn't be wrong to just observe nothing if the static key isn't enabled because nobody is doing the fault injection? - bpftrace, as can be seen from the example in commit 4f6923fbb352 description. I suppose bpf is already aware what functions the currently loaded bpf programs hook into, so that it could look up the static key and control it. Maybe using again the metadata from patch 2, or extending its own, as I've noticed there's e.g. BTF_ID(func, should_failslab) Now I realize maybe handling this at the k(ret)probe level would be sufficient for all cases except the legacy fault injection from Patch 1? Also wanted to note that by AFAIU by using the static_key_slow_dec/inc API (as done in patches 1/2) should allow all mechanisms to coexist naturally without fighting each other on the static key state, and also handle the reference count for e.g. active probes or bpf programs if there's no similar internal mechanism. Patches 3 and 4 implement the static keys for the two mm fault injection sites in slab and page allocators. For a quick demonstration I've run a VM and the simple test from [1] that stresses the slab allocator and got this time before the series: real 0m8.349s user 0m0.694s sys 0m7.648s with perf showing 0.61% nonexistent [kernel.kallsyms] [k] should_failslab.constprop.0 0.00% nonexistent [kernel.kallsyms] [k] should_fail_alloc_page ▒ And after the series real 0m7.924s user 0m0.727s sys 0m7.191s and the functions gone from perf report. There might be other such fault injection callsites in hotpaths of other subsystems but I didn't search for them at this point. [1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccbd7f@xxxxxxx/ [2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/ Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx> --- Vlastimil Babka (4): fault-inject: add support for static keys around fault injection sites error-injection: support static keys around injectable functions mm, slab: add static key for should_failslab() mm, page_alloc: add static key for should_fail_alloc_page() include/asm-generic/error-injection.h | 13 ++++++++++- include/asm-generic/vmlinux.lds.h | 2 +- include/linux/error-injection.h | 9 +++++--- include/linux/fault-inject.h | 7 +++++- kernel/fail_function.c | 22 +++++++++++++++--- lib/error-inject.c | 6 ++++- lib/fault-inject.c | 43 ++++++++++++++++++++++++++++++++++- mm/fail_page_alloc.c | 3 ++- mm/failslab.c | 2 +- mm/internal.h | 2 ++ mm/page_alloc.c | 11 ++++++--- mm/slab.h | 3 +++ mm/slub.c | 10 +++++--- 13 files changed, 114 insertions(+), 19 deletions(-) --- base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0 change-id: 20240530-fault-injection-statickeys-66b7222e91b7 Best regards, -- Vlastimil Babka <vbabka@xxxxxxx>